BlakeOrth commented on PR #17232:
URL: https://github.com/apache/datafusion/pull/17232#issuecomment-3215970925

   @alamb 
   > I know this is a lot. I am sorry for the delayed feedback. Please let me 
know what you think.
   
   No worries on the delay, I can absolutely fill in the extra blanks. I just 
have a couple of quick questions on how to proceed.
   
   > 1. More tests (see below)
   
   The `sqllogic` test I added actually does work on a hive partitioned table. 
I setup a "readback" using an existing table that's created with the following 
setup:
   
https://github.com/apache/datafusion/blob/cb571d9e2e7da39da2aec5ca50283b9d1d8ffe46/datafusion/sqllogictest/test_files/insert_to_external.slt#L155-L160
   
   This results in the following directory/file structure:
   ```console
   
datafusion/sqllogictest/test_files/scratch/insert_to_external/insert_to_partitioned$
 tree .
   .
   ├── a=10
   │   ├── b=100
   │   │   └── ImzsRq5BY2218Z0v.csv
   │   └── b=200
   │       └── ImzsRq5BY2218Z0v.csv
   ├── a=20
   │   ├── b=100
   │   │   └── ImzsRq5BY2218Z0v.csv
   │   └── b=200
   │       └── ImzsRq5BY2218Z0v.csv
   └── a=30
       └── b=300
           └── YGMRSbmO08FmNfDs.csv
   
   9 directories, 5 files
   ```
   
   The existing tests prior to this PR all passed because the table was created 
with an explicit schema, so DataFusion recognized/parsed all the partition 
columns. The test I implemented for this PR starts a new table, and fails on 
`main` with the following errors:
   
   ```diff
   1. query result mismatch:
   [SQL] describe partitioned_insert_test_readback;
   [Diff] (-expected|+actual)
   -   c Int64 YES
   -   a Dictionary(UInt16, Utf8) NO
   -   b Dictionary(UInt16, Utf8) NO
   +   c Int64 YES
   at 
/home/blake/open_source_src/datafusion/datafusion/sqllogictest/test_files/insert_to_external.slt:184
   
   
   2. query failed: DataFusion error: Schema error: No field named a. Valid 
fields are partitioned_insert_test_readback.c, 
partitioned_insert_test_readback.c.
   [SQL] select * from partitioned_insert_test_readback order by a,b,c;
   at 
/home/blake/open_source_src/datafusion/datafusion/sqllogictest/test_files/insert_to_external.slt:191
   
   
   3. query failed: DataFusion error: Schema error: No field named b. Valid 
fields are partitioned_insert_test_readback.c.
   [SQL] select count(*) from partitioned_insert_test_readback where b=100;
   at 
/home/blake/open_source_src/datafusion/datafusion/sqllogictest/test_files/insert_to_external.slt:201
   ```
   
   I believe this sequence of tests successfully exercises both the creation of 
hive partitioned tables, as well as the expected outcome when reading that data.
   
   I'm happy to move this test, and the associated test setup, to a new file 
and/or augment it with the SQL commands you've laid out above. Given the above 
explanation, let me know if you'd still prefer to proceed with a new test file 
or if what's already there is sufficient.
   
   > 2. Add a [configuration 
setting](https://datafusion.apache.org/user-guide/configs.html#) to disable 
this new behavior (let people opt out of automatic partition inference). 
Perhaps datafusion.execution.infer_partitioning, similarly to 
datafusion.execution.collect_statistics
   
   I think it's prudent to note that the changes here only effect the 
`ListingTableFactory` and low level `ListingTable` users wouldn't experience 
any change. Is an `execution` option stil the right place for the configuration 
setting? `datafusion.execution.listing_table_factory_infer_partitions` with a 
default of `true`?
   
   > 3. Document the behavior (with an example) in 
https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table
   > 4. Add a note to the upgrade guide 
https://datafusion.apache.org/library-user-guide/upgrading.html
   
   No questions here, I can take care of this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to