BlakeOrth commented on PR #17232: URL: https://github.com/apache/datafusion/pull/17232#issuecomment-3215970925
@alamb > I know this is a lot. I am sorry for the delayed feedback. Please let me know what you think. No worries on the delay, I can absolutely fill in the extra blanks. I just have a couple of quick questions on how to proceed. > 1. More tests (see below) The `sqllogic` test I added actually does work on a hive partitioned table. I setup a "readback" using an existing table that's created with the following setup: https://github.com/apache/datafusion/blob/cb571d9e2e7da39da2aec5ca50283b9d1d8ffe46/datafusion/sqllogictest/test_files/insert_to_external.slt#L155-L160 This results in the following directory/file structure: ```console datafusion/sqllogictest/test_files/scratch/insert_to_external/insert_to_partitioned$ tree . . ├── a=10 │ ├── b=100 │ │ └── ImzsRq5BY2218Z0v.csv │ └── b=200 │ └── ImzsRq5BY2218Z0v.csv ├── a=20 │ ├── b=100 │ │ └── ImzsRq5BY2218Z0v.csv │ └── b=200 │ └── ImzsRq5BY2218Z0v.csv └── a=30 └── b=300 └── YGMRSbmO08FmNfDs.csv 9 directories, 5 files ``` The existing tests prior to this PR all passed because the table was created with an explicit schema, so DataFusion recognized/parsed all the partition columns. The test I implemented for this PR starts a new table, and fails on `main` with the following errors: ```diff 1. query result mismatch: [SQL] describe partitioned_insert_test_readback; [Diff] (-expected|+actual) - c Int64 YES - a Dictionary(UInt16, Utf8) NO - b Dictionary(UInt16, Utf8) NO + c Int64 YES at /home/blake/open_source_src/datafusion/datafusion/sqllogictest/test_files/insert_to_external.slt:184 2. query failed: DataFusion error: Schema error: No field named a. Valid fields are partitioned_insert_test_readback.c, partitioned_insert_test_readback.c. [SQL] select * from partitioned_insert_test_readback order by a,b,c; at /home/blake/open_source_src/datafusion/datafusion/sqllogictest/test_files/insert_to_external.slt:191 3. query failed: DataFusion error: Schema error: No field named b. Valid fields are partitioned_insert_test_readback.c. [SQL] select count(*) from partitioned_insert_test_readback where b=100; at /home/blake/open_source_src/datafusion/datafusion/sqllogictest/test_files/insert_to_external.slt:201 ``` I believe this sequence of tests successfully exercises both the creation of hive partitioned tables, as well as the expected outcome when reading that data. I'm happy to move this test, and the associated test setup, to a new file and/or augment it with the SQL commands you've laid out above. Given the above explanation, let me know if you'd still prefer to proceed with a new test file or if what's already there is sufficient. > 2. Add a [configuration setting](https://datafusion.apache.org/user-guide/configs.html#) to disable this new behavior (let people opt out of automatic partition inference). Perhaps datafusion.execution.infer_partitioning, similarly to datafusion.execution.collect_statistics I think it's prudent to note that the changes here only effect the `ListingTableFactory` and low level `ListingTable` users wouldn't experience any change. Is an `execution` option stil the right place for the configuration setting? `datafusion.execution.listing_table_factory_infer_partitions` with a default of `true`? > 3. Document the behavior (with an example) in https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table > 4. Add a note to the upgrade guide https://datafusion.apache.org/library-user-guide/upgrading.html No questions here, I can take care of this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org