alamb opened a new issue, #12510: URL: https://github.com/apache/datafusion/issues/12510
### Is your feature request related to a problem or challenge?

In the [ClickBench benchmark queries, there are two datasets we use](https://github.com/ClickHouse/ClickBench?tab=readme-ov-file#data-loading): a "single file" `hits.parquet` and a "partitioned" dataset that has 100 files in a directory. They hold the same data.

However, DataFusion resolves `hits.parquet` such that columns like `URL` are `Utf8` or `Utf8View`, while the same columns in the partitioned dataset are resolved as `Binary` or `BinaryView`.

This has caused some small slowdowns while enabling StringView by default; see https://github.com/apache/datafusion/issues/12509

You can see the schema resolution by:

```shell
cd benchmarks
# download hits.parquet
./bench.sh data clickbench_1
# download hits_partitioned
./bench.sh data clickbench_partitioned
```

Then run `datafusion-cli`:

```shell
cd data
# hits.parquet has Utf8 columns
datafusion-cli -c 'describe "hits.parquet"' | grep Utf8
| Title | Utf8 | NO |
| URL | Utf8 | NO |
| Referer | Utf8 | NO |
...
| UTMContent | Utf8 | NO |
| UTMTerm | Utf8 | NO |
| FromTag | Utf8 | NO |

# hits_partitioned has Binary type for the same columns
datafusion-cli -c 'describe "hits_partitioned"' | grep Binary
| Title | Binary | YES |
| URL | Binary | YES |
| Referer | Binary | YES |
...
| UTMContent | Binary | YES |
| UTMTerm | Binary | YES |
| FromTag | Binary | YES |
```

It seems that, for some reason, the individual files are all resolved to `Binary`:

```
datafusion-cli -c 'describe "hits_partitioned/hits_99.parquet"' | grep Binary
| Title | Binary | YES |
| URL | Binary | YES |
| Referer | Binary | YES |
| FlashMinor2 | Binary | YES |
| UserAgentMinor | Binary | YES |
...

datafusion-cli -c 'describe "hits_partitioned/hits_60.parquet"' | grep Binary
| Title | Binary | YES |
| URL | Binary | YES |
| Referer | Binary | YES |
| FlashMinor2 | Binary | YES |
| UserAgentMinor | Binary | YES |
...
```

### Describe the solution you'd like

Ideally, the ClickBench queries would resolve to the same schema for both datasets: in this case `Utf8`, given the contents of the files and that the queries treat these columns as strings.

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
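
To make the mismatch described above easier to spot than by eyeballing two `grep` outputs, the two `describe` results can be compared column by column. The following is only a minimal sketch: the helper name `schema_mismatches` and the file names `single.txt` and `partitioned.txt` are hypothetical, and it assumes the `describe` output uses the `| name | type | nullable |` row layout shown in this issue, with a header row whose first cell is `column_name`. The sample rows are copied from the output above.

```shell
# Hypothetical helper: list columns whose type differs between two saved
# `describe` outputs. Assumes rows shaped like: | URL | Utf8 | NO |
schema_mismatches() {
  awk -F'|' '
    NF >= 4 {
      name = $2; type = $3
      # Trim surrounding whitespace from the name and type cells.
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      gsub(/^[ \t]+|[ \t]+$/, "", type)
      # Skip empty cells and the (assumed) header row.
      if (name == "" || name == "column_name") next
      # First file: remember each column type.
      if (FILENAME == ARGV[1]) { first[name] = type; next }
      # Second file: report columns present in both files with different types.
      if ((name in first) && first[name] != type)
        print name ": " first[name] " vs " type
    }
  ' "$1" "$2"
}

# Sample rows taken from the output in this issue.
tmp=$(mktemp -d)
cat > "$tmp/single.txt" <<'EOF'
| Title | Utf8 | NO |
| URL | Utf8 | NO |
| Referer | Utf8 | NO |
EOF
cat > "$tmp/partitioned.txt" <<'EOF'
| Title | Binary | YES |
| URL | Binary | YES |
| Referer | Binary | YES |
EOF

schema_mismatches "$tmp/single.txt" "$tmp/partitioned.txt"
# Title: Utf8 vs Binary
# URL: Utf8 vs Binary
# Referer: Utf8 vs Binary
```

Against the real datasets, the two input files could be produced by redirecting the `datafusion-cli -c 'describe ...'` commands shown earlier into files, without the `grep` filter.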
