[I] `ListingTableFactory` yields unexpected results for Hive partitioned datasets [datafusion]

via GitHub Tue, 05 Aug 2025 15:48:32 -0700


BlakeOrth opened a new issue, #17049:
URL: https://github.com/apache/datafusion/issues/17049


   ### Describe the bug
   
   When using a `ListingTableFactory` (e.g. via `datafusion-cli`), tables 
that leverage Hive partitioning do not expose the expected columns derived from 
the Hive key/value pairs, and show other related oddities.
   
   ### To Reproduce
   
   A publicly accessible S3 bucket hosting Hive partitioned parquet data can be 
used to show the issues. The general structure of the data can be seen below:
   ```console
   $ aws s3 ls s3://overturemaps-us-west-2/release/2025-07-23.0/
                              PRE theme=addresses/
                              PRE theme=base/
                              PRE theme=buildings/
                              PRE theme=divisions/
                              PRE theme=places/
                              PRE theme=transportation/
   ```
   ```console
   $ aws s3 ls s3://overturemaps-us-west-2/release/2025-07-23.0/ --recursive
   2025-07-22 14:42:50 1073537247 release/2025-07-23.0/theme=addresses/type=address/part-00000-57f746d8-98a1-4faa-94c2-4f084343ecac-c000.zstd.parquet
   2025-07-22 14:43:06 1013189887 release/2025-07-23.0/theme=addresses/type=address/part-00001-57f746d8-98a1-4faa-94c2-4f084343ecac-c000.zstd.parquet
   . . .
   ```
   
   Here are a couple of examples of odd behavior when trying to access this 
data.
   ```sql
   DataFusion CLI v49.0.0
   > SET datafusion.execution.listing_table_ignore_subdirectory to false;
   0 row(s) fetched.
   Elapsed 0.001 seconds.
   
   > CREATE EXTERNAL TABLE overture_maps
   STORED AS PARQUET LOCATION 
's3://overturemaps-us-west-2/release/2025-07-23.0/';
   0 row(s) fetched.
   Elapsed 0.645 seconds.
   
   > describe overture_maps;
   +-------------+-----------+-------------+
   | column_name | data_type | is_nullable |
   +-------------+-----------+-------------+
   +-------------+-----------+-------------+
   0 row(s) fetched.
   Elapsed 0.001 seconds.
   ```
   In the above example, no schema is detected at all, either from the Hive 
keys or from the underlying Parquet data, even though 
`listing_table_ignore_subdirectory` is set to `false` so that the listing 
table can traverse subdirectories.
   
   ```sql
   > CREATE EXTERNAL TABLE overture_maps
   STORED AS PARQUET LOCATION 
's3://overturemaps-us-west-2/release/2025-07-23.0/theme=addresses/';
   0 row(s) fetched.
   Elapsed 0.911 seconds.
   
   > select column_name, is_nullable from information_schema.columns where 
table_name='overture_maps';
   +----------------+-------------+
   | column_name    | is_nullable |
   +----------------+-------------+
   | id             | YES         |
   | geometry       | YES         |
   | bbox           | YES         |
   | country        | YES         |
   | postcode       | YES         |
   | street         | YES         |
   | number         | YES         |
   | unit           | YES         |
   | address_levels | YES         |
   | postal_city    | YES         |
   | version        | YES         |
   | sources        | YES         |
   +----------------+-------------+
   12 row(s) fetched.
   Elapsed 0.005 seconds.
   
   > select column_name, is_nullable from information_schema.columns where 
table_name='overture_maps' and column_name='type';
   +-------------+-------------+
   | column_name | is_nullable |
   +-------------+-------------+
   +-------------+-------------+
   0 row(s) fetched.
   Elapsed 0.006 seconds.
   ```
   Next, when a path to a subdirectory is given to create the table, the 
underlying Parquet schema is successfully detected; however, the columns 
corresponding to the Hive keys (here, `type`) are still missing.
   
   ### Expected behavior
   
   When accessing a Hive partitioned dataset with no explicit schema set and 
with subdirectory traversal enabled, I would expect to get a table that can 
be queried and filtered on both the columns in the Parquet schema and the 
columns derived from the Hive partitions.
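
To make the expectation concrete, here is a minimal sketch of how partition columns could be derived from one of the object paths listed above. It is an illustration only and not DataFusion's implementation; the helper name `hive_partitions` is hypothetical.

```python
def hive_partitions(path: str) -> dict[str, str]:
    """Collect key=value directory segments from an object path.

    Hypothetical helper for illustration; not part of DataFusion.
    """
    partitions = {}
    # Skip the final segment: it is the file name, not a partition directory.
    for segment in path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        # Only segments of the form key=value count as Hive partitions.
        if sep and key and value:
            partitions[key] = value
    return partitions


path = (
    "release/2025-07-23.0/theme=addresses/type=address/"
    "part-00000-57f746d8-98a1-4faa-94c2-4f084343ecac-c000.zstd.parquet"
)
print(hive_partitions(path))  # {'theme': 'addresses', 'type': 'address'}
```

Under this reading, a table created at the release root would be expected to expose `theme` and `type` as columns alongside the Parquet columns, and a table created at `theme=addresses/` would be expected to expose at least `type`.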
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

