BlakeOrth opened a new issue, #17049:
URL: https://github.com/apache/datafusion/issues/17049
### Describe the bug
When using a `ListingTableFactory` (i.e. using the `datafusion-cli`) tables
that leverage Hive partitioning do not yield the expected columns derived from
the hive key/value pairs (and other such oddities).
### To Reproduce
A publicly accessible S3 bucket hosting Hive partitioned parquet data can be
used to show the issues. The general structure of the data can be seen below:
```console
$ aws s3 ls s3://overturemaps-us-west-2/release/2025-07-23.0/
PRE theme=addresses/
PRE theme=base/
PRE theme=buildings/
PRE theme=divisions/
PRE theme=places/
PRE theme=transportation/
```
```console
$ aws s3 ls s3://overturemaps-us-west-2/release/2025-07-23.0/ --recursive
2025-07-22 14:42:50 1073537247
release/2025-07-23.0/theme=addresses/type=address/part-00000-57f746d8-98a1-4faa-94c2-4f084343ecac-c000.zstd.parquet
2025-07-22 14:43:06 1013189887
release/2025-07-23.0/theme=addresses/type=address/part-00001-57f746d8-98a1-4faa-94c2-4f084343ecac-c000.zstd.parquet
. . .
```
Here are a couple of examples off odd behavior when trying to access this
data.
```sql
DataFusion CLI v49.0.0
> SET datafusion.execution.listing_table_ignore_subdirectory to false;
0 row(s) fetched.
Elapsed 0.001 seconds.
> CREATE EXTERNAL TABLE overture_maps
STORED AS PARQUET LOCATION
's3://overturemaps-us-west-2/release/2025-07-23.0/';
0 row(s) fetched.
Elapsed 0.645 seconds.
> describe overture_maps;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
+-------------+-----------+-------------+
0 row(s) fetched.
Elapsed 0.001 seconds.
```
In the above example no schema has been detected at all, either from Hive
keys or the underlying parquet data itself, in spite of ensuring the listing
table can traverse subdirectories.
```sql
> CREATE EXTERNAL TABLE overture_maps
STORED AS PARQUET LOCATION
's3://overturemaps-us-west-2/release/2025-07-23.0/theme=addresses/';
0 row(s) fetched.
Elapsed 0.911 seconds.
> select column_name, is_nullable from information_schema.columns where
table_name='overture_maps';
+----------------+-------------+
| column_name | is_nullable |
+----------------+-------------+
| id | YES |
| geometry | YES |
| bbox | YES |
| country | YES |
| postcode | YES |
| street | YES |
| number | YES |
| unit | YES |
| address_levels | YES |
| postal_city | YES |
| version | YES |
| sources | YES |
+----------------+-------------+
12 row(s) fetched.
Elapsed 0.005 seconds.
> select column_name, is_nullable from information_schema.columns where
table_name='overture_maps' and column_name='type';
+-------------+-------------+
| column_name | is_nullable |
+-------------+-------------+
+-------------+-------------+
0 row(s) fetched.
Elapsed 0.006 seconds.
```
Next, when a path to a subdirectory is given to create the table, the
underlying parquet schema is successfully detected, however the columns
corresponding to the Hive keys are still missing.
### Expected behavior
When accessing a hive partitioned dataset with no explicit schema set, and
having subdirectory traversal enabled, I would expect to get a table that can
be queried/filtered on both the columns represented in the parquet schema as
well as the hive partitions.
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]