zhuqi-lucas opened a new issue, #21755:
URL: https://github.com/apache/datafusion/issues/21755
### Describe the bug
When a hive-partitioned listing table has files in the root directory (not
inside any `partition_col=value/` path), queries that reference partition
columns fail with:
```
Arrow error: Schema error: Unable to get field named "year_month".
Valid fields: ["col1", "col2", ...]
```
This happens because `try_into_partitioned_file` in
`datafusion/catalog-listing/src/helpers.rs` includes root-level files with
empty `partition_values` (via `parsed.into_iter().flatten()`). When the query
engine later tries to resolve partition column values for these files, it fails.
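The way `None` collapses into an empty list can be reproduced in isolation. The following is a minimal sketch, not the actual DataFusion code: `collect_partition_values` is a hypothetical helper that only mirrors the `into_iter().flatten()` pattern named above.

```rust
/// Hypothetical helper mirroring the `parsed.into_iter().flatten()` pattern.
/// `parsed` is `None` when the file path does not match the partition layout.
fn collect_partition_values(parsed: Option<Vec<&str>>) -> Vec<String> {
    parsed
        .into_iter() // yields zero or one `Vec<&str>`
        .flatten()   // flattens that into individual values; `None` yields nothing
        .map(|value| value.to_string())
        .collect()
}

/// Returns true when the number of collected values matches the number of
/// partition columns declared on the table.
fn has_all_partition_values(values: &[String], partition_cols: &[&str]) -> bool {
    values.len() == partition_cols.len()
}
```

Calling `collect_partition_values(None)` returns an empty `Vec`, which is the same result as `Some(vec![])`. After the flatten, a root-level file is therefore indistinguishable from a file that legitimately has no partition columns, and the mismatch only surfaces later when `year_month` is looked up.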
### To Reproduce
1. Create a hive-partitioned external table pointing to a directory that
   contains both:
   - Root-level files: `s3://bucket/table/data.parquet`
   - Partitioned files: `s3://bucket/table/year_month=2024-01/data.parquet`
2. Query with partition column reference:
   ```sql
   SELECT year_month, COUNT(*) FROM table GROUP BY year_month
   ```
3. Error: `Unable to get field named "year_month"`
This is a common scenario when a table transitions from non-partitioned to
hive-partitioned storage — the original root file may still exist alongside the
new partition directories.
### Expected behavior
Files outside the partition structure should be skipped (with a debug log),
since hive partition values are never null and there is no valid value to
assign.
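One possible shape for this behavior is sketched below. It is an illustration under stated assumptions, not the DataFusion implementation: `parse_hive_values` is a simplified, hypothetical stand-in for the real path parser, `split_by_partition_structure` is a hypothetical name, and the skipped paths are returned to the caller where real code would emit a debug log.

```rust
/// Hypothetical, simplified partition parser. Returns `None` unless the path
/// under `table_root` has one `col=value` directory per partition column,
/// in the declared order.
fn parse_hive_values(
    table_root: &str,
    file_path: &str,
    partition_cols: &[&str],
) -> Option<Vec<String>> {
    let relative = file_path
        .strip_prefix(table_root.trim_end_matches('/'))?
        .strip_prefix('/')?;
    let mut segments: Vec<&str> = relative.split('/').collect();
    segments.pop(); // drop the file name, keep only directory segments
    if segments.len() < partition_cols.len() {
        return None; // e.g. a root-level file with no partition directories
    }
    let mut values = Vec::with_capacity(partition_cols.len());
    for (col, segment) in partition_cols.iter().zip(segments) {
        let (name, value) = segment.split_once('=')?;
        if name != *col {
            return None;
        }
        values.push(value.to_string());
    }
    Some(values)
}

/// Keeps files that carry a full set of partition values and reports the
/// rest separately instead of giving them empty values.
fn split_by_partition_structure(
    table_root: &str,
    files: &[&str],
    partition_cols: &[&str],
) -> (Vec<(String, Vec<String>)>, Vec<String>) {
    let mut kept = Vec::new();
    let mut skipped = Vec::new();
    for &file in files {
        match parse_hive_values(table_root, file, partition_cols) {
            Some(values) => kept.push((file.to_string(), values)),
            // Real code would emit a debug log here rather than collect the path.
            None => skipped.push(file.to_string()),
        }
    }
    (kept, skipped)
}
```

With the two paths from the reproduction steps and `year_month` as the only partition column, the partitioned file is kept with the value `2024-01` and `s3://bucket/table/data.parquet` ends up in the skipped list.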
### Additional context
- `parse_partitions_for_path` already returns `None` for non-partition
files, but the caller (`try_into_partitioned_file`) converts `None` to empty
`partition_values` via `.flatten()`
- This also causes `Cannot merge statistics with different number of
columns` if the root file has a different schema than partitioned files
- The root file may also cause incorrect `COUNT(*)` results (double-counting
data)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]