zhuqi-lucas opened a new issue, #21755:
URL: https://github.com/apache/datafusion/issues/21755

   ### Describe the bug
   
   When a hive-partitioned listing table has files in the root directory (not 
inside any `partition_col=value/` path), queries that reference partition 
columns fail with:
   
   ```
   Arrow error: Schema error: Unable to get field named "year_month". 
   Valid fields: ["col1", "col2", ...]
   ```
   
   This happens because `try_into_partitioned_file` in 
`datafusion/catalog-listing/src/helpers.rs` keeps root-level files and gives 
them empty `partition_values` (via `parsed.into_iter().flatten()`). When the 
query engine later tries to resolve partition column values for these files, 
there is no value to read and the lookup fails.
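
To illustrate the mechanism described above, here is a minimal standalone sketch of how `Option<Vec<_>>::into_iter().flatten()` erases the difference between "no partition match" and "matched, with zero values". The function name `partition_values_lossy` is hypothetical and is not the actual DataFusion code; it only mirrors the iterator chain quoted in the report.

```rust
/// Hypothetical stand-in for the conversion described in the issue.
/// `None` means "this path is not inside the partition layout".
fn partition_values_lossy(parsed: Option<Vec<&str>>) -> Vec<String> {
    // `Option::into_iter()` yields zero or one `Vec`, and `flatten()` then
    // yields that `Vec`'s items. A `None` input therefore becomes an empty
    // list instead of being reported as "not a partitioned file".
    parsed.into_iter().flatten().map(String::from).collect()
}
```

Called with `None`, the function returns an empty vector, which is indistinguishable from a file in a table with no partition columns. That is the lossy step the report points at.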
   
   ### To Reproduce
   
   1. Create a hive-partitioned external table pointing to a directory that 
contains both:
      - Root-level files: `s3://bucket/table/data.parquet`
      - Partitioned files: `s3://bucket/table/year_month=2024-01/data.parquet`
   
   2. Query with partition column reference:
   ```sql
   SELECT year_month, COUNT(*) FROM table GROUP BY year_month
   ```
   
   3. Error: `Unable to get field named "year_month"`
   
   This is a common scenario when a table transitions from non-partitioned to 
hive-partitioned storage: the original root file may still exist alongside the 
new partition directories.
   
   ### Expected behavior
   
   Files outside the partition structure should be skipped (with a debug log), 
since hive partition values are never null and there is no valid value to 
assign.
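
A rough sketch of the proposed behavior, under the assumption that each listed file already carries the `Option` result of partition parsing. The `Listed` struct and `keep_partitioned` function are hypothetical names for illustration, and `eprintln!` stands in for a debug-level log call; the real fix in DataFusion may look different.

```rust
/// Hypothetical listing entry: a file path plus the result of partition parsing.
struct Listed {
    path: String,
    /// `None` when the path is not inside any `partition_col=value/` directory.
    parsed: Option<Vec<String>>,
}

/// Keep only files that sit inside the partition structure.
fn keep_partitioned(files: Vec<Listed>) -> Vec<(String, Vec<String>)> {
    files
        .into_iter()
        .filter_map(|file| match file.parsed {
            Some(values) => Some((file.path, values)),
            None => {
                // Stand-in for a debug log; the file is dropped from the scan.
                eprintln!("skipping file outside partition structure: {}", file.path);
                None
            }
        })
        .collect()
}
```

Using `filter_map` rather than `flatten` keeps the `None` case visible, so the root-level file is dropped instead of being passed on with empty partition values.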
   
   ### Additional context
   
   - `parse_partitions_for_path` already returns `None` for non-partition 
files, but the caller (`try_into_partitioned_file`) converts `None` to empty 
`partition_values` via `.flatten()`
   - This also causes a `Cannot merge statistics with different number of 
columns` error if the root file has a different schema than the partitioned 
files
   - The root file may also produce incorrect `COUNT(*)` results 
(double-counting data)
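
For readers unfamiliar with the parsing step in the first bullet, the following is a simplified sketch of a hive-path parser that returns `None` for files outside the partition layout. It is an assumption-laden illustration: `parse_hive_partitions` is a hypothetical name, it works on plain strings rather than DataFusion's URL and path types, and it does not reproduce the real `parse_partitions_for_path` signature or edge-case handling.

```rust
/// Simplified, hypothetical hive-path parser. Returns one value per expected
/// partition column, or `None` if the path does not follow the
/// `col=value/` layout directly under `table_root`.
fn parse_hive_partitions<'a>(
    table_root: &str,
    file_path: &'a str,
    partition_cols: &[&str],
) -> Option<Vec<&'a str>> {
    // Path relative to the table root, without a leading slash.
    let relative = file_path.strip_prefix(table_root)?.trim_start_matches('/');
    let mut segments = relative.split('/');
    let mut values = Vec::with_capacity(partition_cols.len());
    for col in partition_cols {
        // Each expected column must appear, in order, as a `name=value` segment.
        let segment = segments.next()?;
        let (name, value) = segment.split_once('=')?;
        if name != *col {
            return None;
        }
        values.push(value);
    }
    Some(values)
}
```

With `partition_cols = ["year_month"]`, the path `s3://bucket/table/year_month=2024-01/data.parquet` yields `Some(["2024-01"])`, while `s3://bucket/table/data.parquet` yields `None`. A caller that wants to skip root-level files has to act on that `None` rather than flatten it away.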


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

