adriangb opened a new issue, #13270:
URL: https://github.com/apache/datafusion/issues/13270

   ### Describe the bug
   
   With CSV:
   
   ```shell
   printf 'a,b\n1,2\n' > data1.csv
   mkdir a=2
   printf 'b\n3\n' > a=2/data2.csv
   datafusion-cli
   > SELECT * FROM '**/*.csv';
   Arrow error: Csv error: incorrect number of fields for line 1, expected 2 got 1
   ```
   
   With Parquet:
   
   ```python
   import os
   import polars as pl
   
   pl.DataFrame({'a': [1], 'b': [2]}).write_parquet('data1.parquet')
   os.mkdir('a=2')
   pl.DataFrame({'b': [3]}).write_parquet('a=2/data2.parquet')
   ```
   
   ```shell
   datafusion-cli
   > SELECT * FROM '**/*.parquet';
   +---+---+
   | b | a |
   +---+---+
   | 2 | 1 |
   | 3 |   |
   +---+---+
   2 row(s) fetched.
   Elapsed 0.055 seconds.
   ```
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   Partition evolution should be handled, so that both cases return:
   
   ```
   +---+---+
   | b | a |
   +---+---+
   | 2 | 1 |
   | 3 | 2 |
   +---+---+
   ```
   
   ### Additional context
   
   Having played around quite a bit with `ParquetExec` and the `SchemaAdapter` machinery, I think what should happen is:
   - Partition values are tracked on a per-file basis, specifically on each `PartitionedFile` rather than on the `FileScanConfig`.
   - Partition values are passed into the `SchemaAdapter` machinery, which decides for each file whether it needs to add a column generated from partition values.
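
To make the proposal above concrete, here is a rough pure-Python sketch of the per-file decision. It is not DataFusion code: `partition_values` and `adapt_rows` are hypothetical helpers, rows are modeled as plain dicts, and casting the partition value to `int` is an assumption made only so the output matches the expected table.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict[str, str]:
    """Collect Hive-style `key=value` directory segments from a file path."""
    values = {}
    for segment in PurePosixPath(path).parts[:-1]:  # skip the file name itself
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


def adapt_rows(path, rows, table_columns):
    """Map one file's rows onto the table schema (hypothetical adapter step).

    For each table column: use the file's own value if present, otherwise
    fall back to that file's partition value, otherwise fill with None.
    """
    parts = partition_values(path)
    adapted = []
    for row in rows:
        out = {}
        for col in table_columns:
            if col in row:
                out[col] = row[col]
            elif col in parts:
                # Assumption: the partition column is an integer column.
                out[col] = int(parts[col])
            else:
                out[col] = None
        adapted.append(out)
    return adapted


# The two files from the reproduction, modeled as lists of row dicts.
files = {
    "data1.csv": [{"a": 1, "b": 2}],
    "a=2/data2.csv": [{"b": 3}],
}
table_columns = ["b", "a"]

result = [
    row
    for path, rows in files.items()
    for row in adapt_rows(path, rows, table_columns)
]
print(result)  # [{'b': 2, 'a': 1}, {'b': 3, 'a': 2}]
```

The point of the sketch is that the choice between "read `a` from the file" and "fill `a` from the path" is made per file, which is why the partition values need to travel with each file rather than with the scan as a whole.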


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

