adriangb opened a new issue, #13270:
URL: https://github.com/apache/datafusion/issues/13270
### Describe the bug
With CSV:
```shell
printf 'a,b\n1,2\n' > data1.csv
mkdir a=2
printf 'b\n3\n' > a=2/data2.csv
datafusion-cli
> SELECT * FROM '**/*.csv';
Arrow error: Csv error: incorrect number of fields for line 1, expected 2 got 1
```
With Parquet:
```python
import os
import polars as pl
pl.DataFrame({'a': [1], 'b': [2]}).write_parquet('data1.parquet')
os.mkdir('a=2')
pl.DataFrame({'b': [3]}).write_parquet('a=2/data2.parquet')
```
```shell
datafusion-cli
> SELECT * FROM '**/*.parquet';
+---+---+
| b | a |
+---+---+
| 2 | 1 |
| 3 |   |
+---+---+
2 row(s) fetched.
Elapsed 0.055 seconds.
```
### To Reproduce
_No response_
### Expected behavior
Partition evolution is handled and both cases return:
```
+---+---+
| b | a |
+---+---+
| 2 | 1 |
| 3 | 2 |
+---+---+
```
### Additional context
Having played around quite a bit with ParquetExec and the SchemaAdapter
machinery, I think what should happen is:
- Partition values are tracked on a per-file basis, on each
  `PartitionedFile` rather than on the `FileScanConfig`.
- Partition values are passed into the SchemaAdapter machinery, which decides
  for each file whether it needs to add a column generated from partition
  values.
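
To make the proposal concrete, here is a rough sketch of the per-file decision in plain Python. It is not DataFusion code: `partition_values` and `adapt_rows` are hypothetical helpers invented for illustration, and the sketch assumes hive-style `key=value` directory names and leaves out casting the partition value to the column's type.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict[str, str]:
    """Collect hive-style `key=value` directory segments from a file path."""
    values = {}
    for segment in PurePosixPath(path).parts[:-1]:  # skip the file name
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


def adapt_rows(path, file_columns, rows, table_columns):
    """Map one file's rows onto the table schema (hypothetical helper).

    A column present in the file is read from the file. A column absent from
    the file is filled from the path's partition values, or None if the path
    carries no value for it.
    """
    from_path = partition_values(path)
    adapted = []
    for row in rows:
        in_file = dict(zip(file_columns, row))
        adapted.append(tuple(
            in_file[col] if col in in_file else from_path.get(col)
            for col in table_columns
        ))
    return adapted


# The two files from the reproduction above, projected onto columns (b, a).
table_columns = ["b", "a"]
result = (
    adapt_rows("data1.parquet", ["a", "b"], [(1, 2)], table_columns)
    + adapt_rows("a=2/data2.parquet", ["b"], [(3,)], table_columns)
)
print(result)  # [(2, 1), (3, '2')]
```

The second row gets `a` from the directory name because `data2.parquet` has no `a` column, while the first row keeps the value stored in the file. The partition value comes back as the string `'2'`; a real implementation would presumably cast it to the table's column type.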