jorisvandenbossche commented on pull request #7438:
URL: https://github.com/apache/arrow/pull/7438#issuecomment-644732807
I think we talked before about the difference between a "physical" schema
and a "reader" (dataset) schema.
Right now a Fragment only knows about the physical schema, while here we
need to know the dataset schema. To know this, we could 1) infer this from the
partition expression as you do here in this PR, 2) keep (optionally) a
reference to the dataset schema on the Fragment, or 3) let the user pass this
schema.
We actually already support this third option for
`Fragment.scan/to_table/to_batches()`,
which I had forgotten when opening the issue. The example I showed there,
calling `to_table` on a fragment, raises an error:
```
In [34]: fragment.to_table(filter=ds.field("part") == "A").to_pandas()
...
ArrowInvalid: Field named 'part' not found or not unique in the schema.
```
this actually works fine if you specify the dataset schema:
```
In [38]: fragment.to_table(filter=ds.field("part") == "A", schema=dataset.schema).to_pandas()
Out[38]:
   dummy part
0      1    A
1      1    A
```
So the better solution might be to do something similar for
`SplitByRowGroup`, i.e. let the user pass the dataset schema?
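
To make the physical vs. dataset schema distinction concrete, here is a toy, pure-Python sketch of the third option. It does not use the pyarrow API: `ToyFragment`, `to_rows`, and the list-of-names "schema" are hypothetical names made up for illustration, and the real scanner logic is certainly more involved. It only mirrors the behaviour shown above, where a filter on `part` fails against the physical schema and works once the dataset schema is passed.

```python
from dataclasses import dataclass


@dataclass
class ToyFragment:
    """Hypothetical stand-in for a dataset fragment (not the pyarrow class)."""

    rows: list       # rows as stored in the file: physical columns only
    partition: dict  # values implied by the partition expression

    @property
    def physical_schema(self):
        # Column names actually present in the file.
        return list(self.rows[0]) if self.rows else []

    def to_rows(self, filter_field, filter_value, schema=None):
        # Without an explicit schema, only the physical columns are known.
        schema = schema if schema is not None else self.physical_schema
        if filter_field not in schema:
            raise KeyError(f"Field named {filter_field!r} not found in the schema.")
        out = []
        for row in self.rows:
            # Fill columns missing from the file using the partition values.
            full = {name: row.get(name, self.partition.get(name)) for name in schema}
            if full[filter_field] == filter_value:
                out.append(full)
        return out


fragment = ToyFragment(rows=[{"dummy": 1}, {"dummy": 1}], partition={"part": "A"})
dataset_schema = ["dummy", "part"]

# Filtering on the partition field with only the physical schema fails.
try:
    fragment.to_rows("part", "A")
    failed = False
except KeyError:
    failed = True

# Passing the dataset schema lets the partition field be resolved.
result = fragment.to_rows("part", "A", schema=dataset_schema)
```

The point of the sketch is only that the caller supplies the wider schema, so the fragment itself does not need to store a reference to it or infer it from the partition expression.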