[GitHub] [arrow] jorisvandenbossche commented on pull request #7438: ARROW-9105: [C++][Dataset][Python] Infer partition schema from partition expression

GitBox Tue, 16 Jun 2020 05:32:15 -0700


jorisvandenbossche commented on pull request #7438:
URL: https://github.com/apache/arrow/pull/7438#issuecomment-644732807



   I think we talked before about the difference between a "physical" schema 
and a "reader" (dataset) schema. 
   Right now a Fragment only knows about its physical schema, while here we 
need the dataset schema. To get it, we could 1) infer it from the 
partition expression as you do in this PR, 2) keep (optionally) a 
reference to the dataset schema on the Fragment, or 3) let the user pass the 
schema.
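
To make the three options concrete, here is a toy sketch in plain Python. It is not pyarrow code: `ToyFragment`, `reader_schema`, and the dict-based schemas are hypothetical names made up for illustration, and the type inference in option 1 is deliberately simplistic.

```python
# Toy model only: nothing here is pyarrow's API.
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ToyFragment:
    # Columns actually stored in the file ("physical" schema).
    physical_schema: Dict[str, str]
    # Values implied by the partition expression, e.g. {"part": "A"}.
    partition_values: Dict[str, object]
    # Option 2: an optional stored reference to the dataset schema.
    dataset_schema: Optional[Dict[str, str]] = None

    def reader_schema(self, schema: Optional[Dict[str, str]] = None) -> Dict[str, str]:
        # Option 3: a schema passed by the user takes precedence.
        if schema is not None:
            return dict(schema)
        # Option 2: fall back to the reference kept on the fragment.
        if self.dataset_schema is not None:
            return dict(self.dataset_schema)
        # Option 1: infer the partition fields from the partition values.
        inferred = dict(self.physical_schema)
        for name, value in self.partition_values.items():
            inferred.setdefault(name, type(value).__name__)
        return inferred


fragment = ToyFragment({"dummy": "int64"}, {"part": "A"})
inferred_schema = fragment.reader_schema()
passed_schema = fragment.reader_schema({"dummy": "int64", "part": "string"})
```

In this sketch an explicitly passed schema wins over a stored reference, which in turn wins over inference. That ordering is an assumption of the toy model, not something decided in this PR.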
   
   We actually already support this third option for 
`Fragment.scan/to_table/to_batches()`, 
which I had forgotten when opening the issue. The example I 
showed there, calling `to_table` on a fragment, raises an error:
   
   ```
   In [34]: fragment.to_table(filter=ds.field("part") == "A").to_pandas() 
   ...
   ArrowInvalid: Field named 'part' not found or not unique in the schema.
   ```
   
   this actually works fine if you specify the dataset schema:
   
   ```
   In [38]: fragment.to_table(filter=ds.field("part") == "A", schema=dataset.schema).to_pandas()
   Out[38]: 
      dummy part
   0      1    A
   1      1    A
   ```
   
   So the better solution might be to do something similar for 
`SplitByRowGroup`, that is, let the user pass the dataset schema?
   


