[GitHub] [arrow] nealrichardson commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

GitBox Tue, 14 Jul 2020 19:26:21 -0700


nealrichardson commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658507782



   > I think the rationale is that the memory and performance savings related 
to materializing the partition columns are most significant with string data. So 
it's definitely beneficial to return them as dictionary types.
   
   Right, my understanding from Joris's last comment was that this was already 
converting strings to dictionaries, which seems like a reasonable (though not 
mandatory) choice, and that the hangup was whether it was essential to also do 
that for ints.
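
   To make the strings-versus-ints point concrete, here is a minimal sketch of dictionary encoding in plain Python. It is not how Arrow or the datasets API implements it; the helper names (`dictionary_encode`, `dictionary_decode`) and the sample `years` column are hypothetical, chosen only for illustration.

```python
def dictionary_encode(values):
    """Split a column into (dictionary of distinct values, integer indices)."""
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> its index in `dictionary`
    indices = []      # one small integer per row
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


# Hypothetical partition column: one value per row, very few distinct values.
years = ["year=2019"] * 3 + ["year=2020"] * 2
dictionary, indices = dictionary_encode(years)
```

   Each distinct string is stored once and every row carries only an integer index, which is where the saving for string partition columns comes from. For an integer partition column the indices are about as wide as the values themselves, so the saving is presumably much smaller.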
   
   I guess the other workaround if people aren't happy with the choice here is 
to set `use_legacy_dataset = True`, so I agree that it's not the end of the 
world if the choice we make about dictionaries today turns out not to be 
optimal. But we should merge this so that the datasets API becomes the default 
and we can learn where exactly we were mistaken.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
