[GitHub] [arrow] whyzdev commented on issue #31174: [C++] Reduce directory and file IO when reading partition parquet dataset with partition key filters

via GitHub Mon, 06 Mar 2023 17:01:39 -0800


whyzdev commented on issue #31174:
URL: https://github.com/apache/arrow/issues/31174#issuecomment-1457302230


   Looks like this is still an issue as of 11.0.0, but this issue may be closed; 
   #16972 is still open, where a filtered FileSystemDataset and caching were 
suggested/mentioned in the comments.
   Caching can already be done in Python user code, for example by monkey 
patching pyarrow dataset._filesystem_dataset. But that works at the full-dataset 
level, and it is difficult if not impossible to update incrementally in Python 
when one or a few partitions change frequently, so a full eviction is hard to 
avoid. The FileSystemDataset and its underlying objects are in C++, not Python, 
so we may need some native caching support in the Arrow API.
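
   As a rough illustration of the partition-level caching meant here, below is a 
minimal sketch in plain Python. It is not how Arrow works internally and uses no 
Arrow API: PartitionListingCache and its methods are hypothetical names, and it 
assumes hive-style key=value partition directories on a local filesystem. Each 
partition's file listing is cached separately, so a changed partition can be 
invalidated without evicting the rest.

```python
import os


class PartitionListingCache:
    """Hypothetical helper: caches file listings per partition directory."""

    def __init__(self, root):
        self.root = root
        self._cache = {}   # partition directory -> list of file paths
        self.rescans = 0   # number of directory listings actually performed

    def _partition_dir(self, key, value):
        # Assumes hive-style layout: <root>/<key>=<value>/
        return os.path.join(self.root, f"{key}={value}")

    def _list_partition(self, part_dir):
        files = self._cache.get(part_dir)
        if files is None:
            # Cache miss: list only this one partition directory.
            files = sorted(
                os.path.join(part_dir, name)
                for name in os.listdir(part_dir)
                if name.endswith(".parquet")
            )
            self._cache[part_dir] = files
            self.rescans += 1
        return files

    def files(self, key, values):
        """Return cached file paths for the partitions matching the filter."""
        out = []
        for value in values:
            part_dir = self._partition_dir(key, value)
            if os.path.isdir(part_dir):
                out.extend(self._list_partition(part_dir))
        return out

    def invalidate(self, key, value):
        """Evict a single partition; all other partitions stay cached."""
        self._cache.pop(self._partition_dir(key, value), None)
```

   The resulting list of paths could then be handed to pyarrow.dataset.dataset, 
which accepts a list of file paths, so only the filtered partitions are ever 
listed. Invalidation is explicit here; deciding when a partition has changed is 
left to the caller.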
   
   Btw, #9670 (since 4.0.0) seems to be a separate enhancement for reading 
tables, not for speeding up the loading of a FileSystemDataset.
   


