whyzdev commented on issue #31174: URL: https://github.com/apache/arrow/issues/31174#issuecomment-1457302230
Looks like this is still an issue as of 11.0.0, but it could probably be closed because #16972 is still open, and filtered FileSystemDataset and caching were already suggested in the comments there. Caching can already be done in Python user code, for example by monkey patching pyarrow.dataset._filesystem_dataset. But that works only at the full-dataset level: when one or a few partitions change frequently, it is difficult if not impossible to update the cached dataset incrementally in Python, so the whole cached dataset has to be evicted. The FileSystemDataset and its underlying objects are implemented in C++, not Python, so we may need native caching support in the Arrow API. Btw, #9670 (since 4.0.0) seems to be a separate enhancement for reading a table, not for speeding up the loading of a FileSystemDataset.
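To make the limitation concrete, here is a minimal sketch of whole-dataset caching in user code. It is not the monkey patch mentioned above and not an Arrow API: `DatasetCache`, `get` and `evict` are hypothetical names, and the factory is any callable that builds a dataset (for instance `pyarrow.dataset.dataset`, assuming its keyword values are hashable).

```python
from typing import Any, Callable, Dict, Hashable, Tuple


class DatasetCache:
    """Hypothetical user-level cache holding one entry per (source, options)."""

    def __init__(self, factory: Callable[..., Any]) -> None:
        # factory builds the dataset, e.g. pyarrow.dataset.dataset (assumption)
        self._factory = factory
        self._entries: Dict[Tuple[Hashable, ...], Any] = {}

    def get(self, source: str, **options: Hashable) -> Any:
        # Key on the source plus the sorted keyword options.
        key = (source, tuple(sorted(options.items())))
        if key not in self._entries:
            # Expensive step: file discovery happens inside the factory.
            self._entries[key] = self._factory(source, **options)
        return self._entries[key]

    def evict(self, source: str) -> None:
        # Full eviction: every entry for this source is dropped, even if
        # only a single partition changed on disk.
        for key in [k for k in self._entries if k[0] == source]:
            del self._entries[key]


# Possible usage with pyarrow (the path is a placeholder):
#   import pyarrow.dataset as ds
#   cache = DatasetCache(ds.dataset)
#   dataset = cache.get("/data/events", format="parquet", partitioning="hive")
```

Because the cached value is an opaque dataset object, the only invalidation available here is `evict`, which forces full rediscovery on the next `get`. That is the gap native support would need to close.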
