whyzdev commented on issue #31174: URL: https://github.com/apache/arrow/issues/31174#issuecomment-1459413932
@westonpace Thanks for confirming that dataset discovery happens without filtering, and for pointing out `exclude_invalid_files` as well as the S3 issue.

Regarding `exclude_invalid_files`: it is actually `False` by default, although the docstring for `dataset()` says the default is `True` (which may need a fix?); see `discovery.h` and `test_dataset.py`. It is also not exposed as an argument of `pyarrow.parquet.read_table()`. When I called `dataset()` with `exclude_invalid_files=True`, discovery did become even slower than the default without the argument: roughly 12 seconds vs 7 seconds on a Windows network drive for about 730 partitions (2 years by date) with two 1-MB parquet files in each. I don't have timing numbers for S3 yet, but will need to collect them.

My problem is that even 7 seconds is too slow when reading only some partitions, and it may get much worse with more partitions or when the network drive is busy. Until this can be improved in Arrow C++, my options are limited. I may have to add a partition level (year) and read within a particular partition rather than from the top-level dataset, as sketched at the end of this comment. Another option is partial caching rather than a full `FileSystemDataset`; I will look at and try `ParquetDataset` with a filter and `_ParquetDatasetV2._dataset` (a partial `FileSystemDataset`). Thanks to [another comment in #16972](https://github.com/apache/arrow/issues/16972#issuecomment-1377578614) regarding `pyarrow.dataset.parquet_dataset` vs `pyarrow.dataset.*`.
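
For reference, this is roughly how I compared the two discovery modes. The path and hive-style `year=/date=` layout below are placeholders for my actual dataset, not the real locations:

```python
import time
import pyarrow.dataset as ds

# Hypothetical network path and hive-style year=/date= layout, for illustration only.
root = r"\\netdrive\share\mydata"

# Default discovery: exclude_invalid_files defaults to False, so files are not
# opened up front to validate them.
t0 = time.time()
data = ds.dataset(root, format="parquet", partitioning="hive")
print("default discovery:", time.time() - t0, "s")

# Opting in to validation: every discovered file is opened and checked, which adds
# another round of I/O on top of the directory listing.
t0 = time.time()
data_checked = ds.dataset(root, format="parquet", partitioning="hive",
                          exclude_invalid_files=True)
print("exclude_invalid_files=True:", time.time() - t0, "s")
```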

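And this is roughly what I plan to try next. Paths and partition names are placeholders, and option (c) assumes a `_metadata` sidecar file was written when the dataset was created:

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

root = r"\\netdrive\share\mydata"  # hypothetical network path

# (a) Point discovery at a single partition directory so only that subtree is listed.
one_year = ds.dataset(root + r"\year=2022", format="parquet", partitioning="hive")

# (b) ParquetDataset with a partition filter: fragments outside the filter are
# dropped, although the top-level tree still has to be listed once.
filtered = pq.ParquetDataset(root, filters=[("year", "=", "2022")])
table = filtered.read()

# (c) If a _metadata sidecar exists, discovery can read it instead of listing and
# probing every data file.
meta_ds = ds.parquet_dataset(root + r"\_metadata", partitioning="hive")
```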