whyzdev commented on issue #31174:
URL: https://github.com/apache/arrow/issues/31174#issuecomment-1459413932

   @westonpace Thanks for confirming that dataset discovery happens without filtering, and for pointing out exclude_invalid_files as well as the S3 issue.
   
   Regarding exclude_invalid_files, it is actually False by default, although the docstring of dataset() says the default is True (which may need a fix?); see discovery.h and test_dataset.py. It is also not exposed as an argument in pyarrow.parquet.read_table(). A minimal sketch of the two calls is below.
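   
   For reference, this is roughly how I construct the dataset (the path is a placeholder, not my actual network-drive layout):
   
```python
import pyarrow.dataset as ds

# Default behaviour: files under the base path are assumed to be valid Parquet.
dataset = ds.dataset(
    "Z:/data/top_level",          # placeholder for the real network-drive path
    format="parquet",
    partitioning="hive",
)

# Opt-in validation: each file is opened during discovery, so it is slower.
dataset_checked = ds.dataset(
    "Z:/data/top_level",
    format="parquet",
    partitioning="hive",
    exclude_invalid_files=True,
)
```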
   
   When I called dataset() with exclude_invalid_files=True, discovery indeed became even slower than the default without the argument: roughly 12 seconds vs 7 seconds on a Windows network drive for about 730 partitions (2 years by date) with two 1-MB Parquet files in each. I don't have timing numbers for S3 yet, but I will need to collect them. My problem is that even 7 seconds is too slow for reading just a few partitions, and it may get much worse when there are more partitions or when the network drive is busy.
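   
   This is roughly how I measured the discovery time (the numbers above came from a loop like this; again the path is a placeholder):
   
```python
import time
import pyarrow.dataset as ds

base = "Z:/data/top_level"   # placeholder for the partitioned directory on the network drive

for exclude in (False, True):
    start = time.perf_counter()
    ds.dataset(base, format="parquet", partitioning="hive",
               exclude_invalid_files=exclude)
    elapsed = time.perf_counter() - start
    print(f"exclude_invalid_files={exclude}: discovery took {elapsed:.1f} s")
```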
   
   Until this can be improved in Arrow C++, my options are limited. I may have to add a partition level (year) and force reads to go through a particular year partition rather than the top-level dataset. Another option is partial caching rather than a full FileSystemDataset; I will look at ParquetDataset with a filter and _ParquetDatasetV2._dataset (a partial FileSystemDataset), thanks to [another comment in #16972](https://github.com/apache/arrow/issues/16972#issuecomment-1377578614) regarding pyarrow.dataset.parquet_dataset vs pyarrow.dataset.*. A rough sketch of the two workarounds is below.
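   
   Both sketches use placeholder paths and column names, and I still need to verify that ParquetDataset with filters behaves this way on my data:
   
```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Option 1: point discovery at a single year directory instead of the top level,
# so only that year's partitions are listed during discovery.
subset = ds.dataset(
    "Z:/data/top_level/year=2023",   # placeholder path with an added year= level
    format="parquet",
    partitioning="hive",
)

# Option 2: pass a partition filter to read_table/ParquetDataset; discovery of the
# full tree still happens, but only matching fragments should be read.
table = pq.read_table(
    "Z:/data/top_level",
    filters=[("date", ">=", "2023-01-01"), ("date", "<", "2023-02-01")],
)
```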
   

