[GitHub] [arrow] westonpace commented on issue #34145: [Python] Specifying schema does not prevent arrow from reading metadata on every single parquet?

via GitHub Wed, 15 Feb 2023 14:51:10 -0800


westonpace commented on issue #34145:
URL: https://github.com/apache/arrow/issues/34145#issuecomment-1432181304


   The discovery code is roughly:
   
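   ```
   def discover_dataset(directory, unify_schemas=False):
       # List every file under the directory, including nested directories.
       files = list_files_recursive(directory)
       if unify_schemas:
           # Read the schema of every file and merge them.
           return unify([get_schema(file) for file in files])
       # Otherwise only the first file's schema is read.
       return get_schema(files[0])
   ```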
   
   I think the problem is that all of this time is being spent in `list_files_recursive`. There may be two problems there:
   
   1. Looking at the code, if the dataset is deeply nested, we may be making too many list objects calls, because I think list objects is inherently recursive (so a separate call per nested directory would be redundant).
   2. Even if we are not making too many list objects calls, we should be able to abort early when we know we only want one file.
   
   I'm going to investigate the first problem a bit.
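
To illustrate the second problem, here is a minimal sketch of what aborting early could look like, assuming a local filesystem walked with `os.walk`. The names `iter_files_recursive` and `files_for_discovery` are hypothetical and are not Arrow APIs; the real discovery code works against a filesystem abstraction, so this only shows the shape of the idea.

```python
import os
from itertools import islice


def iter_files_recursive(directory):
    """Lazily yield file paths under `directory` (hypothetical helper)."""
    # os.walk is a generator: a subdirectory is only scanned when the
    # consumer asks for more entries, so stopping early skips that work.
    for root, _dirs, names in os.walk(directory):
        for name in sorted(names):
            yield os.path.join(root, name)


def files_for_discovery(directory, unify_schemas=False):
    """Return only the files whose schemas discovery would need to read."""
    files = iter_files_recursive(directory)
    if unify_schemas:
        # Unifying schemas needs every file, so the full listing is required.
        return list(files)
    # Only one schema is needed: take the first file and stop listing.
    return list(islice(files, 1))
```

With `unify_schemas=False` the walk stops after the first file is found, so the cost no longer grows with the total number of files in the dataset.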


