[GitHub] [arrow] cboettig opened a new issue, #34145: [Python] Specifying schema does not prevent arrow from reading metadata on every single parquet?

via GitHub Sat, 11 Feb 2023 09:49:09 -0800


cboettig opened a new issue, #34145:
URL: https://github.com/apache/arrow/issues/34145


   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Consider the following reprex, in which we open a partitioned parquet 
dataset on a remote S3 bucket:
   
   ```python
   import pyarrow.dataset as ds
   from pyarrow import fs
   from timebudget import timebudget
   
   @timebudget
   def without_schema():
     s3 = fs.S3FileSystem(endpoint_override = "data.ecoforecast.org", anonymous 
= True)
     df = ds.dataset(
         "neon4cast-forecasts/parquet/terrestrial_30min",
         format="parquet",
         filesystem=s3
     )
     return(df)
   
   
   
   df = without_schema()
   schema = df.schema
   ```
   
   This takes a whooping 102 seconds on my machine.  I believe most of the 
computation is associated with checking the metadata found in each parquet 
file, since there are many individual partitions in this data.  This process 
should not be necessary though if we provide the schema:
   
   ```python
   @timebudget
   def with_schema():
     s3 = fs.S3FileSystem(endpoint_override = "data.ecoforecast.org", anonymous 
= True)
     df2 = ds.dataset(
         "neon4cast-forecasts/parquet/terrestrial_30min",
         format="parquet",
         filesystem=s3,
         schema = schema
     )
     return(df2)
   
   with_schema()
   ```
   
   But observe the execution time is once again 102 seconds.  Note that if we 
manually specify a single partition, the process takes only 2.7 seconds: 
   
   ```python
   @timebudget
   def single():
     s3 = fs.S3FileSystem(endpoint_override = "data.ecoforecast.org", anonymous 
= True)
     df = ds.dataset(
         
"neon4cast-forecasts/parquet/terrestrial_30min/model_id=climatology/reference_datetime=2022-12-01
 00:00:00/date=2022-12-02",
         format="parquet",
         filesystem=s3
     )
     return(df)
   
   single()
   ```
   
   I would have expected similar performance between these two cases: 
`ds.dataset()` should just be establishing a connection with the schema and not 
reading any data.  After all part of the promise of partitioned data is that we 
could open the dataset at the root of the parquet db and rely on filtering 
operations to extract a specific subset without the code ever touching all the 
other parquet files.
   
   My guess here is that arrow is trying to read all the parquet file metadata 
to ensure they all match the schema, even though that is not the expected 
behavior.  I think this is the same issue as seen in R 
https://github.com/apache/arrow/issues/33312.  But maybe I'm not doing 
something correct or miss-understand the expected behavior?  
   
   
   
   
   ### Component(s)
   
   Parquet, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] cboettig opened a new issue, #34145: [Python] Specifying schema does not prevent arrow from reading metadata on every single parquet?

Reply via email to