cboettig opened a new issue, #34145:
URL: https://github.com/apache/arrow/issues/34145
### Describe the bug, including details regarding any error messages,
version, and platform.
Consider the following reprex, in which we open a partitioned parquet
dataset on a remote S3 bucket:
```python
import pyarrow.dataset as ds
from pyarrow import fs
from timebudget import timebudget
@timebudget
def without_schema():
s3 = fs.S3FileSystem(endpoint_override = "data.ecoforecast.org", anonymous
= True)
df = ds.dataset(
"neon4cast-forecasts/parquet/terrestrial_30min",
format="parquet",
filesystem=s3
)
return(df)
df = without_schema()
schema = df.schema
```
This takes a whooping 102 seconds on my machine. I believe most of the
computation is associated with checking the metadata found in each parquet
file, since there are many individual partitions in this data. This process
should not be necessary though if we provide the schema:
```python
@timebudget
def with_schema():
s3 = fs.S3FileSystem(endpoint_override = "data.ecoforecast.org", anonymous
= True)
df2 = ds.dataset(
"neon4cast-forecasts/parquet/terrestrial_30min",
format="parquet",
filesystem=s3,
schema = schema
)
return(df2)
with_schema()
```
But observe the execution time is once again 102 seconds. Note that if we
manually specify a single partition, the process takes only 2.7 seconds:
```python
@timebudget
def single():
s3 = fs.S3FileSystem(endpoint_override = "data.ecoforecast.org", anonymous
= True)
df = ds.dataset(
"neon4cast-forecasts/parquet/terrestrial_30min/model_id=climatology/reference_datetime=2022-12-01
00:00:00/date=2022-12-02",
format="parquet",
filesystem=s3
)
return(df)
single()
```
I would have expected similar performance between these two cases:
`ds.dataset()` should just be establishing a connection with the schema and not
reading any data. After all part of the promise of partitioned data is that we
could open the dataset at the root of the parquet db and rely on filtering
operations to extract a specific subset without the code ever touching all the
other parquet files.
My guess here is that arrow is trying to read all the parquet file metadata
to ensure they all match the schema, even though that is not the expected
behavior. I think this is the same issue as seen in R
https://github.com/apache/arrow/issues/33312. But maybe I'm not doing
something correct or miss-understand the expected behavior?
### Component(s)
Parquet, Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]