daddywantssugar commented on issue #36283:
URL: https://github.com/apache/arrow/issues/36283#issuecomment-1605910313
Here is a similar example where the problem is less pronounced, this time on
the more popular nyc-taxi dataset, sorted by passenger count:
```
import pandas as pd
import pyarrow.dataset as ds
import s3fs
from contexttimer import Timer  # pip install contexttimer, or use your own timer

# Doesn't actually require S3; a network share will exhibit this as well.
fs = s3fs.S3FileSystem(anon=True)
# Sorted by passenger_count to accentuate the issue.
rawpath = 'nyc-taxi-test/taxi_sorted.parquet'

filters = [
    (ds.field("passenger_count") == 10) | (ds.field("passenger_count") == 6),  # 1.5s
    ds.field("passenger_count").isin([10, 6]),  # 6s
    None,  # 58s, no-filter baseline
]

for expr in filters:  # renamed from `filter` to avoid shadowing the builtin
    with Timer() as t:
        with fs.open(rawpath, 'rb') as f:
            df = pd.read_parquet(f, filters=expr)
    print('time: ', t, 'size: ', len(df))
"""
time: 1.079 size: 63817
time: 6.445 size: 63817
time: 58.441 size: 14092413
"""
```
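If you would rather not install contexttimer, the "use your own timer" option could look something like the sketch below. It assumes the only things the repro needs are the `with Timer() as t:` form and a printable elapsed time in seconds. The class is a hypothetical stand-in written for illustration, not contexttimer's actual implementation.

```python
import time


class Timer:
    """Minimal context-manager timer (hypothetical stand-in for contexttimer.Timer)."""

    def __enter__(self):
        self._start = time.perf_counter()
        self._end = None
        return self

    def __exit__(self, exc_type, exc, tb):
        # Freeze the elapsed time when the block exits.
        self._end = time.perf_counter()
        return False  # do not swallow exceptions

    @property
    def elapsed(self):
        # While the block is still running, report time elapsed so far.
        end = self._end if self._end is not None else time.perf_counter()
        return end - self._start

    def __str__(self):
        return f"{self.elapsed:.3f}"


# Usage mirrors the loop in the repro above.
with Timer() as t:
    total = sum(range(100_000))
print('time: ', t, 'total: ', total)
```

`time.perf_counter()` is used because it is monotonic and intended for measuring short intervals, so the three filter timings stay comparable even if the wall clock is adjusted during a run.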