daddywantssugar commented on issue #36283:
URL: https://github.com/apache/arrow/issues/36283#issuecomment-1605910313

   Here is a similar example on the more widely used nyc-taxi dataset, sorted
by passenger count. The problem is less pronounced here, but still visible:
   
   ```
   import pandas as pd
   import pyarrow.dataset as ds
   import s3fs
   from contexttimer import Timer #pip install or use your own timer
   
   # S3 is not required; a network share exhibits this as well
   fs = s3fs.S3FileSystem(anon=True)
   # sorted by passenger_count to accentuate the issue
   rawpath = 'nyc-taxi-test/taxi_sorted.parquet'
   filters = [
       (ds.field("passenger_count") == 10)
       | (ds.field("passenger_count") == 6),  # 1.5s
       ds.field("passenger_count").isin([10, 6]), #6s
       None, # 58s  no filter baseline
   ]
   for expr in filters:  # avoid shadowing the built-in filter()
       with Timer() as t:
           with fs.open(rawpath, 'rb') as f:
               df = pd.read_parquet(f, filters=expr)
       print('time: ', t, 'size: ', len(df))
   
   """
   time:  1.079 size:  63817
   time:  6.445 size:  63817
   time:  58.441 size:  14092413
   """
   ```

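For anyone who prefers the "use your own timer" route over installing `contexttimer`, a stand-in could look roughly like the sketch below. The `Timer` class here is a hypothetical replacement written for this example, not the `contexttimer` implementation; it prints elapsed seconds with three decimals to resemble the output format above.

```python
import time


class Timer:
    """Hypothetical stand-in for contexttimer.Timer (wall-clock seconds)."""

    def __enter__(self):
        self._start = time.perf_counter()
        self._end = None
        return self

    def __exit__(self, exc_type, exc, tb):
        self._end = time.perf_counter()
        return False  # never swallow exceptions raised inside the block

    @property
    def elapsed(self):
        # While the block is still running, report the time so far.
        end = self._end if self._end is not None else time.perf_counter()
        return end - self._start

    def __str__(self):
        return f"{self.elapsed:.3f}"


# Usage mirrors the loop above: time a block, then print the timer.
with Timer() as t:
    total = sum(range(100_000))
print('time: ', t)
```

Dropping this class in place of the `contexttimer` import should leave the rest of the script unchanged, assuming only `with Timer() as t` and `print(..., t)` are used.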

