amoeba commented on issue #39139:
URL: https://github.com/apache/arrow/issues/39139#issuecomment-1849203879

   Using that dataset, 
   
   ```r
   > dir("~/Datasets/nyc-taxi-high-volume-trip-2022/")
    [1] "fhvhv_tripdata_2022-01.parquet" "fhvhv_tripdata_2022-02.parquet" "fhvhv_tripdata_2022-03.parquet" "fhvhv_tripdata_2022-04.parquet" "fhvhv_tripdata_2022-05.parquet"
    [6] "fhvhv_tripdata_2022-06.parquet" "fhvhv_tripdata_2022-07.parquet" "fhvhv_tripdata_2022-08.parquet" "fhvhv_tripdata_2022-09.parquet" "fhvhv_tripdata_2022-10.parquet"
   [11] "fhvhv_tripdata_2022-11.parquet" "fhvhv_tripdata_2022-12.parquet"
   ```
   
   I get ~20 seconds on my machine, which is quite a bit slower than what I
   saw with the larger nyc-taxi dataset I used before. That's interesting.
   It's possible that some detail of how those Parquet files were serialized
   is affecting performance here. Either way, 20 seconds is a lot slower than
   I'd expect, so I think it would be good for us to look into that.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

