[GitHub] [arrow] lidavidm commented on pull request #9482: ARROW-11601: [C++][Python][Dataset] expose Parquet pre-buffer option

GitBox Thu, 18 Feb 2021 13:55:06 -0800


lidavidm commented on pull request #9482:
URL: https://github.com/apache/arrow/pull/9482#issuecomment-781659830



   While we wait for Ursabot, I re-ran my benchmark from above on the NYC Taxi 
dataset with the latest version of the code. The optimization here is worth 
about an order of magnitude in throughput (roughly 459 MiB/s vs. 39-40 MiB/s), 
and that's without tuning the pre-buffering settings for S3 itself. (Note that 
I'm running from within EC2, which likely skews the numbers significantly 
compared to what Ursabot will eventually report.)
   
   ```
   # pre_buffer=true
   Data read: 4419.18 MiB
   Mean     : 9.65 s
   Median   : 9.49 s
   Stdev    : 0.45 s
   Mean rate: 458.70 MiB/s
   ```
   
   I accidentally interrupted the pre_buffer=False run before it finished, but 
   these are the iterations that completed:
   ```
   Duration: 113.71 s
   Rate:     38.86 MiB/s
   Duration: 111.82 s
   Rate:     39.52 MiB/s
   Duration: 109.96 s
   Rate:     40.19 MiB/s
   Duration: 108.82 s
   Rate:     40.61 MiB/s
   Duration: 111.06 s
   Rate:     39.79 MiB/s
   Duration: 111.14 s
   Rate:     39.76 MiB/s
   Duration: 111.66 s
   Rate:     39.58 MiB/s
   Duration: 113.84 s
   Rate:     38.82 MiB/s
   Duration: 111.17 s
   Rate:     39.75 MiB/s
   ```
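
As a quick sanity check on the "order of magnitude" figure, the numbers above can be compared directly. This is only a sketch: the constant and function names are made up for illustration, the durations are copied from the partial pre_buffer=False output, and the pre_buffer=true side uses the reported mean rather than individual runs (which were not posted).

```python
from statistics import mean

# Figures copied from the benchmark output quoted above.
DATA_MIB = 4419.18            # "Data read" for the pre_buffer=true run
PRE_BUFFER_MEAN_S = 9.65      # reported mean duration with pre_buffer=true

# Completed iterations of the interrupted pre_buffer=False run.
NO_PRE_BUFFER_S = [
    113.71, 111.82, 109.96, 108.82, 111.06,
    111.14, 111.66, 113.84, 111.17,
]


def rate_mib_s(data_mib: float, seconds: float) -> float:
    """Throughput in MiB/s for reading data_mib in the given time."""
    return data_mib / seconds


# Per-iteration throughput; assumes each iteration read the same amount of data.
no_pre_buffer_rates = [rate_mib_s(DATA_MIB, s) for s in NO_PRE_BUFFER_S]

no_pre_buffer_mean_s = mean(NO_PRE_BUFFER_S)
speedup = no_pre_buffer_mean_s / PRE_BUFFER_MEAN_S

print(f"Mean without pre-buffering: {no_pre_buffer_mean_s:.2f} s")
print(f"Speedup from pre-buffering: {speedup:.1f}x")
```

The per-iteration rates recomputed this way line up with the "Rate" lines in the log, which suggests both runs read the same 4419.18 MiB.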


