lidavidm commented on pull request #9482: URL: https://github.com/apache/arrow/pull/9482#issuecomment-781659830
I re-ran my benchmark from above on the NYC Taxi dataset with the latest version of the code while we wait for Ursabot. The optimization here is worth an order of magnitude, and that's without tuning the pre-buffering settings to S3 itself. (Note that I'm running from within EC2, which will skew the numbers significantly compared to what Ursabot eventually reports.)

```
# pre_buffer=true
Data read: 4419.18 MiB
Mean     : 9.65 s
Median   : 9.49 s
Stdev    : 0.45 s
Mean rate: 458.70 MiB/s
```

I interrupted the pre_buffer=False run by accident before it finished, but:

```
Duration: 113.71 s Rate: 38.86 MiB/s
Duration: 111.82 s Rate: 39.52 MiB/s
Duration: 109.96 s Rate: 40.19 MiB/s
Duration: 108.82 s Rate: 40.61 MiB/s
Duration: 111.06 s Rate: 39.79 MiB/s
Duration: 111.14 s Rate: 39.76 MiB/s
Duration: 111.66 s Rate: 39.58 MiB/s
Duration: 113.84 s Rate: 38.82 MiB/s
Duration: 111.17 s Rate: 39.75 MiB/s
```
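
As a rough sanity check on the "order of magnitude" figure, the two runs can be compared directly. This is only a sketch in Python using the numbers reported in this comment; the variable and function names are illustrative and not part of the benchmark script, and because the pre_buffer=False run was interrupted, its mean covers only the nine iterations shown.

```python
from statistics import mean

# Summary figures reported for the pre_buffer=true run.
PRE_BUFFER_TRUE_MEAN_S = 9.65        # mean duration, seconds
PRE_BUFFER_TRUE_MEAN_RATE = 458.70   # mean rate, MiB/s

# Iterations of the interrupted pre_buffer=False run:
# (duration in seconds, rate in MiB/s), copied from the output above.
PRE_BUFFER_FALSE_RUNS = [
    (113.71, 38.86),
    (111.82, 39.52),
    (109.96, 40.19),
    (108.82, 40.61),
    (111.06, 39.79),
    (111.14, 39.76),
    (111.66, 39.58),
    (113.84, 38.82),
    (111.17, 39.75),
]


def summarize(runs):
    """Return (mean duration, mean rate) over the given iterations."""
    durations = [duration for duration, _ in runs]
    rates = [rate for _, rate in runs]
    return mean(durations), mean(rates)


false_mean_s, false_mean_rate = summarize(PRE_BUFFER_FALSE_RUNS)

# Speedup measured two ways: by wall-clock time and by throughput.
duration_ratio = false_mean_s / PRE_BUFFER_TRUE_MEAN_S
rate_ratio = PRE_BUFFER_TRUE_MEAN_RATE / false_mean_rate

print(f"pre_buffer=False mean: {false_mean_s:.2f} s, {false_mean_rate:.2f} MiB/s")
print(f"speedup by duration: {duration_ratio:.1f}x, by rate: {rate_ratio:.1f}x")
```

Both ratios land between 11x and 12x, which is consistent with the order-of-magnitude claim for this EC2 setup.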
