mrocklin commented on issue #38389: URL: https://github.com/apache/arrow/issues/38389#issuecomment-1779182179
I think that those files were relatively small. Unfortunately we've since replaced them with larger files. I'll re-run everything and include table size numbers as well, hopefully within the next hour.

> And some machine might have different bandwidth, this should also be taken into account

Certainly this is true. This is why I record S3 bandwidth at the beginning. That gives us a baseline. My goal isn't to get a high MiB/s number. It's to get a full pipeline that is close to S3 bandwidth. I hope that S3, rather than pyarrow.parquet, is the bottleneck.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
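
The baseline comparison described in the comment above (record raw S3 bandwidth first, then check how close the full pipeline gets to it) could be sketched roughly as follows. This is not the benchmark code from the issue. The helper names `throughput_mib_s` and `fraction_of_baseline` are hypothetical, and `raw_download` and `full_pipeline` are in-memory stand-ins used only so the sketch runs without S3 access.

```python
import time
from typing import Callable

MIB = 1024 * 1024


def throughput_mib_s(read: Callable[[], int]) -> float:
    """Time one call to `read` (which returns bytes processed) and return MiB/s."""
    start = time.perf_counter()
    nbytes = read()
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return nbytes / MIB / elapsed


def fraction_of_baseline(pipeline_mib_s: float, baseline_mib_s: float) -> float:
    """How close the full pipeline gets to the raw baseline (1.0 means equal)."""
    return pipeline_mib_s / baseline_mib_s


# Assumption: in-memory stand-ins. In a real run, `raw_download` would fetch
# the object bytes from S3 and `full_pipeline` would also decode them into a table.
payload = bytes(8 * MIB)


def raw_download() -> int:
    return len(bytearray(payload))  # copy only, no decoding


def full_pipeline() -> int:
    data = bytearray(payload)
    sum(data[::4096])  # placeholder for decode work
    return len(data)


baseline = throughput_mib_s(raw_download)
pipeline = throughput_mib_s(full_pipeline)
print(f"baseline {baseline:.0f} MiB/s, pipeline {pipeline:.0f} MiB/s, "
      f"ratio {fraction_of_baseline(pipeline, baseline):.2f}")
```

A ratio near 1.0 would suggest that the download, rather than the decode step, is the bottleneck. Because the ratio is relative to each machine's own baseline, it stays comparable across machines with different bandwidth.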
