mrocklin commented on issue #38389: URL: https://github.com/apache/arrow/issues/38389#issuecomment-1779430327
Things are still very much up in the air. There's a lot of sensitivity to file size, machine type, etc. Clearly there is room for improvement with the Arrow filesystem. I think that Dask developers are generally hoping that, with some configuration changes, we can get the Arrow filesystem to be fast enough to be our default. For example, I suspect that s3fs sometimes wins because:

- We can change the chunk size (S3 seems to work best with 16 MB chunks)
- We can super-saturate threads (S3 seems to work well with 2-3 threads per core; I suspect that Arrow might be limiting concurrency more than is ideal)

However, s3fs is slow for other reasons, like Python and the GIL (we can't scale it up on larger workers as easily). Arrow seems to have more potential, but is currently harder for us to optimize.

If it makes you feel better, I can easily produce a similar notebook that shows Arrow doing way better 🙂

There's more conversation going on here: https://github.com/coiled/benchmarks/issues/1125
