mrocklin commented on issue #38389:
URL: https://github.com/apache/arrow/issues/38389#issuecomment-1779430327

   Things are still very much up in the air.  There's a lot of sensitivity to 
file size, machine type, etc.
   
   Clearly there is room for improvement with the Arrow filesystem.  I think that 
Dask developers are generally hoping that, with some configuration changes, we 
can get the Arrow filesystem to be fast enough to be our default.  For example, I 
suspect that s3fs sometimes wins because ...
   
   -  We can change the chunk size (S3 seems to work best with 16 MB chunks)
   -  We can super-saturate threads (S3 seems to work well with 2-3 threads per 
core); I suspect that Arrow might be limiting concurrency more than is ideal
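   As a rough illustration of those two knobs (the function names here are 
illustrative, not real Dask or Arrow configuration keys), here is a minimal 
sketch of splitting a read into 16 MB ranges and sizing a thread pool at 2-3 
threads per core:

   ```python
   import os

   CHUNK_SIZE = 16 * 1024 * 1024  # 16 MiB, the chunk size that seems to work best for S3


   def byte_ranges(file_size: int, chunk_size: int = CHUNK_SIZE):
       """Split a file into (offset, length) pairs for concurrent ranged GETs."""
       return [
           (offset, min(chunk_size, file_size - offset))
           for offset in range(0, file_size, chunk_size)
       ]


   def read_thread_count(threads_per_core: int = 2) -> int:
       """Super-saturate: 2-3 I/O threads per core, since S3 reads mostly wait."""
       return (os.cpu_count() or 1) * threads_per_core


   # A 100 MiB file splits into six full 16 MiB chunks plus a 4 MiB remainder.
   ranges = byte_ranges(100 * 1024 * 1024)
   ```

   The same idea applies whichever layer issues the requests; the open question 
is how to expose these settings through Arrow's configuration.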
   
   However, s3fs is slow for other reasons, like Python and the GIL (we can't 
scale it up on larger workers as easily).  Arrow seems to have more potential, 
but is currently harder for us to optimize.
   
   If it makes you feel better, I can easily produce a similar notebook that 
shows Arrow doing way better 🙂 
   
   There's more conversation going on here: 
https://github.com/coiled/benchmarks/issues/1125


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
