Re: [I] Timeouts reading large files from object stores on slow connections [datafusion]

via GitHub Fri, 07 Mar 2025 10:01:34 -0800


alamb commented on issue #15067:
URL: https://github.com/apache/datafusion/issues/15067#issuecomment-2707059055


   > FWIW splitting large requests and performing them in parallel is something 
we could upstream into object_store's default get_ranges method. It already 
does the reverse.
   
   I think @crepererum is also working on something similar ("Chunked 
Requests") for us internally at InfluxDB. Various people noticed that you 
can often get more bandwidth and lower latency from S3 by issuing 
multiple concurrent requests to the same object (though of course you pay 
Amazon per request, so the $$$ cost is higher).
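
As a rough illustration of the splitting step only (a minimal sketch, not the object_store implementation or the internal InfluxDB one): `split_range` and `chunk_size` below are hypothetical names, and the chunk size used in the note afterwards is an arbitrary assumption. Each resulting sub-range could then be fetched with its own concurrent request.

```rust
use std::ops::Range;

/// Split a byte range into consecutive sub-ranges of at most `chunk_size` bytes.
/// Hypothetical helper for illustration; not part of object_store.
fn split_range(range: Range<u64>, chunk_size: u64) -> Vec<Range<u64>> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let mut chunks = Vec::new();
    let mut start = range.start;
    while start < range.end {
        // saturating_add avoids overflow when the range ends near u64::MAX
        let end = start.saturating_add(chunk_size).min(range.end);
        chunks.push(start..end);
        start = end;
    }
    chunks
}
```

With an assumed 8 MiB chunk size, a 200 MiB row group would become 25 smaller requests, so each one has far less data to transfer before a per-request timeout fires, at the price of more billed requests.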
   
   > Edit: That being said 200MB row groups is probably a problem in and of 
itself, and might suggest an issue with the writer's configuration.
   
   That particular file came from ClickBench, which is not necessarily the best 
example of Parquet files, so in general I agree smaller row groups might be 
better.
   
   To be clear, I think the problem of "the single request that is made cannot 
complete before the timeout is hit" is real, and unfortunately there isn't 
only one possible fix. There are a bunch of potential fixes that come 
with different tradeoffs 🤔
   


