alamb commented on issue #15067:
URL: https://github.com/apache/datafusion/issues/15067#issuecomment-2707059055
> FWIW splitting large requests and performing them in parallel is something
> we could upstream into object_store's default get_ranges method. It already
> does the reverse.
I think @crepererum is also working on something similar ("Chunked
Requests") for us internally at InfluxDB, as various people noticed that you
can often get more bandwidth and lower latency from S3 by using
multiple concurrent requests to the same object (though of course you pay
Amazon per request, so the $$$ cost is higher).
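For illustration, the splitting step could look roughly like the sketch below. It only computes the sub-ranges; issuing them concurrently against the store is left out. The name `split_range` and the `chunk_size` parameter are hypothetical and are not part of object_store's API.

```rust
use std::ops::Range;

/// Split `range` into consecutive sub-ranges of at most `chunk_size` bytes.
///
/// Hypothetical helper for illustration only; this is NOT an object_store API.
fn split_range(range: Range<u64>, chunk_size: u64) -> Vec<Range<u64>> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let mut chunks = Vec::new();
    let mut start = range.start;
    while start < range.end {
        // saturating_add avoids overflow for ranges close to u64::MAX
        let end = start.saturating_add(chunk_size).min(range.end);
        chunks.push(start..end);
        start = end;
    }
    chunks
}
```

With an assumed 8 MiB chunk size, a 200 MiB row group would turn into 25 requests instead of one, which is where the bandwidth vs. per-request cost tradeoff comes from.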
> Edit: That being said 200MB row groups is probably a problem in and of
> itself, and might suggest an issue with the writer's configuration.
That particular file came from ClickBench, which is not necessarily the best
example of parquet files, so in general I agree smaller row groups might be
better.
To be clear, I think the problem of "the single request that is made can
not complete before the timeout is hit" is real, and unfortunately there isn't
just one possible fix. There are a bunch of potential fixes that come
with different tradeoffs 🤔
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]