Akshat-Jain commented on PR #16481: URL: https://github.com/apache/druid/pull/16481#issuecomment-2134545306
@rohangarg

> Thanks, I checked earlier but I couldn't understand their timings much. For instance, were you able to reason about why the serial upload is taking ~8 minutes for 450 MBs?

The serial upload uploads the 5 MB chunks one by one, hence it takes an extremely long time.

> Also, the chunk size in those queries seemed a bit de-generate to me.

I had to reduce the chunk size to the lowest possible value to be able to simulate the upload of a large number of chunks, since I can't do so otherwise with my limited system resources. It shouldn't affect the comparison/improvement trend, though, so we can still rely on this data to get an idea of the degree of savings.

> I was actually meaning that the threads which are running and are writing to the output stream currently are able to write it to S3 too on the same thread, which might not be the case with this change and the writes might have to wait. Are we solving that scenario in this change?

Actually, it's the opposite. Previously, writes to the output stream (local file) had to wait for the previous chunk's write + upload to finish. Now, a write doesn't have to wait for the previous uploads to finish (there is a backpressure semantic to restrict too many concurrent writes, but that's separate from the current point). Please let me know if this answers the question, happy to clarify further!

> have we considered supporting parallel uploads via the execution framework (maybe an operator to fan out writes) instead of doing it on a per storage basis?

No, we hadn't discussed any approach like that. cc: @cryptoe

> But I would do that in the executor related code itself instead of leaking it into the output stream.

Can you please clarify what you mean by "leaking it into the output stream"? Sorry, I didn't get this part.

> You can determine that the file is small when not even one local chunk is fully filled, and the output stream is closed. And once you've determined that a file is small, direct PUT can be used over multi-part upload.
But this would work only when there's a single chunk. What about when there are 5 chunks, for example? We'd have to wait for the last chunk to be processed before we could decide whether `total number of chunks * chunkSize` is small enough to go with the PUT approach. Also, supporting the PUT flow alongside the multipart approach doesn't seem ideal to me; it would be one extra thing to maintain for little benefit. That's my opinion on this, at least. Happy to discuss it further.
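The backpressure semantic described above can be sketched roughly as follows. This is a hypothetical illustration, not Druid's actual implementation: class and method names (`ChunkedUploadSketch`, `writeChunk`, `uploadPart`) and the pool/permit sizes are made up for the example. The key point it demonstrates is that `writeChunk` hands the chunk to an executor and returns immediately, blocking only when the number of in-flight chunks hits the semaphore cap, rather than waiting for the previous chunk's upload to finish.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of parallel chunk uploads with backpressure.
// Names and sizes are illustrative, not taken from the Druid PR.
class ChunkedUploadSketch {
    private final ExecutorService uploadPool = Executors.newFixedThreadPool(4);
    private final Semaphore inFlight = new Semaphore(8); // cap on pending chunks
    private final AtomicInteger uploadedChunks = new AtomicInteger();

    void writeChunk(byte[] chunk) throws InterruptedException {
        // Blocks ONLY when too many chunks are already in flight (backpressure);
        // it never waits for the previous chunk's upload to complete.
        inFlight.acquire();
        uploadPool.submit(() -> {
            try {
                uploadPart(chunk); // stand-in for the real S3 part upload
                uploadedChunks.incrementAndGet();
            } finally {
                inFlight.release();
            }
        });
    }

    void uploadPart(byte[] chunk) {
        // Placeholder: the real code would issue an S3 UploadPart request here.
    }

    int close() throws InterruptedException {
        uploadPool.shutdown();
        uploadPool.awaitTermination(1, TimeUnit.MINUTES);
        return uploadedChunks.get();
    }

    public static void main(String[] args) throws Exception {
        ChunkedUploadSketch stream = new ChunkedUploadSketch();
        for (int i = 0; i < 20; i++) {
            // Real chunks would be ~5 MB; tiny arrays keep the sketch light.
            stream.writeChunk(new byte[5]);
        }
        System.out.println(stream.close());
    }
}
```

Contrast with the serial flow benchmarked earlier: there, each `writeChunk` would effectively be followed by a blocking upload, so 5 MB chunks drain one at a time, which is why the 450 MB serial upload took ~8 minutes.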
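To make the PUT-vs-multipart point concrete, here is a hypothetical decision sketch (the method and its parameters are invented for illustration). It shows why the direct-PUT optimization only applies in the single-chunk case: once a second chunk has started, parts are already being uploaded, so the decision cannot wait for the last chunk.

```java
// Hypothetical sketch: a direct PUT is only possible when the stream closes
// before even one chunk has filled; with multiple chunks, a multipart upload
// is already underway by the time the total size is known.
class UploadModeSketch {
    static String chooseMode(int chunksStarted, boolean firstChunkFull) {
        if (chunksStarted <= 1 && !firstChunkFull) {
            return "PUT";       // whole object fit inside one partial chunk
        }
        return "MULTIPART";     // more data arrived; parts are already in flight
    }

    public static void main(String[] args) {
        System.out.println(chooseMode(1, false)); // small file -> PUT
        System.out.println(chooseMode(5, true));  // 5 chunks  -> MULTIPART
    }
}
```

Maintaining both branches is the extra complexity argued against above: the PUT path only ever fires for sub-chunk-sized objects, while every larger object takes the multipart path anyway.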
