rohangarg commented on PR #16481: URL: https://github.com/apache/druid/pull/16481#issuecomment-2134768598
> The serial upload is uploading 5 MBs one-by-one, hence takes an extremely long time. Ok, although the bandwidth comes out to be around < 1 MB/sec with this calculation. So, I'm not sure about realistic nature of the test. > Actually it's the opposite. Previously the writes to the output stream (local file) had to wait for the previous chunk write+upload to finish. Now, the write doesn't have to wait for the previous uploads to finish (there's a backpressure semantic to restrict too many writes, but that's different from the current point). Please let me know if this answers the question, happy to clarify this further! Removing backpressure from the situation, previously if you have written a chunk, you'll upload it and wait for the upload to finish to start writing another chunk - and all that will happen without any extra wait time. Now, if you've written a chunk, you'll put it in an executor which might be consumed by the parts of other writers and you'll wait in the queue for your upload to be initiated. My question was that is the new behavior expected and fine? > Can you please clarify what do you mean by "leaking it into the output stream"? Sorry, I didn't get this part. I meant that the code for semaphores directly in the output stream seems more related to the behavior of the output stream rather than a property of the uploading mechanism. > But this would work only when there's a single chunk. How about when there are 5 chunks, for example? We'd have to wait for the last chunk to be processed before we can make the decision that the total number of chunks * chunkSize is small enough to go with PUT approach (and we would have to hold on to processing the initial 4 chunks until then as well). Also, supporting the PUT flow along with the multipart approach doesn't seem ideal to me, it would be one extra thing to maintain for not much benefits. That's my opinion on this atleast. Happy to discuss it further. Yes, this would only work for a single chunk. I've never recommended to extend it any further than a single chunk. Further, a multi-part request would involve 3 API calls (which are chargeable) for small uploads as compared to a PUT call which only makes 1 API call. Also, I think the multi-part API was intended more to be used as a resilient mechanism for large file uploads. So, it feels like unnecessary to me for small files. In any case, it was a suggestion for optimization and you can feel free to ignore it if you want. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
