Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)

via GitHub Tue, 28 May 2024 02:35:57 -0700


rohangarg commented on PR #16481:
URL: https://github.com/apache/druid/pull/16481#issuecomment-2134768598


   > The serial upload is uploading 5 MBs one-by-one, hence takes an extremely 
long time.
   
   Ok, although the bandwidth comes out to be around < 1 MB/sec with this 
calculation. So, I'm not sure about realistic nature of the test.
   
   > Actually it's the opposite. Previously the writes to the output stream 
(local file) had to wait for the previous chunk write+upload to finish. Now, 
the write doesn't have to wait for the previous uploads to finish (there's a 
backpressure semantic to restrict too many writes, but that's different from 
the current point). Please let me know if this answers the question, happy to 
clarify this further!
   
   Removing backpressure from the situation, previously if you have written a 
chunk, you'll upload it and wait for the upload to finish to start writing 
another chunk - and all that will happen without any extra wait time. Now, if 
you've written a chunk, you'll put it in an executor which might be consumed by 
the parts of other writers and you'll wait in the queue for your upload to be 
initiated. My question was that is the new behavior expected and fine?
   
   > Can you please clarify what do you mean by "leaking it into the output 
stream"? Sorry, I didn't get this part.
   
   I meant that the code for semaphores directly in the output stream seems 
more related to the behavior of the output stream rather than a property of the 
uploading mechanism.
   
   > But this would work only when there's a single chunk. How about when there 
are 5 chunks, for example? We'd have to wait for the last chunk to be processed 
before we can make the decision that the total number of chunks * chunkSize is 
small enough to go with PUT approach (and we would have to hold on to 
processing the initial 4 chunks until then as well). Also, supporting the PUT 
flow along with the multipart approach doesn't seem ideal to me, it would be 
one extra thing to maintain for not much benefits. That's my opinion on this 
atleast. Happy to discuss it further.
   
   Yes, this would only work for a single chunk. I've never recommended to 
extend it any further than a single chunk. Further, a multi-part request would 
involve 3 API calls (which are chargeable) for small uploads as compared to a 
PUT call which only makes 1 API call. Also, I think the multi-part API was 
intended more to be used as a resilient mechanism for large file uploads. So, 
it feels like unnecessary to me for small files. In any case, it was a 
suggestion for optimization and you can feel free to ignore it if you want.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Optimize S3 storage writing for MSQ durable storage (druid)

Reply via email to