dlg99 commented on issue #1088: ISSUE #1086 (@bug W-4146427@) Client-side 
backpressure in netty (Fixes: io.netty.util.internal.OutOfDirectMemoryError 
under continuous heavy load)
URL: https://github.com/apache/bookkeeper/pull/1088#issuecomment-361739328
 
 
   @sijie I'll start with p.2:
   
   Backpressure is enabled all the time. CHANNEL_WAIT_TIMEOUT_ON_WRITE applies 
only to writes; it does not affect read requests, LAC reads, etc.
   Reads have a different mechanism: they can issue speculative retries to 
other bookies.
   
   It works because the write gets blocked in PendingAddOp while it is 
submitting requests:
   ```java
               for (int i = 0; i < writeSet.size(); i++) {
                   sendWriteRequest(writeSet.get(i));
               }
   ```
   
   sendWriteRequest blocks whenever we block on netty.
   In our case the app limits the number of requests in flight.
   E.g. it can have 50 writes in flight, and in the current ensemble 2 bookies 
are able to handle this while the 3rd one is slow or going through a long GC.
   Without this change we end up submitting data to netty for all 3 bookies and 
submit more as soon as two of them ack the write. Netty in this case keeps on 
buffering data for the 3rd bookie, and eventually we were getting OODME 
(OutOfDirectMemoryError).
   With this change the request ends up blocked in sendWriteRequest to the slow 
bookie until it either succeeds or fails to submit (hence 
CHANNEL_WAIT_TIMEOUT_ON_WRITE, to limit the wait for writes specifically).
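
To illustrate the blocking behaviour described above, here is a minimal, self-contained sketch. It is not the code from this PR and does not use netty: `SlowBookieChannel`, its capacity, and the `waitTimeoutMs` parameter are hypothetical stand-ins for a channel to a slow bookie and for CHANNEL_WAIT_TIMEOUT_ON_WRITE.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a channel to one bookie; not BookKeeper or netty code. */
class SlowBookieChannel {
    // Each permit models room for one more buffered request on this channel.
    private final Semaphore freeSlots;

    SlowBookieChannel(int capacity) {
        this.freeSlots = new Semaphore(capacity);
    }

    /**
     * Blocks the caller until the channel has room or the timeout expires.
     * Returns true if the request was accepted, false if the wait timed out
     * or the thread was interrupted.
     */
    boolean sendWriteRequest(long waitTimeoutMs) {
        try {
            return freeSlots.tryAcquire(waitTimeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    /** Called when the bookie acks a write, which frees one slot. */
    void onAck() {
        freeSlots.release();
    }
}

class BlockingSendSketch {
    public static void main(String[] args) {
        // A bookie that never acks: only 2 requests fit before the sender blocks.
        SlowBookieChannel slow = new SlowBookieChannel(2);
        for (int i = 0; i < 3; i++) {
            boolean accepted = slow.sendWriteRequest(50);
            System.out.println("request " + i + " accepted: " + accepted);
        }
        // After an ack there is room again.
        slow.onAck();
        System.out.println("after ack accepted: " + slow.sendWriteRequest(50));
    }
}
```

The point of the sketch is that the sender, not the buffer, absorbs the slowness: once the slow bookie's channel is full, the submitting thread waits instead of queueing more data, and the timeout bounds that wait.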
   
   The change does not help if the app can submit an unlimited number of 
requests, I totally agree. 
   I think that should be addressed in a separate change building on top of 
this one. 
   
   There is also the server side of the backpressure story, which is not 
addressed in this change. Specifically:
   - the server has to stop accepting requests if it cannot process them fast 
enough
   - the server has to do something if it cannot send responses to the client 
fast enough (slow client case) -> either stop accepting requests, or drop 
responses, or a combination of the two
   
   p.1: I have a comparison of throughput with different HWM (high water mark) 
sizes for netty (LWM = HWM-1M).
   Without the change the test failed with a write error anywhere between 
35 min and 58 min.
   With the change I managed to run the load overnight: no OODME, no write 
errors.
   
   
![hwatermark-tests](https://user-images.githubusercontent.com/8622884/35591468-e32b49e0-05be-11e8-88cc-ae59a909e278.png)
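
For readers unfamiliar with the HWM/LWM pair mentioned above, the sketch below models the hysteresis as I understand netty's write buffer water marks to work: the channel turns unwritable once pending bytes exceed the high mark and turns writable again only after they drop below the low mark. `WaterMarkSketch` is a hypothetical class for illustration, and the 64 MB value in `main` is an assumed example, not one of the sizes from the tests.

```java
/** Hypothetical model of water mark hysteresis; not netty code. */
class WaterMarkSketch {
    private final long lowWaterMark;
    private final long highWaterMark;
    private long pendingBytes = 0;
    private boolean writable = true;

    WaterMarkSketch(long lowWaterMark, long highWaterMark) {
        if (lowWaterMark < 0 || lowWaterMark > highWaterMark) {
            throw new IllegalArgumentException("need 0 <= low <= high");
        }
        this.lowWaterMark = lowWaterMark;
        this.highWaterMark = highWaterMark;
    }

    /** Data handed to the channel but not yet flushed to the socket. */
    void enqueue(long bytes) {
        pendingBytes += bytes;
        if (pendingBytes > highWaterMark) {
            writable = false; // above HWM: senders should stop or block
        }
    }

    /** Data actually written out to the socket. */
    void flushed(long bytes) {
        pendingBytes = Math.max(0, pendingBytes - bytes);
        if (pendingBytes < lowWaterMark) {
            writable = true; // below LWM: senders may resume
        }
    }

    boolean isWritable() {
        return writable;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long hwm = 64 * mb;   // assumed example value
        long lwm = hwm - mb;  // LWM = HWM - 1M, as in the tests above
        WaterMarkSketch buffer = new WaterMarkSketch(lwm, hwm);
        buffer.enqueue(hwm + 1);
        System.out.println("writable after exceeding HWM: " + buffer.isWritable());
        buffer.flushed(mb + 2);
        System.out.println("writable after dropping below LWM: " + buffer.isWritable());
    }
}
```

With LWM only 1M below HWM the gap is narrow, so in this model a blocked sender resumes soon after the slow peer starts draining.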
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
