> We should work on some kind of back-pressure mechanism for the client, but I > am not sure about which kind of support we should provide at BK level
Is there an issue for this? If there isn't, then perhaps we can start that way. > And as soon as the application is notified of the result of the write > (success or failure) we are releasing the reference to the payload (as I have > shown in this email thread), > so in theory the application has full control over the retained memory and it > can apply its own memory management mechanisms Say we have a ledger for which WQ > AQ. Are you saying that if AQ bookies reply, then it is possible that the entry is not going to be written to |WQ| - |AQ| bookies because the entry data might have been reclaimed by the application? The contract as I understand it is that an entry is to be replicated |WQ| ways, even though the application is willing to receive a confirmation after |AQ| bookie responses. What am I missing? -Flavio > On 13 Jan 2021, at 11:30, Enrico Olivelli <eolive...@gmail.com> wrote: > > Flavio > > Il giorno mar 12 gen 2021 alle ore 17:26 Flavio Junqueira <f...@apache.org > <mailto:f...@apache.org>> ha scritto: > I have observed the issue that Matteo describes and I also attributed the > problem to the absence of a back pressure mechanism in the client. Issue > #2497 was not about that, though. There was some corruption going on that was > leading to the server receiving garbage. > > Correct, #2497 is not about the topic of this email, I just mentioned it > because the discussion started from that comment from Matteo. > > We should work on some kind of back-pressure mechanism for the client, but I > am not sure about which kind of support we should provide at BK level > > Regarding the writer side of this story and memory usage, > we are not performing copies of the original payload that the caller is > passing, in case of a ByteBuf > see PendingAddOp > https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L263 > > <https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L263> > and here, we simply wrap it in a ByteBufList > https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L116 > > <https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L116> > > And as soon as the application is notified of the result of the write > (success or failure) we are releasing the reference to the payload (as I have > shown in this email thread), > so in theory the application has full control over the retained memory > and it can apply its own memory management mechanisms > > > Enrico > > > -Flavio > > > On 8 Jan 2021, at 22:47, Matteo Merli <mme...@apache.org > > <mailto:mme...@apache.org>> wrote: > > > > On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli <eolive...@gmail.com > > <mailto:eolive...@gmail.com>> wrote: > >> > >> Hi Matteo, > >> in this comment you are talking about an issue you saw when WQ is greater > >> that AQ > >> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246 > >> <https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246> > >> > >> IIUC you are saying that if one bookie is slow the client continues to > >> accumulate references to the entries that still have not received the > >> confirmation from it. > >> I think that this is correct. > >> > >> Have you seen problems in production related to this scenario ? > >> Can you tell more about them ? > > > > Yes, for simplicity, assume e=3, w=3, a=2. > > > > If one bookie is slow (not down, just slow), the BK client will the > > acks to the user that the entries are written after the first 2 acks. > > In the meantime, it will keep waiting for the 3rd bookie to respond. > > If the bookie responds within the timeout, the entries can now be > > dropped from memory, otherwise the write will timeout internally and > > it will get replayed to a new bookie. > > > > In both cases, the amount of memory used in the client will max at > > "throughput" * "timeout". This can be a large amount of memory and > > easily cause OOM errors. > > > > Part of the problem is that it cannot be solved from outside the BK > > client, since there's no visibility on what entries have 2 or 3 acks > > and therefore it's not possible to apply backpressure. Instead, > > there should be a backpressure mechanism in the BK client itself to > > prevent this kind of issue. > > One possibility there could be to use the same approach as described > > in > > https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits > > <https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits>, > > giving a max memory limit per BK client instance and throttling > > everything after the quota is reached. > > > > > > Matteo >