I think there are actually two field stores where we can race: one is `subscription`, and the other is `requests`. Also, I think that the simplest thing to do is to mark the `flushToSubscriber` method `synchronized`. (The `flushInProgress` field is used to implement non-reentrancy, rather than mutual exclusion; it turns out we need both.)
Also, is there a reason why httpcore5 doesn't have an Slf4j dependency? I'd like to keep my Reactive debug logging (at DEBUG or TRACE) level, but I had to add an Slf4j just to write it in the first place. On Sun, Dec 22, 2019 at 3:21 PM Ryan Schmitt <[email protected]> wrote: > I think I figured it out. I added debug logging to ReactiveDataConsumer > and ReactiveEntityConsumer, and then I injected randomized sleeps at > various points in the code in order to try to induce races. Check out the > following: > > > https://gist.githubusercontent.com/rschmitt/51ba439f5141e6cac8b6e148e29c61fd/raw/2f4e13012451e93c90ccda29ad9f1c37c051f1ed/reactive-race.txt > > Here, the two threads (an RxJava thread and a client thread) race in such > a way that the call to #streamEnd never results in a call to > Subscriber#onComplete from either thread: > > 1. The client calls #streamEnd > 2. The client acquires the flush lock > 3. The client sees that the subscriber is `null` and decides to return > 4. RxJava subscribes > 5. RxJava sees that the lock is already held and returns > 6. The client releases the lock > > In the logs, (3) and (4) are reported in reverse order. The `subscriber` > field is volatile, so we are dealing with a race condition, not a memory > model issue. > > I'll need to give some thought to how to fix this. The obvious solution is > for ReactiveDataConsumer#subscribe to acquire (spinwait on?) the flush lock > before the store to the `subscriber` field takes place. > > On Sun, Dec 22, 2019 at 5:10 AM Oleg Kalnichevski <[email protected]> > wrote: > >> On Sun, 2019-12-22 at 10:28 +0100, Oleg Kalnichevski wrote: >> > On Sat, 2019-12-21 at 10:58 +0100, Oleg Kalnichevski wrote: >> > > >> > >> > >> > ... >> > >> > > >> > > It appears that the defect only occurs with TLS transport and >> > > entity >> > > enclosing requests. It also looks to be a race condition of some >> > > sort. >> > > It is nearly impossible to reproduce with context / wire logging >> > > on. >> > > >> > > I will continue looking into it tomorrow. >> > > >> > > Oleg >> > > >> > >> > Hi Ryan >> > >> > I can now reliably reproduce the issue with context / wire logging >> > on. >> > >> > So far it looks quite certain there is a race condition of a sort in >> > the reactive binding layer. >> > >> > All failing tests get stuck here >> > >> > >> >> https://github.com/apache/httpcomponents-client/pull/181/commits/c1b060703ac57fc4806d1249b906a385648862b5#diff-4aa21c6723799de37acda71cbe858faaR254 >> > >> > However all message exchanges clearly look completed in the context >> > log >> > and there appear to be no issues in the protocol handling layer. >> > >> > It would be great if you could take a look. I will continue debugging >> > reactive binding code in the meantime. >> > >> > Cheers >> > >> > Oleg >> > >> > >> >> Without #publisherToByteArray conversion all tests pass consistently >> without a single failure after many repetitions. >> >> I will start looking into ReactiveDataConsumer. Any help would be >> appreciated. >> >> Oleg >> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [email protected] >> For additional commands, e-mail: [email protected] >> >>
