Re: Reactive integration test failures; Re: Flow control window overflow

Ryan Schmitt Sun, 22 Dec 2019 15:38:02 -0800

I think there are actually two field stores where we can race: one is
`subscription`, and the other is `requests`. Also, I think that the
simplest thing to do is to mark the `flushToSubscriber` method
`synchronized`. (The `flushInProgress` field is used to implement
non-reentrancy, rather than mutual exclusion; it turns out we need both.)


Also, is there a reason why httpcore5 doesn't have an Slf4j dependency? I'd
like to keep my Reactive debug logging (at DEBUG or TRACE) level, but I had
to add an Slf4j just to write it in the first place.

On Sun, Dec 22, 2019 at 3:21 PM Ryan Schmitt <[email protected]> wrote:

> I think I figured it out. I added debug logging to ReactiveDataConsumer
> and ReactiveEntityConsumer, and then I injected randomized sleeps at
> various points in the code in order to try to induce races. Check out the
> following:
>
>
> https://gist.githubusercontent.com/rschmitt/51ba439f5141e6cac8b6e148e29c61fd/raw/2f4e13012451e93c90ccda29ad9f1c37c051f1ed/reactive-race.txt
>
> Here, the two threads (an RxJava thread and a client thread) race in such
> a way that the call to #streamEnd never results in a call to
> Subscriber#onComplete from either thread:
>
> 1. The client calls #streamEnd
> 2. The client acquires the flush lock
> 3. The client sees that the subscriber is `null` and decides to return
> 4. RxJava subscribes
> 5. RxJava sees that the lock is already held and returns
> 6. The client releases the lock
>
> In the logs, (3) and (4) are reported in reverse order. The `subscriber`
> field is volatile, so we are dealing with a race condition, not a memory
> model issue.
>
> I'll need to give some thought to how to fix this. The obvious solution is
> for ReactiveDataConsumer#subscribe to acquire (spinwait on?) the flush lock
> before the store to the `subscriber` field takes place.
>
> On Sun, Dec 22, 2019 at 5:10 AM Oleg Kalnichevski <[email protected]>
> wrote:
>
>> On Sun, 2019-12-22 at 10:28 +0100, Oleg Kalnichevski wrote:
>> > On Sat, 2019-12-21 at 10:58 +0100, Oleg Kalnichevski wrote:
>> > >
>> >
>> >
>> > ...
>> >
>> > >
>> > > It appears that the defect only occurs with TLS transport and
>> > > entity
>> > > enclosing requests. It also looks to be a race condition of some
>> > > sort.
>> > > It is nearly impossible to reproduce with context / wire logging
>> > > on.
>> > >
>> > > I will continue looking into it tomorrow.
>> > >
>> > > Oleg
>> > >
>> >
>> > Hi Ryan
>> >
>> > I can now reliably reproduce the issue with context / wire logging
>> > on.
>> >
>> > So far it looks quite certain there is a race condition of a sort in
>> > the reactive binding layer.
>> >
>> > All failing tests get stuck here
>> >
>> >
>>
>> https://github.com/apache/httpcomponents-client/pull/181/commits/c1b060703ac57fc4806d1249b906a385648862b5#diff-4aa21c6723799de37acda71cbe858faaR254
>> >
>> > However all message exchanges clearly look completed in the context
>> > log
>> > and there appear to be no issues in the protocol handling layer.
>> >
>> > It would be great if you could take a look. I will continue debugging
>> > reactive binding code in the meantime.
>> >
>> > Cheers
>> >
>> > Oleg
>> >
>> >
>>
>> Without #publisherToByteArray conversion all tests pass consistently
>> without a single failure after many repetitions.
>>
>> I will start looking into ReactiveDataConsumer. Any help would be
>> appreciated.
>>
>> Oleg
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>

Re: Reactive integration test failures; Re: Flow control window overflow

Reply via email to