I think I've found the issue; there's an extraneous apr_pollset_remove()
call in event.c:1978. Relevant code (double-slash comments are mine):

    /* We only signal once per N sockets with this baton */
    if (!(baton->signaled)) {
        baton->signaled = 1;
        te = event_get_timer_event(-1 /* fake timer */,
                                   baton->cbfunc,
                                   baton->user_baton,
                                   0, /* don't insert it */
                                   NULL /* no associated socket callback
*/);
        /* remove other sockets in my set */
        // says remove "other" sockets, but it really removes all
        for (i = 0; i < baton->nsock; i++) {
            rc = apr_pollset_remove(event_pollset, baton->pfds[i]);
        }

        push_timer2worker(te);
    }
    // This causes the problem; if the sockets are processed and re-added
    //  before this is called, the backend socket gets re-removed, never to
    //  be seen again (until the client sends something)
    apr_pollset_remove(event_pollset, out_pfd);

Seems like the solution is just to remove the extra apr_pollset_remove();
all sockets are removed from the pollset when one of them is signalled. If
another one was signalled in the same poll, it will enter this block and
simply do nothing (because of the baton->signaled check). I gave it a quick
test and it looked like the problem was solved.


On Tue, Jul 22, 2014 at 10:46 AM, Edward Lu <chaos...@gmail.com> wrote:

> I got it to break with these python scripts, as well as with my own test
> client and server (browser javascript and node.js). Strange though, I can't
> reproduce it with only 2 clients on either of those programs; I have to get
> 3 clients connected at once. Past that, I see exactly the behavior you
> describe.
>
> I agree that it seems to be some sort of timing issue - I've been trying
> to put trace statements into the HTTPD to see what information I can get,
> but once I do that it starts behaving as expected. Still investigating...
>
>
> On Tue, Jul 22, 2014 at 10:40 AM, Steve Zweep <steve.zw...@watchguard.com>
> wrote:
>
>> Another tidbit of information ...
>>
>> I attempted to see what was going on by running httpd in the foreground
>> with strace. When I did that, the tests worked perfectly. I guess this
>> suggests it is some sort of timing issue. Running httpd in the foreground
>> without trace exhibited the problem.
>>
>> Here's the output of strace during the test. I sent the string "hhhhhhh"
>> from client 1 and it was echoed back by the server to both clients, as
>> expected.
>>
>> [pid 15451] <... futex resumed> )       = 0
>> [pid 15451] futex(0x891ec54, FUTEX_WAKE_PRIVATE, 1) = 0
>> [pid 15451] epoll_wait(13, {{EPOLLIN, {u32=144017752, u64=144017752}}},
>> 2, 0) = 1
>> [pid 15451] read(12, "\201\10hhhhhhhh", 8000) = 10
>> [pid 15451] mprotect(0xb6a3a000, 8192, PROT_READ|PROT_WRITE) = 0
>> [pid 15451] read(12, 0xb6a39aa8, 8000)  = -1 EAGAIN (Resource temporarily
>> unavailable)
>> [pid 15451] writev(11, [{"\201\10hhhhhhhh", 10}], 1) = 10
>> [pid 15451] read(12, 0xa96084b0, 8000)  = -1 EAGAIN (Resource temporarily
>> unavailable)
>> [pid 15451] epoll_wait(13, {}, 2, 0)    = 0
>> [pid 15451] epoll_ctl(7, EPOLL_CTL_ADD, 11, {EPOLLIN, {u32=144044472,
>> u64=144044472}}) = 0
>> [pid 15451] epoll_ctl(7, EPOLL_CTL_ADD, 12, {EPOLLIN, {u32=144044496,
>> u64=144044496}}) = 0
>> [pid 15451] gettimeofday({1406038173, 745644}, NULL) = 0
>> [pid 15451] epoll_wait(10, {{EPOLLIN, {u32=2843761856, u64=2843761856}}},
>> 2, 0) = 1
>> [pid 15451] read(9, "\201\10hhhhhhhh", 8000) = 10
>> [pid 15451] read(9, 0xb6a39aa8, 8000)   = -1 EAGAIN (Resource temporarily
>> unavailable)
>> [pid 15451] writev(8, [{"\201\10hhhhhhhh", 10}], 1) = 10
>> [pid 15451] read(9, 0xa96084b0, 8000)   = -1 EAGAIN (Resource temporarily
>> unavailable)
>> [pid 15451] epoll_wait(10, {}, 2, 0)    = 0
>> [pid 15451] epoll_ctl(7, EPOLL_CTL_ADD, 8, {EPOLLIN, {u32=2843796920,
>> u64=2843796920}}) = 0
>> [pid 15451] epoll_ctl(7, EPOLL_CTL_ADD, 9, {EPOLLIN, {u32=2843796944,
>> u64=2843796944}}) = 0
>> [pid 15451] gettimeofday({1406038173, 745832}, NULL) = 0
>> [pid 15451] futex(0x891ec88, FUTEX_WAIT_PRIVATE, 38, NULL <unfinished ...>
>>
>>
>> - Steve
>>
>> -----Original Message-----
>> From: Steve Zweep
>> Sent: Monday, July 21, 2014 11:56 AM
>> To: 'dev@httpd.apache.org'
>> Subject: RE: Question about async mod_proxy_wstunnel and threads
>>
>> My client and server are based on sample code included with the ws4py
>> package (http://ws4py.readthedocs.org/en/latest/). We use CherryPy and
>> the ws4py package contains a CherryPy-based server (attached).
>>
>> The client is a variation of the echo_client.py sample code also included
>> in the ws4py package.
>>
>> In terms of failure rate, I'd say it is about 90%. Often I start 2
>> clients and initially it works OK, but then stops after a message or two.
>> Once it has failed, I can send any number of messages from the one client
>> and continue to see the server echo them back, but no message is received
>> by the second. When I then send a message from the second client I see a
>> flood of messages from the server that had been queued.
>>
>> - Steve
>>
>>
>> -----Original Message-----
>> From: Eric Covener [mailto:cove...@gmail.com]
>> Sent: Saturday, July 19, 2014 8:16 AM
>> To: Apache HTTP Server Development List
>> Subject: Re: Question about async mod_proxy_wstunnel and threads
>>
>> I guess it is kind of telling that we could still respond to the write
>> from client 2, there are probably more potential errors that make us lose
>> track of both sides simultaneously.
>>
>> Can you share your ws client and server if they're easy to use? My little
>> sample doesn't spray that way.
>>
>> I'm also curious approximately how often it fails this way.
>>
>> On Fri, Jul 18, 2014 at 5:50 PM, Yann Ylavic <ylavic....@gmail.com>
>> wrote:
>> > Probably a false trail, trunk at r1611741 is up to date, I think 1.5.x
>> > won't help.
>> >
>> > On Fri, Jul 18, 2014 at 11:35 PM, Steve Zweep
>> > <steve.zw...@watchguard.com> wrote:
>> >> Hi Yann,
>> >>
>> >> The test I ran today was built with APR from trunk (r1611741). I see
>> that r1605769 modifies apr_skiplist.c and there has been a lot of activity
>> in the  trunk version of that code lately. I can try with the 1.5.x branch
>> code to see if it makes a difference. Probably won't get to this before
>> Monday though.
>> >>
>> >> - Steve
>> >>
>> >> -----Original Message-----
>> >> From: Yann Ylavic [mailto:ylavic....@gmail.com]
>> >> Sent: Friday, July 18, 2014 4:51 PM
>> >> To: httpd
>> >> Subject: Re: Question about async mod_proxy_wstunnel and threads
>> >>
>> >> Hi Steve,
>> >>
>> >> can you still reproduce with the latest APR 1.5.x, notably containing
>> this fix: http://svn.apache.org/r1605769.
>> >> I don't think there is a released version with this patch...
>> >>
>> >> Regards,
>> >> Yann.
>> >>
>> >> On Fri, Jul 18, 2014 at 9:38 PM, Steve Zweep <
>> steve.zw...@watchguard.com> wrote:
>> >>> I've attached annotated logs that show the issues I described. Both
>> scenarios have ProxyWebsocketAsync turned on. The first does not use the
>> AsyncDelay and shows how server messages stall and are not delivered until
>> the client polls. The second has ProxyWebsocketAsyncDelay set to 100. In
>> that case, message processing works properly, but threads are held open and
>> there is no sign of async processing.
>> >>>
>> >>> Since my build and execution environment were somewhat non-standard,
>> I repeated all the tests today on stock Ubuntu 14.04, with a fresh checkout
>> and build of httpd and apr trunk code. The same results were observed.
>> >>>
>> >>> - Steve
>> >>>
>> >>>
>> >>>
>> >>> -----Original Message-----
>> >>> ________________________________________
>> >>> From: Eric Covener [cove...@gmail.com]
>> >>> Sent: July 17, 2014 9:15 PM
>> >>> To: Apache HTTP Server Development List
>> >>> Subject: Re: Question about async mod_proxy_wstunnel and threads
>> >>>
>> >>> I am having trouble seeing it mis-behave. w/ Async and AsyncDelay, I
>> >>> am seeing the expected trace messages and when I look at backtraces
>> >>> of httpd I can see zero threads in wstunnel . If I send a server
>> >>> msg, I get it ASAP in the client -- and then I see 1 thread in poll
>> >>> for the right couple of seconds
>> >>>
>> >>> Can you grab trace at e.g.
>> >>>
>> >>> LogLevel INFO proxy_wstunnel_module:trace8
>> >>>
>> >>> And annotate the timing a bit for what you do in the client?  Is it
>> possible you have an un-updated trunk from several weeks ago?  There was an
>> optimization put in and backed out that might have broke some of these same
>> things for a very short window.
>> >>>
>>
>>
>>
>> --
>> Eric Covener
>> cove...@gmail.com
>>
>
>

Reply via email to