Hi Yann, This patch looks like it should fix the hang we've seen based on the instruction trace provided from a bug report ( http://i.imgur.com/RI3TKrU.png ) where essentially, queries_sent ( https://github.com/pagespeed/mod_pagespeed/blob/master/third_party/aprutil/apr_memcache2.c#L1414 ) is never decremented.
The only reports of the bug have had HAProxy along with memcached, so I'm not sure if something funky is going on there closing a connection in an odd way. One quick question on the parse_size function though. Based on the protocol.txt for memcached, it seems like it should also be valid if endptr[0] == ' ', endptr[1] == '\r', entptr[2] == '\n' or even if just endptr[0] == ' ', no? https://github.com/memcached/memcached/blob/master/doc/protocol.txt#L227 I'm not really quite sure if the protol description is correct. Should it in fact read VALUE <key> <flags> <bytes>[ <cas unique>]\r\n instead of VALUE <key> <flags> <bytes> [<cas unique>]\r\n And if the optional cas_unique token is present, wont this fail? Jeff On Thu, Sep 24, 2015 at 10:33 AM, Yann Ylavic <ylavic....@gmail.com> wrote: > Hi Jeffrey, > > On Wed, Sep 23, 2015 at 9:23 PM, Jeffrey Crowell <jcrow...@google.com> > wrote: > > > > We have some patches which were created in an attempt to fix some signed > > bugs in the original code that I think may be causing our issue. > > > > Namely here: > > > https://github.com/pagespeed/mod_pagespeed/blob/master/third_party/aprutil/apr_memcache2.c#L1453 > > vs here > > > https://gist.github.com/crowell/59bfa1bb9f0cda30c48a#file-multiget-c-L192 > > The original code uses an atoi(), which has undefined behavior when > called > > on non integer strings, leaving len to be 0. > > > > Second, the check here > > > https://github.com/pagespeed/mod_pagespeed/blob/master/third_party/aprutil/apr_memcache2.c#L1457 > > vs > https://gist.github.com/crowell/59bfa1bb9f0cda30c48a#file-multiget-c-L199 > > > > In the original apr code, this check will never fail. len is an > apr_size_t, > > and will never be < 0. > > The code you refer has changed with [1] (i.e. APR-1.4.3, the current > version is 1.5.2), but still does not seem to fix the invalid/unknown > length/value/type issue, which are both unexpected errors in > apr_memcache, and as such should probably terminate the connection... > > > > > The issue here now is that the check in the "server sent back a key that > i > > didn't ask for" > > > https://github.com/pagespeed/mod_pagespeed/blob/master/third_party/aprutil/apr_memcache2.c#L1507 > > (same in both forked and upstream), apr_pollset_remove, is never called, > and > > the number of queries sent is never changed. > > > > Does it seem possible that this is causing the server to spin here > > Yes, in the unexpected cases mentioned above, the socket is not > "consumed" (i.e. apr_brigade_partition() is not called), which let > some immediatly poll()able data for the next loops. > > > would apr be interested in accepting a patch to fix the signedness > issues? > > Of course! I included parse_size() in the attached patch which tries > to address the parsing/unexpected issues by aborting the connection > and returning an error. > Does it work for you (it is based on trunk, so you may need to adapt > it to your version, which preferably should be the latest...)? > > I'm not sure about the handling of empty values (the real len == 0 > case, not the parsing error). > The previous code did not expect the trailing CRLF, while this patch > does... > > Regards, > Yann. > > [1] http://svn.apache.org/r982408 >