Great, thanks for testing it. I'll commit soon. -Jonathan
On Mon, Sep 14, 2009 at 5:37 PM, Simon Smith <[email protected]> wrote:
> Jonathan:
>
> I tried out the patch you attached to JIRA-440, applied it to 0.4,
> and it works for me. Now, as soon as I take the node down, there may
> be one or two seconds of the thrift-internal error (timeout), but as
> soon as the host doing the querying can see the node is down, the
> error stops and valid output is given by the get_key_range query
> again. And there isn't any disruption when the node comes back up.
>
> Thanks! (I put this same note in the bug report.)
>
> Simon Smith
>
>
> On Fri, Sep 11, 2009 at 9:38 AM, Simon Smith <[email protected]> wrote:
>> https://issues.apache.org/jira/browse/CASSANDRA-440
>>
>> Thanks again, of course I'm happy to give any additional information
>> and will gladly do any testing of the fix.
>>
>> Simon
>>
>>
>> On Thu, Sep 10, 2009 at 7:32 PM, Jonathan Ellis <[email protected]> wrote:
>>> That confirms what I suspected, thanks.
>>>
>>> Can you file a ticket on Jira and I'll work on a fix for you to test?
>>>
>>> thanks,
>>>
>>> -Jonathan
>>>
>>> On Thu, Sep 10, 2009 at 4:42 PM, Simon Smith <[email protected]> wrote:
>>>> I sent get_key_range to node #1 (174.143.182.178), and here are the
>>>> resulting log lines from 174.143.182.178's log. (Do you want the other
>>>> nodes' log lines? Let me know if so.)
>>>>
>>>> DEBUG - get_key_range
>>>> DEBUG - reading RangeCommand(table='users', columnFamily=pwhash,
>>>> startWith='', stopAt='', maxResults=100) from [email protected]:7000
>>>> DEBUG - collecting :false:3...@1252535119
>>>> [ ... chop the repeated & identical collecting messages ... ]
>>>> DEBUG - collecting :false:3...@1252535119
>>>> DEBUG - Sending RangeReply(keys=[java, java1, java2, java3, java4,
>>>> java5, match, match1, match2, match3, match4, match5, newegg, newegg1,
>>>> newegg2, newegg3, newegg4, newegg5, now, now1, now2, now3, now4, now5,
>>>> sgs, sgs1, sgs2, sgs3, sgs4, sgs5, test, test1, test2, test3, test4,
>>>> test5, xmind, xmind1, xmind2, xmind3, xmind4, xmind5],
>>>> completed=false) to [email protected]:7000
>>>> DEBUG - Processing response on an async result from
>>>> [email protected]:7000
>>>> DEBUG - reading RangeCommand(table='users', columnFamily=pwhash,
>>>> startWith='', stopAt='', maxResults=58) from [email protected]:7000
>>>> DEBUG - Processing response on an async result from
>>>> [email protected]:7000
>>>> DEBUG - reading RangeCommand(table='users', columnFamily=pwhash,
>>>> startWith='', stopAt='', maxResults=58) from [email protected]:7000
>>>> DEBUG - Processing response on an async result from
>>>> [email protected]:7000
>>>> DEBUG - reading RangeCommand(table='users', columnFamily=pwhash,
>>>> startWith='', stopAt='', maxResults=22) from [email protected]:7000
>>>> DEBUG - Processing response on an async result from
>>>> [email protected]:7000
>>>> DEBUG - Disseminating load info ...
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Simon
>>>>
>>>> On Thu, Sep 10, 2009 at 5:25 PM, Jonathan Ellis <[email protected]> wrote:
>>>>> I think I see the problem.
>>>>>
>>>>> Can you check whether your range query is spanning multiple nodes in
>>>>> the cluster? You can tell by setting the log level to DEBUG and
>>>>> checking whether, after it logs get_key_range, it says "reading
>>>>> RangeCommand(...) from ... @machine" more than once.
>>>>>
>>>>> The bug is that when picking the node to start the range query, it
>>>>> consults the failure detector to avoid dead nodes, but if the query
>>>>> spans nodes it does not do that for the subsequent nodes.
>>>>>
>>>>> But if you are only generating one RangeCommand per get_key_range,
>>>>> then we have two bugs. :)
>>>>>
>>>>> -Jonathan
>>>>>
>>>>
>>>
>>
>
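[Editor's note: Jonathan's suggestion to set the log level to DEBUG is, in the 0.4 line, a one-line change to conf/log4j.properties. A minimal sketch follows; the stdout and R appender names are an assumption about the stock file, so keep whatever appenders your copy already lists, and remember to drop back to INFO afterwards since DEBUG is very verbose.]

    # conf/log4j.properties
    # Assumed appender names (stdout, R); reuse the ones already in your file.
    log4j.rootLogger=DEBUG,stdout,R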

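[Editor's note: for readers skimming the thread, the shape of the bug Jonathan describes is easier to see in a small sketch. This is not the 0.4 source and not the CASSANDRA-440 patch; RangeQueryRouting, FailureDetector, pickLiveEndpoint and planScan are hypothetical stand-ins. The only point being illustrated is that the liveness check has to run before every hop of a multi-node range scan, not just before the node the scan starts on.]

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch only: stand-in names, not Cassandra's real classes.
    final class RangeQueryRouting {

        // Stand-in for the real failure detector: answers "is this node up?"
        interface FailureDetector {
            boolean isAlive(String endpoint);
        }

        private final FailureDetector failureDetector;

        RangeQueryRouting(FailureDetector failureDetector) {
            this.failureDetector = failureDetector;
        }

        // Pick the endpoint a range command should be sent to.
        // The essence of the fix: this check runs for EVERY hop of a
        // multi-node range scan, not only for the first node.
        String pickLiveEndpoint(List<String> candidatesInRingOrder) {
            for (String endpoint : candidatesInRingOrder) {
                if (failureDetector.isAlive(endpoint)) {
                    return endpoint; // first live replica wins
                }
            }
            throw new IllegalStateException("no live endpoints for range");
        }

        // Walk the ring, consulting the failure detector before each hop.
        List<String> planScan(List<List<String>> candidatesPerRange) {
            List<String> plan = new ArrayList<String>();
            for (List<String> candidates : candidatesPerRange) {
                plan.add(pickLiveEndpoint(candidates)); // filter dead nodes every time
            }
            return plan;
        }
    }

With that filtering applied at every hop, a get_key_range that spans a dead node routes around it instead of timing out, which matches the behavior Simon reports after applying the patch.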