If I continuously read from the node that I am rebooting, the request made to that node hangs until the client times out; subsequent requests receive a "Failed to connect" error.
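For reference, the read loop I'm running is roughly the following sketch (the node address, timeout values, and key are placeholders, not my exact setup; the script echoes the commands rather than executing them, since there's no live cluster here):

```shell
#!/bin/sh
# Sketch of the continuous-read test against the node being rebooted.
# --connect-timeout bounds the TCP connect; --max-time bounds the whole
# request, so a dead/rebooting node fails fast instead of hanging the client.
NODE="http://node-a.example.com:8098"   # placeholder address
for i in 1 2 3; do
  # Echo the command instead of running it (placeholder cluster).
  echo curl -s --connect-timeout 2 --max-time 5 "$NODE/riak/test/1?r=1"
  # sleep 0.5   # enable when running against a live cluster
done
```

With the timeouts enabled, curl returns exit code 28 on a hang rather than blocking indefinitely, which makes the "hangs until the client times out" phase above much easier to observe.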
I am using curl for my tests.

Thanks,
Dan

Daniel Reverri
Developer Advocate
Basho Technologies, Inc.
[email protected]

On Mon, Nov 29, 2010 at 10:27 AM, Jay Adkisson <[email protected]> wrote:
> Hm, that's curious. Are you rebooting the physical machine? When you
> reboot one of the nodes, what happens to HTTP calls to that node? Do they
> immediately error, or do they hang indefinitely?
>
> In the meanwhile, I'll add some logging so I can see whether I'm timing
> out on the writes as well, and I'll see what happens with different keys.
>
> Thanks,
> --Jay
>
> On Mon, Nov 29, 2010 at 10:02 AM, Dan Reverri <[email protected]> wrote:
>> Hi Jay,
>>
>> I'm not able to reproduce the behavior you are seeing. Here is what I am
>> doing to try to reproduce the issue:
>> 1. Set up a 4-node cluster
>> 2. Continuously write a new object to Riak every 0.5 seconds
>> 3. Continuously read a known object (GET riak/test/1) from Riak every
>>    0.5 seconds
>> 4. Reboot one of the nodes
>>
>> The reads and writes continue working normally while the node is
>> rebooting.
>>
>> Do you see timeouts while writing objects to Riak?
>> Can you try reading other objects from Riak during the reboot (i.e.
>> different keys)?
>>
>> Thanks,
>> Dan
>>
>> Daniel Reverri
>> Developer Advocate
>> Basho Technologies, Inc.
>> [email protected]
>>
>> On Mon, Nov 29, 2010 at 9:39 AM, Jay Adkisson <[email protected]> wrote:
>>> Hey Dan/Sean,
>>>
>>> Thanks for the response. sasl-error.log on node A is completely empty,
>>> and I see this pattern in erlang.log:
>>>
>>> ===== ALIVE Tue Nov 23 12:46:57 PST 2010
>>>
>>> ===== Tue Nov 23 12:57:36 PST 2010
>>>
>>> =ERROR REPORT==== 23-Nov-2010::12:57:36 ===
>>> ** Node 'riak@<node D>' not responding **
>>> ** Removing (timedout) connection **
>>>
>>> =INFO REPORT==== 23-Nov-2010::12:58:41 ===
>>> Starting handoff of partition riak_kv_vnode
>>> 251195593916248939066258330623111144003363405824 to 'riak@<node D>'
>>>
>>> =INFO REPORT==== 23-Nov-2010::12:58:41 ===
>>> Handoff of partition riak_kv_vnode
>>> 251195593916248939066258330623111144003363405824 to 'riak@<node D>'
>>> completed: sent 1 objects in 0.02 seconds
>>>
>>> =INFO REPORT==== 23-Nov-2010::12:59:18 ===
>>> Starting handoff of partition riak_kv_vnode
>>> 707914855582156101004909840846949587645842325504 to 'riak@<node D>'
>>>
>>> =INFO REPORT==== 23-Nov-2010::12:59:18 ===
>>> Handoff of partition riak_kv_vnode
>>> 707914855582156101004909840846949587645842325504 to 'riak@<node D>'
>>> completed: sent 5 objects in 0.03 seconds
>>>
>>> =INFO REPORT==== 23-Nov-2010::12:59:20 ===
>>> Starting handoff of partition riak_kv_vnode
>>> 525227150915793236229449236757414210188850757632 to 'riak@<node D>'
>>>
>>> <handoffs, etc...>
>>>
>>> This is my testing process: I'm doing an initial load into Riak of
>>> small image files between 1 and 150K, throttled to two images per
>>> second, with W=1. In a different terminal, I'm running a wget every
>>> second against node A for one particular image I already know to be in
>>> the cluster, again with R=1. I'm using R=1 and W=1 because I figured
>>> that would reduce the chance of timing out, and with my data pattern,
>>> nothing I write to the cluster will ever change, so I really don't
>>> need to wait for a quorum.
>>>
>>> In response to Sean:
>>>
>>>> 1) Riak detects a node outage the same way any Erlang system does -
>>>> when a message fails to deliver, or the heartbeat maintained by epmd
>>>> fails. The default timeout in epmd is 1 minute, which is probably why
>>>> you're seeing it take 1 minute to be detected.
>>>
>>> Thanks, this is enlightening.
>>>
>>>> 2) If it takes too long (the vnode is overloaded, perhaps, or is just
>>>> starting up as a hint partition) to retrieve from any node, the
>>>> request can time out.
>>>
>>> That makes sense, but I still wonder why this happens even when the
>>> quorum is already met by the machines that are responding normally.
>>>
>>>> 3) You could probably configure epmd to time out sooner, but then you
>>>> become more vulnerable to temporary partitions. YMMV
>>>
>>> I may try that - it might be a good fit with my data pattern.
>>>
>>> Thanks again,
>>> --Jay
>>>
>>> On Mon, Nov 29, 2010 at 4:44 AM, David Smith <[email protected]> wrote:
>>>> On Tue, Nov 23, 2010 at 3:33 PM, Jay Adkisson <[email protected]>
>>>> wrote:
>>>> > (many profuse apologies to Dan - hit "reply" instead of "reply all")
>>>> > Alrighty, I've done a little more digging. When I throttle the
>>>> > writes heavily (2/sec) and set R and W to 1 all around, the cluster
>>>> > works just fine after I restart the node for about 15-20 seconds.
>>>> > Then the read request hangs for about a minute, until node D
>>>> > disappears from connected_nodes in riak-admin status, at which
>>>> > point it returns the desired value (although sometimes I get a 503):
>>>>
>>>> Are you seeing any error messages in log/erlang.log.* or
>>>> log/sasl-error.log?
>>>>
>>>> Can you expound on your use case a little -- are you doing a large
>>>> insert, or just a random read/write mix? Did you pre-populate the
>>>> dataset? Why are you using r=1, instead of relying on quorum for
>>>> reads?
>>>>
>>>> How are you running riak-admin status to measure the 15-20 seconds?
>>>>
>>>> Thanks.
>>>>
>>>> D.
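One practical note on point 3 above: the roughly one-minute detection window comes from the Erlang distribution heartbeat, which is controlled by the kernel parameter net_ticktime (60 seconds by default) and can be lowered via Riak's vm.args. A sketch, assuming a default Riak install layout and an example value of 10 seconds; as Sean warned, lower values make transient network partitions more likely to be treated as node failures:

```
## etc/vm.args -- shorten node-down detection from ~60s to ~10s
## (10 is an illustrative value, not a recommendation)
-kernel net_ticktime 10
```

This requires restarting the node to take effect, and the same value should be used on every node in the cluster.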
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
