Sadly, this is not unexpected on EC2, especially if your storage is on EBS. In a BoF session at the Surge Conference last year, Brendan Gregg measured about 1.5 seconds of I/O silence to EBS on an OmniOS instance running Riak under load (using DTrace, of course). Granted, this is just one case, but I think the symptoms you see are indicative of a similar problem. Others have told me they have seen large iowait/delay even on ephemeral storage.
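If you want to check whether you're hitting the same kind of stall without reaching for DTrace, a crude probe can be enough. This is only a sketch, plain Python using the standard library; the probe path and the 500ms threshold are placeholders you would point at the same volume your Riak data lives on:

    # Crude I/O stall probe: write one byte and fsync it every 100ms, and log
    # any round trip that takes suspiciously long. PROBE_PATH is a placeholder;
    # put it on the same volume as your Riak data directory.
    import os
    import time

    PROBE_PATH = "/var/lib/riak/.iostall_probe"   # hypothetical path, adjust
    THRESHOLD_S = 0.5                              # log anything slower than 500ms

    while True:
        start = time.time()
        fd = os.open(PROBE_PATH, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.write(fd, b"x")
            os.fsync(fd)                           # force the write down to the device
        finally:
            os.close(fd)
        elapsed = time.time() - start
        if elapsed > THRESHOLD_S:
            print("%s stall: write+fsync took %.2fs" % (time.ctime(), elapsed))
        time.sleep(0.1)

If that loop ever reports a write+fsync in the one-second-plus range while your puts are timing out, it points at the same kind of EBS stall rather than anything Riak-specific.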
Now, to Riak's part. That "local put coordination" feature (sometimes called vnode_vclocks), while increasing latency in some cases, also improves convergence and reduces the appearance of spurious siblings. It also removes the need for clients to specify a client identifier; instead, the coordinating vnode's identifier is used in the vector clock. Which vnode does the coordination is somewhat randomized, so 1 out of N times you send a put request, you could get that slow node (a toy sketch of this effect is at the end of this message).

I hope that helps explain the whys... I'm not sure how I would fix your problem in general, other than the typical "get bigger instances" solution.

On Wed, Jun 26, 2013 at 7:47 AM, Andreas Hasselberg <[email protected]> wrote:
> Hi,
>
> We are having problems with the I/O on one of our Amazon instances being
> very slow for short periods of time. When this happens we get some put
> timeouts even though all the other Amazon instances seem to be fine. I
> have done some investigations here:
> https://gist.github.com/anha0825/5866757. Could anyone confirm this or
> point me in a better direction?
>
> Thanks!
> Andreas

--
Sean Cribbs <[email protected]>
Software Engineer
Basho Technologies, Inc.
http://basho.com/
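To put a rough number on that 1-out-of-N point, here is a toy sketch, plain Python with made-up latencies and a hypothetical 5-node cluster, of how a single node with a stalling volume shows up in put latencies:

    # Toy model of the "1 out of N" effect: the coordinating vnode lands on a
    # more or less random node, so if one node's storage is stalling, roughly
    # 1/N of puts inherit its latency. All numbers are made up, not measurements.
    import random

    N_NODES = 5            # hypothetical cluster size
    SLOW_NODE = 3          # the node whose volume is stalling
    FAST_MS, STALL_MS = 5, 1500

    def put_latency_ms():
        coordinator = random.randrange(N_NODES)    # ~random coordinator placement
        return STALL_MS if coordinator == SLOW_NODE else FAST_MS

    samples = [put_latency_ms() for _ in range(100000)]
    stalled = sum(1 for s in samples if s >= STALL_MS)
    print("%.1f%% of puts saw the stall (expect ~%.1f%%)"
          % (100.0 * stalled / len(samples), 100.0 / N_NODES))

With N nodes you would expect roughly 1/N of puts to be coordinated on the slow one, which matches seeing occasional timeouts while the rest of the cluster looks healthy.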
_______________________________________________ riak-users mailing list [email protected] http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
