I looked at the spreadsheet, Matthew, thanks!  It's much more comprehensive 
than what's on the website.

We never had any RAM issues, however, during the incidents.  All of the machines 
had 35 GB or more of free RAM, with no pages swapped out.

I still just don't understand what could have caused these errors.

One of the bad nodes (from yesterday's outage) has 42,000 total .sst files 
across its 25 VNodes (about 1,700 .sst files/VNode).

How could 65,536 filehandles have been exhausted?  Wouldn't Riak have had to open 
every single .sst file, *and* do <lots of other stuff>, to hit that limit?

Is there a command I can use in the CLI that gives the number of open files?
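
In the meantime, one way to check this outside of Riak would be to count the 
entries under /proc/<pid>/fd and compare that with the number of .sst files on 
disk.  A minimal sketch, assuming the nodes run Linux; the function names, the 
PID, and the data directory in the usage comment are placeholders of mine, not 
anything Riak provides:

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Return how many file descriptors process `pid` currently has open.

    Linux-only assumption: every open descriptor shows up as one entry in
    /proc/<pid>/fd. Inspecting another user's process usually needs root.
    """
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def count_sst_files(data_dir):
    """Count the .sst files anywhere below `data_dir`."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(data_dir):
        total += sum(1 for name in filenames if name.endswith(".sst"))
    return total


def fd_headroom(open_fds, limit=65536):
    """Descriptors still available before the limit (65,536 here) is reached."""
    return limit - open_fds


# Hypothetical usage (the PID and path are placeholders, not real values):
#   open_now = count_open_fds(12345)
#   print(open_now, fd_headroom(open_now), count_sst_files("/path/to/leveldb"))
```

Sampling this every few seconds on a suspect node should show whether the 
count really climbs toward the limit before the errors start.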

And now for something only slightly different...

We have surmised that the way the out-of-files problem manifests itself is 
part of what caused our three recent cluster-wide outages.

We are using haproxy with the recommended config, so haproxy is set for 
leastconn.  What seems to have happened is that Riak continued to respond 
positively (at least a good part of the time) to haproxy's default aliveness 
check.  This caused haproxy to send all new connection requests to the bad 
nodes, once existing connections on the bad nodes completed.  Our cluster, in 
very short order, was for most intents and purposes dead.
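
To illustrate the effect, here is a toy model of leastconn balancing.  It is 
only a sketch under my own assumptions, not HAProxy's actual implementation: 
one new request arrives per tick, healthy nodes hold each connection for 
hold_ticks, bad nodes still pass the aliveness check but fail every request 
immediately, and ties go to the node that was used least recently.

```python
def simulate_leastconn(num_nodes=5, bad_nodes=(0, 1), requests=1000, hold_ticks=10):
    """Return how many requests each node received under a toy leastconn policy.

    Assumptions (mine, for illustration): one request per tick; a healthy node
    keeps a connection open for `hold_ticks`; a bad node drops it at once, so
    its active-connection count never rises above zero.
    """
    release_ticks = [[] for _ in range(num_nodes)]  # per node: when each connection ends
    last_used = [-1] * num_nodes                    # tick at which the node was last chosen
    routed = [0] * num_nodes

    for tick in range(requests):
        # Drop connections that have finished by this tick.
        for node in range(num_nodes):
            release_ticks[node] = [t for t in release_ticks[node] if t > tick]

        # leastconn: fewest active connections wins; ties go to the node
        # that has waited longest since it was last chosen.
        target = min(
            range(num_nodes),
            key=lambda n: (len(release_ticks[n]), last_used[n]),
        )
        routed[target] += 1
        last_used[target] = tick

        if target not in bad_nodes:
            release_ticks[target].append(tick + hold_ticks)
        # A bad node fails the request immediately, so nothing stays active.

    return routed


if __name__ == "__main__":
    print(simulate_leastconn())              # two bad nodes out of five
    print(simulate_leastconn(bad_nodes=()))  # all nodes healthy
```

In this model the bad nodes always look the least loaded, so they end up 
receiving most of the new connections, which matches what we saw.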

We saw in all of our apps' logs multitudes of connection (re)attempts, which we 
didn't at first attribute to Riak/HAProxy. I had to get the cluster back up 
very quickly, so I simply did a rolling restart.

This last time (yesterday), I had a little more time to investigate.  All our 
apps returned to normal immediately after the second of the two identified bad 
nodes was restarted.

--
Dave Brady

----- Original Message -----
From: "Evan Vigil-McClanahan" <[email protected]>
To: "Alexander Sicular" <[email protected]>
Cc: "Dave Brady" <[email protected]>, "riak-users" 
<[email protected]>
Sent: Monday 11 November 2013 22:48:17
Subject: Re: max_files_limit and AAE

AAE in 2.0 will have IO rate limiting to keep it from overwhelming disks.

On Mon, Nov 11, 2013 at 1:33 PM, Alexander Sicular <[email protected]> wrote:
> I'm interested to see how 2.0 fixes this. I too have been bit by the AAE 
> killing servers problem and have had to turn it off (which is thankfully the 
> easiest of the AAE config options). It's kind of the antithesis of the easy ops 
> proposition of Riak when a feature that is difficult to configure can kill 
> your entire cluster. Like not just make it slow but like make it unresponsive.
>
>
> @siculars
> http://siculars.posthaven.com
>
> Sent from my iRotaryPhone
>
>> On Nov 11, 2013, at 12:59, Dave Brady <[email protected]> wrote:
>>
>> I have turned off AAE for the time being.
>
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

