Hi Luke, 

Yes, though that's kind of intensive. 


Thanks! 


-- 
Dave Brady 

----- Original Message -----

From: "Luke Bakken" <[email protected]> 
To: "Dave Brady" <[email protected]> 
Cc: "riak-users" <[email protected]> 
Sent: Tuesday, November 12, 2013 17:18:40 
Subject: Re: max_files_limit and AAE 


Hi Dave, 


You can use the lsof command to find files opened by the riak user. 
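For example, something like the following should give a quick count (a sketch assuming Riak runs as the user "riak" and that the Erlang VM process is named beam.smp, as is typical; adjust both to your install):

```shell
# Total open files held by the riak user (assumes the user is named "riak")
lsof -u riak 2>/dev/null | wc -l

# Alternative via /proc: file descriptors per Erlang VM process
# (assumes the Riak VM process is beam.smp and pgrep is available)
for pid in $(pgrep -u riak beam.smp); do
    echo "$pid: $(ls "/proc/$pid/fd" | wc -l) fds"
done
```

The /proc variant counts actual descriptors per process, which is what the ulimit applies to; lsof also lists memory-mapped files, so its count can run higher.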



-- 
Luke Bakken 
CSE 
[email protected] 


On Tue, Nov 12, 2013 at 8:08 AM, Dave Brady <[email protected]> wrote: 


I looked at the spreadsheet, Matthew, thanks! It's much more comprehensive than 
what's on the website. 

We never had any RAM issues during the incidents, however. All of the machines 
had 35 GB or more of free RAM, with no pages swapped out. 

I still just don't understand what could have caused these errors. 

One of the bad nodes (from yesterday's outage) has 42,000 total .sst files 
across its 25 VNodes (about 1,700 .sst files/VNode). 

How could 65,536 filehandles have been exhausted? Wouldn't Riak have had to open 
every single .sst file, *and* do <lots of other stuff>, to hit that limit? 
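Working through the numbers above (42,000 .sst files, 25 vnodes, a 65,536-fd limit) shows the headroom in question:

```shell
# Figures taken from the message above
echo $((42000 / 25))    # .sst files per vnode (about 1,700)
echo $((65536 - 42000)) # handles still free even if every .sst were open at once
```

So even with every .sst file open simultaneously, roughly 23,500 handles would remain for network sockets, AAE trees, and everything else.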

Is there a command I can use in the CLI that gives the number of open files? 

And now for something only slightly different... 

We have surmised that the way the out-of-files problem manifests itself is what 
caused our three recent cluster-wide outages. 

We are using haproxy with the recommended config, so haproxy is set for 
leastconn. What seems to have happened is that Riak continued to respond 
positively (at least a good part of the time) to haproxy's default aliveness 
check. This caused haproxy to send all new connection requests to the bad 
nodes once existing connections on those nodes completed. Our cluster, in 
very short order, was for all intents and purposes dead. 
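The failure mode described above can be sketched as a backend config (a hypothetical fragment, not our actual config; server names, addresses, and check timings are placeholders). With leastconn, a node that still passes the default TCP-connect aliveness check but drains its real connections quickly looks "least loaded" and attracts all new traffic; a stricter health check would mark it down instead:

```
backend riak_pb
    balance leastconn
    option tcp-check                              # go beyond a bare TCP connect
    server riak1 10.0.0.1:8087 check inter 2s fall 3 rise 2
    server riak2 10.0.0.2:8087 check inter 2s fall 3 rise 2
```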

We saw multitudes of connection (re)attempts in all of our apps' logs, which we 
didn't at first attribute to Riak/HAProxy. I had to get the cluster back up 
very quickly, so I simply did a rolling restart. 

This last time (yesterday), I had a little more time to investigate. All our 
apps returned to normal immediately after the second of the two identified bad 
nodes was restarted. 

-- 
Dave Brady 


_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
