Alex,

Your question is outside my knowledge space, so I had to ask around. This is the message that came back:

You can grab _ALL_ stats for a node over HTTP at HOST:PORT/stats. This gets you one big JSON blob of all the stats. The stat in question has the JSON key 'leveldb_read_block_error'; the value will be either "undefined" (if there is no leveldb backend) or an integer.
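If you want to automate the check, here is a rough, untested sketch; the host, port, and alerting logic are just illustrative assumptions:

# Rough sketch (untested): poll the node's /stats endpoint and check the
# leveldb read-error counter. The host, port, and alerting logic here are
# illustrative assumptions, not part of Riak itself.
import json
import urllib.request

STATS_URL = "http://127.0.0.1:8098/stats"  # substitute your HOST:PORT

def leveldb_read_block_errors(url=STATS_URL):
    """Return the leveldb_read_block_error count, or None when the stat
    is the string "undefined" (i.e. the node has no leveldb backend)."""
    with urllib.request.urlopen(url) as resp:
        stats = json.load(resp)
    value = stats.get("leveldb_read_block_error")
    return None if value == "undefined" else value

errors = leveldb_read_block_errors()
if errors:  # None and 0 both mean nothing to report
    print("leveldb reported %d read/corruption errors" % errors)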
Matthew

On Nov 27, 2012, at 1:37 PM, Alex Babkin <[email protected]> wrote:

> Thank you for the quick response, Matt.
>
> So you are saying that I will have facilities in Riak 1.3 to handle these
> errors in the application layer, automatically by Riak?
>
> Alex
>
>
> On Mon, Nov 26, 2012 at 2:09 PM, Matthew Von-Maszewski <[email protected]>
> wrote:
> Alex,
>
> The eleveldb backend creates a CRC for every item placed on the disk. You
> can activate the test of the CRC on every read by adding:
>
> {verify_checksums, true},
>
> to the "{eleveldb" portion of app.config. With Riak 1.2, you must manually
> monitor each vnode directory for the lost/BLOCKS.bad file changing size. It
> only grows when a read operation detects a CRC and/or compression
> corruption error.
>
> Manually monitoring the BLOCKS.bad file is tacky (my apologies). The
> upcoming 1.3 release will add a counter of the errors seen to riak-admin,
> but that code is still weeks from release.
>
> Matthew
>
> On Nov 26, 2012, at 1:25 PM, Alex Babkin <[email protected]> wrote:
>
> > Hi all
> >
> > First post here, so please be kind :)
> >
> > I plan to build an experimental Riak cluster out of cheap ARM computing
> > parts and consumer-grade SSDs, to measure performance and assess
> > production viability. I plan to use LevelDB as the backend.
> >
> > One thing to be concerned about, in light of various SSD failure
> > stories, is of course a scenario of SSD failure and, in particular, the
> > way it fails (some parts of the SSD just aren't writable anymore, but
> > are still readable, i.e. stuck at some constant value). This could
> > result in a scenario where a replicated record on two nodes, one with a
> > working SSD and one with a faulty one, ends up with different data. Will
> > Riak try to account for this scenario?
> >
> > I'm trying to think of ways to mitigate the risk of nodes failing due to
> > these SSD faults, or at least to get an early indication of a failure
> > (however insignificant it may be).
> > I guess my first question should be: does Riak provide any form of
> > checksum on the data it reads/writes, or does it blindly trust that the
> > backend/filesystem reads/writes data correctly?
> >
> > If not, are there any other tricks people use to trigger alarm bells
> > that an SSD is 'going'?
> >
> > Thanks
> > Alex
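P.S. For anyone on Riak 1.2 wanting to automate the BLOCKS.bad workaround described above, here is a rough, untested sketch. The data_root path and the per-vnode directory layout are assumptions; match them to your own app.config.

# Rough sketch (untested) of the Riak 1.2 workaround described above: flag
# growth of each vnode's lost/BLOCKS.bad file. DATA_ROOT and the directory
# layout are assumptions -- match them to the {eleveldb, [{data_root, ...}]}
# setting in your app.config. A real monitor would persist sizes between runs.
import glob
import os

DATA_ROOT = "/var/lib/riak/leveldb"  # assumed eleveldb data_root
last_sizes = {}

def grown_blocks_bad(data_root=DATA_ROOT):
    """Return (path, old_size, new_size) for BLOCKS.bad files that grew
    since the previous call; growth means reads hit CRC/compression errors."""
    grown = []
    for path in glob.glob(os.path.join(data_root, "*", "lost", "BLOCKS.bad")):
        size = os.path.getsize(path)
        old = last_sizes.get(path, 0)
        if size > old:
            grown.append((path, old, size))
        last_sizes[path] = size
    return grown

for path, old, new in grown_blocks_bad():
    print("possible SSD trouble: %s grew %d -> %d bytes" % (path, old, new))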
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
