Re: reversing node removal?

Luke Bakken Tue, 15 Apr 2014 10:00:25 -0700

Hi Allen -

I hope you don't take my response as an assignment of blame. I created the
docs issue specifically because this case is not clearly documented, and I
would not expect Riak users to "just know" that your situation could happen
when they use "cluster leave" to remove Riak nodes from a cluster.


Every software system makes different decisions about protecting users from
potentially disruptive actions and none protect from all possible failure
scenarios. SQL Server does not protect you from inserting so much data you
fill a disk, for instance.

I'll also follow up with the Product team to discuss providing more insight
into the outcome of various "riak-admin cluster" operations.
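
To illustrate the capacity arithmetic discussed in this thread, here is a
minimal Python sketch of a pre-leave sanity check. It is not part of Riak or
riak-admin: the function names, node names, and the 90% ceiling are all
hypothetical, and it assumes the leaving node's data is spread evenly over
the remaining nodes, which a real cluster only approximates. It also ignores
any temporary overhead during handoff.

```python
def projected_usage_after_leave(used, capacity, leaving):
    """Estimate each remaining node's disk usage fraction after a leave.

    used and capacity map node name -> size (same unit for both).
    Assumption: the leaving node's data is divided evenly among the
    remaining nodes.
    """
    remaining = [node for node in used if node != leaving]
    if not remaining:
        raise ValueError("cannot remove the only node in the cluster")
    share = used[leaving] / len(remaining)
    return {node: (used[node] + share) / capacity[node] for node in remaining}


def leave_looks_safe(used, capacity, leaving, max_fraction=0.9):
    """Return True if no remaining node is projected above max_fraction.

    max_fraction is an arbitrary illustrative ceiling, not a Riak setting.
    """
    projected = projected_usage_after_leave(used, capacity, leaving)
    return all(fraction <= max_fraction for fraction in projected.values())


if __name__ == "__main__":
    # Hypothetical four-node cluster, 100 GB per node, 80 GB used on each.
    nodes = ["riak1", "riak2", "riak3", "riak4"]
    capacity = {node: 100 for node in nodes}
    used = {node: 80 for node in nodes}
    print(projected_usage_after_leave(used, capacity, "riak1"))
    print(leave_looks_safe(used, capacity, "riak1"))
```

Under the even-spread assumption, four nodes that are each 80% full would
each have to grow past 100% to absorb a departing node, so a check like this
would flag the leave before it is committed. The per-node figures would have
to come from your monitoring system, since Riak does not track them itself.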

--
Luke Bakken
CSE
[email protected]


On Tue, Apr 15, 2014 at 9:44 AM, Allen Landsidel
<[email protected]> wrote:

> I realize I made a mistake, it would just be nice if the UI could warn me
> that I was about to do so, especially given the consequences.
>
> If it was simply showing me how much space each node was using (forget a
> percentage or anything) that would've been enough to avert disaster. With
> four nodes, if they're over 25% capacity (far lower than any sensible
> warning level in a monitoring system), the cluster leave is going to fail.
>  The more nodes you add to the system, the lower you'd have to set that
> warning threshold to alert you that you're in a state where you can't
> safely retire a node.
>
>
> On 4/15/2014 12:40, Luke Bakken wrote:
>
>> Hi Allen -
>>
>> Failure / node leave situations should be taken into account during
>> cluster capacity planning. I've created an issue to more thoroughly
>> explain this in our documentation:
>>
>> https://github.com/basho/basho_docs/issues/1034
>>
>> --
>> Luke Bakken
>> CSE
>> [email protected]
>>
>>
>>
>> On Tue, Apr 15, 2014 at 9:28 AM, Allen Landsidel
>> <[email protected]> wrote:
>>
>>     Luke,
>>
>>     I already do use nagios for that, but the disk space was fine before
>>     I told one of the nodes to leave the cluster.  That's my problem --
>>     there was not enough free space in the cluster for it to move all
>>     that node's data.  It accepted the leave and then ran me out of disk
>>     space on all the other nodes, with no way to abort or recover.
>>
>>     My only option was to add more space to the other nodes (as you
>>     said, adding new nodes will not work until the leave is done), which
>>     is easy enough in a virtualized environment but requires downtime.
>>     In a bare metal environment, it could be catastrophic to the
>>     cluster.
>>
>>
>>     On 4/15/2014 12:19, Luke Bakken wrote:
>>
>>         Hi Allen,
>>
>>         Cluster leave does not check for disk space and in general, Riak
>>         is not aware of how much space it has available to itself (most
>>         db systems don't monitor disk space I think). I'll send a note to
>>         product management about this. We recommend using a monitoring
>>         solution (like collectd + graphite) to keep an eye on available
>>         disk space.
>>
>>
>>         --
>>         Luke Bakken
>>         CSE
>>         [email protected]
>>
>>

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
