Re: reversing node removal?

Allen Landsidel Tue, 15 Apr 2014 10:07:25 -0700

Luke,

I understand. I was responding to the capacity planning bit; the cluster had more than enough capacity for day-to-day operations, but not nearly enough to survive a node retiring.

I view it a bit differently from SQL Server and other database products since those are single-machine solutions. The clustering products available for them do warn you if you attempt to retire a node from a cluster and don't have the resources for the other nodes to take over, or prevent it if it's a hard resource limit like disk space.
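
As a rough illustration of the kind of pre-flight check described above, here is a minimal Python sketch. It is not something Riak or riak-admin provides: the function name can_absorb_leave and its input dicts are hypothetical, and it assumes the leaving node's data is spread evenly over the remaining nodes, ignoring handoff overhead and uneven ring ownership.

```python
def can_absorb_leave(used, capacity, leaving):
    """Estimate whether the remaining nodes can take over a leaving node's data.

    used, capacity: dicts mapping node name -> bytes (hypothetical inputs,
    e.g. gathered by a monitoring system).
    Assumption: the leaving node's data is split evenly across the rest.
    Returns (ok, projected) where projected maps node -> fraction of disk used.
    """
    remaining = [node for node in used if node != leaving]
    if not remaining:
        # Nothing left to receive the data.
        return False, {}
    share = used[leaving] / len(remaining)
    projected = {
        node: (used[node] + share) / capacity[node] for node in remaining
    }
    ok = all(fraction < 1.0 for fraction in projected.values())
    return ok, projected


if __name__ == "__main__":
    # Four equally sized nodes, each 80% full (illustrative numbers only).
    used = {"a": 80, "b": 80, "c": 80, "d": 80}
    capacity = {"a": 100, "b": 100, "c": 100, "d": 100}
    ok, projected = can_absorb_leave(used, capacity, "d")
    print(ok, projected)
```

In practice the threshold should leave headroom well below 100%, since a node that is nearly full is already in trouble before the transfer finishes.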

Since my cluster was virtualized, I'm almost 'out of the woods'. The retiring node is down to 6% of the cluster data and should be done sometime tonight. After that, hopefully, I can add some smaller nodes to the cluster and retire the three that are now far larger than I'd like them to be, disk-wise.


Thanks!


On 4/15/2014 12:59, Luke Bakken wrote:

Hi Allen -

I hope you don't take my response as an assignment of blame. I created
the docs issue specifically because this case is not clear nor would I
expect Riak users to "just know" that your situation could happen when
they use "cluster leave" to remove Riak nodes from a cluster.

Every software system makes different decisions about protecting users
from potentially disruptive actions and none protect from all possible
failure scenarios. SQL Server does not protect you from inserting so
much data you fill a disk, for instance.

I'll also follow up with the Product team to discuss more insight into
the outcome of various "riak-admin cluster" operations.

--
Luke Bakken
CSE
[email protected] <mailto:[email protected]>


On Tue, Apr 15, 2014 at 9:44 AM, Allen Landsidel
<[email protected] <mailto:[email protected]>> wrote:

    I realize I made a mistake, it would just be nice if the UI could
    warn me that I was about to do so, especially given the consequences.

    If it was simply showing me how much space each node was using
    (forget a percentage or anything) that would've been enough to avert
    disaster. With four nodes, if they're over 25% capacity (far lower
    than any sensible warning level in a monitoring system), the cluster
    leave is going to fail.  The more nodes you add to the system, the
    lower you'd have to set that warning threshold to alert you that
    you're in a state where you can't safely retire a node.


    On 4/15/2014 12:40, Luke Bakken wrote:

        Hi Allen -

        Failure / node leave situations should be taken into account during
        cluster capacity planning. I've created an issue to more thoroughly
        explain this in our documentation:

        https://github.com/basho/basho_docs/issues/1034

        --
        Luke Bakken
        CSE
        [email protected]



        On Tue, Apr 15, 2014 at 9:28 AM, Allen Landsidel
        <[email protected]> wrote:

             Luke,

             I already do use nagios for that, but the disk space was fine
             before I told one of the nodes to leave the cluster.  That's my
             problem -- there was not enough free space in the cluster for it
             to move all that node's data.  It accepted the leave and then ran
             me out of disk space on all the other nodes, with no way to abort
             or recover.

             My only option was to add more space to the other nodes (as you
             said, adding new nodes will not work until the leave is done),
             which is easy enough in a virtualized environment but requires
             downtime.  In a bare metal environment, it could be catastrophic
             to the cluster.


             On 4/15/2014 12:19, Luke Bakken wrote:

                 Hi Allen,

                 Cluster leave does not check for disk space and in general,
                 Riak is not aware of how much space it has available to
                 itself (most db systems don't monitor disk space I think).
                 I'll send a note to product management about this. We
                 recommend using a monitoring solution (like collectd +
                 graphite) to keep an eye on available disk space.


                 --
                 Luke Bakken
                 CSE
                 [email protected]


_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
