Re: [Gluster-devel] Split-brain present and future in afr

2014-05-23 Thread Justin Clift
On 23/05/2014, at 10:17 AM, Pranith Kumar Karampuri wrote:

> 2) That would need more bricks, more processes, more ports.


Meh to "more ports".  We should be moving to a model (maybe in 4.x?)
where we use less ports.  Preferably just one or two in total if its
feasible from a network layer.  Backup applications can manage it,
and they're transferring a tonne of data too. ;)

+ Justin

--
Open Source and Standards @ Red Hat

twitter.com/realjustinclift



Re: [Gluster-devel] Split-brain present and future in afr

2014-05-23 Thread Jeff Darcy
> > Constantly filtering requests to use either N or N+1 bricks is going to be
> > complicated and hard to debug.  Every data-structure allocation or loop
> > based on replica count will have to be examined, and many will have to be
> > modified.  That's a *lot* of places.  This also overlaps significantly
> > with functionality that can be achieved with data classification (i.e.
> > supporting multiple replica levels within the same volume).  What use case
> > requires that it be implemented within AFR instead of more generally and
> > flexibly?
> 
> 1) It still wouldn't bring in an arbiter for replica 2.

It's functionally the same, just implemented in a more modular fashion.
Either way, for the same set of data that was previously replicated
twice, most data would still be replicated twice but some subset would
be replicated three times.  The "policy filter" is just implemented in a
translator dedicated to the purpose, instead of within AFR.  In addition
to being simpler, this keeps the user experience consistent for setting
this vs. other kinds of policies.

> 2) That would need more bricks, more processes, more ports.

Fewer, actually.  Either approach requires that we split bricks (as the
user sees them).  One way we turn N user bricks into N regular bricks
plus N/2 arbiter bricks.  The other way we turn N user bricks into N
bricks for the replica-2 part and another N for the replica-3 part.
That seems like slightly more, but (a) it's the same user view, and (b)
the number of processes and ports will actually be lower.  Since data
classification is likely to involve splitting bricks many times, and
multi-tenancy likewise, the data classification project is already
scoped to include "multiplexing" multiple bricks into one process on one
port (like HekaFS used to do).  Thus the total number of ports and
processes for an N-brick volume will go back down to N even with the
equivalent of arbiter functionality.
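
To make the counting concrete, here is a rough sketch of the arithmetic
(illustrative Python, not Gluster code; the exact numbers depend on how the
splitting and the planned multiplexing are actually implemented):

# Rough brick/process/port arithmetic for an N-user-brick replica-2 volume.
# Purely illustrative; real counts depend on how bricks are split and on
# whether the planned brick multiplexing lands as described.

def arbiter_in_afr(n_user_bricks):
    # N regular bricks plus one arbiter brick per replica pair.
    bricks = n_user_bricks + n_user_bricks // 2
    # Without multiplexing, each brick is its own process on its own port.
    return {"bricks": bricks, "processes": bricks, "ports": bricks}

def data_classification(n_user_bricks, multiplexed=True):
    # N bricks for the replica-2 subvolume plus N for the replica-3 subvolume.
    bricks = n_user_bricks * 2
    # With multiplexing, the split bricks share processes/ports, so the totals
    # drop back to roughly N (one per original user brick).
    procs = n_user_bricks if multiplexed else bricks
    return {"bricks": bricks, "processes": procs, "ports": procs}

print(arbiter_in_afr(6))              # {'bricks': 9, 'processes': 9, 'ports': 9}
print(data_classification(6))         # {'bricks': 12, 'processes': 6, 'ports': 6}
print(data_classification(6, False))  # {'bricks': 12, 'processes': 12, 'ports': 12}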

Doing "replica 2.5" as part of data classification instead of within AFR
also has other advantages.  For example, it naturally gives us support
for overlapping replica sets - an often requested feature to spread load
more evenly after a failure.  Perhaps most importantly, it doesn't
require separate implementations or debugging for AFRv1, AFRv2, and NSR.

Let's for once put our effort where it will do us most good, instead of
succumbing to "streetlight effect"[1] yet again and hacking on the
components that are most familiar.

[1] http://en.wikipedia.org/wiki/Streetlight_effect


Re: [Gluster-devel] Split-brain present and future in afr

2014-05-23 Thread Pranith Kumar Karampuri


- Original Message -
> From: "Jeff Darcy" 
> To: "Pranith Kumar Karampuri" 
> Cc: "Gluster Devel" 
> Sent: Tuesday, May 20, 2014 10:08:12 PM
> Subject: Re: [Gluster-devel] Split-brain present and future in afr
> 
> > 1. Better protection for split-brain over time.
> > 2. Policy based split-brain resolution.
> > 3. Provide better availability with client quorum and replica 2.
> 
> I would add the following:
> 
> (4) Quorum enforcement - any kind - on by default.

For replica 3 we can do that. For replica 2, the quorum implementation at the
moment is not good enough. Until we fix it properly, maybe we should leave it
as it is. We can revisit that decision once we come up with a better solution
for replica 2.

> 
> (5) Fix the problem of volumes losing quorum because unrelated nodes
> went down (i.e. implement volume-level quorum).
> 
> (6) Better tools for users to resolve split brain themselves.

Agreed. Already in plan for 3.6.

> 
> > For 3, we are planning to introduce arbiter bricks that can be used to
> > determine quorum. The arbiter bricks will be dummy bricks that host only
> > files that will be updated from multiple clients. This will be achieved by
> > bringing about variable replication count for configurable class of files
> > within a volume.
> >  In the case of a replicated volume with one arbiter brick per replica
> >  group,
> >  certain files that are prone to split-brain will be in 3 bricks (2 data
> >  bricks + 1 arbiter brick).  All other files will be present in the regular
> >  data bricks. For example, when oVirt VM disks are hosted on a replica 2
>  volume, sanlock is used by oVirt for arbitration. sanlock lease files
> >  will
> >  be written by all clients and VM disks are written by only a single client
> >  at any given point of time. In this scenario, we can place sanlock lease
> >  files on 2 data + 1 arbiter bricks. The VM disk files will only be present
> >  on the 2 data bricks. Client quorum is now determined by looking at 3
> >  bricks instead of 2 and we have better protection when network
> >  split-brains
> >  happen.
> 
> Constantly filtering requests to use either N or N+1 bricks is going to be
> complicated and hard to debug.  Every data-structure allocation or loop
> based on replica count will have to be examined, and many will have to be
> modified.  That's a *lot* of places.  This also overlaps significantly
> with functionality that can be achieved with data classification (i.e.
> supporting multiple replica levels within the same volume).  What use case
> requires that it be implemented within AFR instead of more generally and
> flexibly?

1) It still wouldn't bring in an arbiter for replica 2.
2) That would need more bricks, more processes, more ports.

> 
> 

Pranith


Re: [Gluster-devel] Split-brain present and future in afr

2014-05-20 Thread Jeff Darcy
> 1. Better protection for split-brain over time.
> 2. Policy based split-brain resolution.
> 3. Provide better availability with client quorum and replica 2.

I would add the following:

(4) Quorum enforcement - any kind - on by default.

(5) Fix the problem of volumes losing quorum because unrelated nodes
went down (i.e. implement volume-level quorum).

(6) Better tools for users to resolve split brain themselves.

> For 3, we are planning to introduce arbiter bricks that can be used to
> determine quorum. The arbiter bricks will be dummy bricks that host only
> files that will be updated from multiple clients. This will be achieved by
> bringing about variable replication count for configurable class of files
> within a volume.
>  In the case of a replicated volume with one arbiter brick per replica group,
>  certain files that are prone to split-brain will be in 3 bricks (2 data
>  bricks + 1 arbiter brick).  All other files will be present in the regular
>  data bricks. For example, when oVirt VM disks are hosted on a replica 2
>  volume, sanlock is used by oVirt for arbitration. sanlock lease files will
>  be written by all clients and VM disks are written by only a single client
>  at any given point of time. In this scenario, we can place sanlock lease
>  files on 2 data + 1 arbiter bricks. The VM disk files will only be present
>  on the 2 data bricks. Client quorum is now determined by looking at 3
>  bricks instead of 2 and we have better protection when network split-brains
>  happen.

Constantly filtering requests to use either N or N+1 bricks is going to be
complicated and hard to debug.  Every data-structure allocation or loop
based on replica count will have to be examined, and many will have to be
modified.  That's a *lot* of places.  This also overlaps significantly
with functionality that can be achieved with data classification (i.e.
supporting multiple replica levels within the same volume).  What use case
requires that it be implemented within AFR instead of more generally and
flexibly?



[Gluster-devel] Split-brain present and future in afr

2014-05-20 Thread Pranith Kumar Karampuri
hi,

Thanks to Vijay Bellur for helping with the re-write of the draft I sent him 
:-).

Present:
Split-brains of files happen in afr today due to 2 primary reasons:

1. Split-brains due to network partition or network split-brains

2. Split-brains due to servers in a replicated group being offline at different
points in time, without self-heal happening during the window when all the
servers were online. For further discussion, this is referred to as
split-brain over time.

To prevent the occurrence of split-brains, we have the following quorum
implementations in place:

a> Client quorum - Driven by afr (client); writes are allowed when a majority
of bricks in a replica group are online. The majority is by default N/2 + 1,
where N is the replication factor for files in a volume.

b> Server quorum - Driven by glusterd (server); writes are allowed when a
majority of peers are online. The majority is by default N/2 + 1, where N is
the number of peers in a trusted storage pool.
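
As an illustrative sketch (not the actual afr/glusterd code), the majority
check in both cases is just integer arithmetic:

# Minimal sketch of the N/2 + 1 majority rule behind both quorum types.
# Not the actual afr/glusterd implementation, just the arithmetic.

def majority(n):
    """Minimum number of online members needed for quorum."""
    return n // 2 + 1

def quorum_met(online, total):
    return online >= majority(total)

# Client quorum over a replica group:
assert quorum_met(online=2, total=3)      # replica 3, one brick down: met
assert not quorum_met(online=1, total=2)  # plain replica 2, one brick down: not met

# Server quorum over a trusted storage pool of 5 peers:
assert quorum_met(online=3, total=5)
assert not quorum_met(online=2, total=5)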

Both a> and b> primarily safeguard against network split-brains. The protection
these quorum implementations offer against split-brain over time is not very
high.
Let us consider how replica 3 and replica 2 can be protected against 
split-brains.

Replica 3:
Client quorum is quite effective in this case, as writes are only allowed when
at least 2 of the 3 bricks that form a replica group are seen by afr/client. A
recent fix for a corner-case race in client quorum
(http://review.gluster.org/7600) makes it very robust. This patch is now part
of master and release-3.5. We plan to backport it to release-3.4 too.

Replica 2:
Majority for client quorum in a deployment with 2 bricks per replica group is 
2.  Hence availability becomes a problem with replica 2 when either of the 
bricks is offline. To provide better avaialbility for replica-2, the first 
brick in a replica set is provided higher weight and quorum is met as long as 
the first brick is online. If the first brick is offline, then quorum is lost. 

Let us consider the following cases with B1 and B2 forming a replicated set:
B1        B2        Quorum
Online    Online    Met
Online    Offline   Met
Offline   Online    Not Met
Offline   Offline   Not Met

Though client quorum provides better availability in replica 2 scenarios, it
is not optimal, and an improvement in behavior seems desirable.
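
The replica-2 behavior above can be summarised as a small decision function;
this is only a sketch of the rule as described, not the afr implementation:

# Sketch of the replica-2 client-quorum rule described above, where the first
# brick in the replica set carries the tie-breaking weight. Illustrative only.

def replica2_quorum_met(first_brick_online, second_brick_online):
    if first_brick_online and second_brick_online:
        return True          # both bricks up
    if first_brick_online:
        return True          # only the weighted first brick up
    return False             # first brick down: quorum lost either way

assert replica2_quorum_met(True, True)
assert replica2_quorum_met(True, False)
assert not replica2_quorum_met(False, True)
assert not replica2_quorum_met(False, False)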
Future:

Our focus in afr going forward will be to solve three problems, to provide
better protection against split-brains and better means of resolving them:

1. Better protection for split-brain over time.
2. Policy based split-brain resolution.
3. Provide better availability with client quorum and replica 2.

For 1, implementation of outcasting logic will address the problem:
   - An outcast is a copy of a file on which writes have been performed only
when quorum is met.
   - When a brick goes down and comes back up, the self-heal daemon will mark
the affected files on the brick that just came back up as outcasts. The outcast
marking can be done even before the brick is declared available to regular
clients. Once a copy of a file is marked as needing self-heal (or as an
outcast), writes from clients will not land on that copy until self-heal is
completed and the outcast tag is removed.
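
A minimal sketch of the outcast idea, with made-up names and structures (not
Gluster internals), just to show how writes would be gated until self-heal
clears the tag:

# Minimal sketch of the outcast idea: copies tagged as outcasts stop receiving
# client writes until self-heal completes and clears the tag. Names and
# structures are made up for illustration, not Gluster internals.

class Copy:
    def __init__(self, brick):
        self.brick = brick
        self.outcast = False
        self.data = b""

def mark_outcasts(copies, returned_brick):
    # Done by the self-heal daemon before the brick is exposed to clients again.
    for c in copies:
        if c.brick == returned_brick:
            c.outcast = True

def client_write(copies, data):
    # Client writes land only on copies that are not marked as outcasts.
    targets = [c for c in copies if not c.outcast]
    for c in targets:
        c.data = data
    return targets

def self_heal(copies, source):
    # Copy good data onto each outcast copy, then remove the outcast tag.
    for c in copies:
        if c.outcast:
            c.data = source.data
            c.outcast = False

copies = [Copy("b1"), Copy("b2")]
mark_outcasts(copies, "b2")
assert client_write(copies, b"new") == [copies[0]]   # b2 is skipped
self_heal(copies, source=copies[0])
assert copies[1].data == b"new" and not copies[1].outcast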

For 2, we plan to provide commands that can heal based on user-configurable
policies. Examples of policies would be:
 - Pick the largest file as the winner for resolving a self-heal
 - Choose brick foo as the winner for resolving split-brains
 - Pick the file with the latest version as the winner (when versioning for
files is available).
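
A minimal sketch of what policy-based resolution could look like; the policy
names and file metadata here are hypothetical, not a proposed interface:

# Sketch of policy-based split-brain resolution: a user-chosen policy picks
# the "winner" copy that self-heal would then use as the source. Policy names
# and the FileCopy fields are hypothetical.

from dataclasses import dataclass

@dataclass
class FileCopy:
    brick: str
    size: int
    version: int

def pick_winner(copies, policy, preferred_brick=None):
    if policy == "largest-file":
        return max(copies, key=lambda c: c.size)
    if policy == "source-brick":
        return next(c for c in copies if c.brick == preferred_brick)
    if policy == "latest-version":
        return max(copies, key=lambda c: c.version)
    raise ValueError("unknown policy: %s" % policy)

copies = [FileCopy("brick-a", 4096, 7), FileCopy("brick-b", 8192, 6)]
print(pick_winner(copies, "largest-file").brick)             # brick-b
print(pick_winner(copies, "source-brick", "brick-a").brick)  # brick-a
print(pick_winner(copies, "latest-version").brick)           # brick-a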

For 3, we are planning to introduce arbiter bricks that can be used to 
determine quorum. The arbiter bricks will be dummy bricks that host only files 
that will be updated from multiple clients. This will be achieved by bringing 
about variable replication count for configurable class of files within a 
volume.
 In the case of a replicated volume with one arbiter brick per replica group, 
certain files that are prone to split-brain will be in 3 bricks (2 data bricks 
+ 1 arbiter brick).  All other files will be present in the regular data 
bricks. For example, when oVirt VM disks are hosted on a replica 2 volume, 
sanlock is used by oVirt for arbitration. sanlock lease files will be written 
by all clients and VM disks are written by only a single client at any given 
point of time. In this scenario, we can place sanlock lease files on 2 data + 1 
arbiter bricks. The VM disk files will only be present on the 2 data bricks. 
Client quorum is now determined by looking at 3 bricks instead of 2 and we have 
better protection when network split-brains happen.
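
A hedged sketch of the proposed layout and quorum check, purely to illustrate
the idea; the classification rule, brick names, and functions below are made
up, not a real Gluster interface:

# Sketch of the "2 data + 1 arbiter" placement described above. Files written
# from multiple clients (e.g. sanlock lease files) get the extra arbiter copy;
# quorum for them is then judged over 3 bricks instead of 2. The classification
# rule and brick names are made up for illustration.

DATA_BRICKS = ["data-1", "data-2"]
ARBITER_BRICK = "arbiter-1"

def bricks_for(path):
    # Hypothetical rule: treat sanlock lease files as the split-brain-prone class.
    if path.endswith(".lease"):
        return DATA_BRICKS + [ARBITER_BRICK]
    return list(DATA_BRICKS)

def quorum_met(path, online_bricks):
    placed = bricks_for(path)
    online = sum(1 for b in placed if b in online_bricks)
    return online >= len(placed) // 2 + 1

# Lease file: quorum over 3 bricks, so losing any one brick is tolerated.
assert quorum_met("vm1.lease", {"data-1", "arbiter-1"})
assert not quorum_met("vm1.lease", {"arbiter-1"})
# VM disk data itself lives only on the 2 data bricks.
assert bricks_for("vm1.img") == ["data-1", "data-2"]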
 
 A combination of 1. and 3. does s