Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-07-07 Thread Xavier Hernandez

On 07/07/17 11:25, Pranith Kumar Karampuri wrote:



On Fri, Jul 7, 2017 at 2:46 PM, Xavier Hernandez > wrote:

On 07/07/17 10:12, Pranith Kumar Karampuri wrote:



On Fri, Jul 7, 2017 at 1:13 PM, Xavier Hernandez

>>
wrote:

Hi Pranith,

On 05/07/17 12:28, Pranith Kumar Karampuri wrote:



On Tue, Jul 4, 2017 at 2:26 PM, Xavier Hernandez

>
 

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-07-07 Thread Xavier Hernandez

Hi Pranith,

On 05/07/17 12:28, Pranith Kumar Karampuri wrote:



On Tue, Jul 4, 2017 at 2:26 PM, Xavier Hernandez > wrote:

Hi Pranith,

On 03/07/17 08:33, Pranith Kumar Karampuri wrote:

Xavi,
  Now that the change has been reverted, we can resume this
discussion and decide on the exact format that considers, tier, dht,
afr, ec. People working geo-rep/dht/afr/ec had an internal
discussion
and we all agreed that this proposal would be a good way forward. I
think once we agree on the format and decide on the initial
encoding/decoding functions of the xattr and this change is
merged, we
can send patches on afr/ec/dht and geo-rep to take it to closure.

Could you propose the new format you have in mind that considers
all of
the xlators?


My idea was to create a new xattr not bound to any particular
function but which could give enough information to be used in many
places.

Currently we have another attribute called glusterfs.pathinfo that
returns hierarchical information about the location of a file. Maybe
we can extend this to unify all these attributes into a single
feature that could be used for multiple purposes.

Since we have time to discuss it, I would like to design it with
more information than we already talked.

First of all, the amount of information that this attribute can
contain is quite big if we expect to have volumes with thousands of
bricks. Even in the most simple case of returning only an UUID, we
can easily go beyond the limit of 64KB.

Consider also, for example, what shard should return when pathinfo
is requested for a file. Probably it should return a list of shards,
each one with all its associated pathinfo. We are talking about big
amounts of data here.

I think this kind of information doesn't fit very well in an
extended attribute. Another think to consider is that most probably
the requester of the data only needs a fragment of it, so we are
generating big amounts of data only to be parsed and reduced later,
dismissing most of it.

What do you think about using a very special virtual file to manage
all this information ? it could be easily read using normal read
fops, so it could manage big amounts of data easily. Also, accessing
only to some parts of the file we could go directly where we want,
avoiding the read of all remaining data.

A very basic idea could be this:

Each xlator would have a reserved area of the file. We can reserve
up to 4GB per xlator (32 bits). The remaining 32 bits of the offset
would indicate the xlator we want to access.

At offset 0 we have generic information about the volume. One of the
the things that this information should include is a basic hierarchy
of the whole volume and the offset for each xlator.

After reading this, the user will seek to the desired offset and
read the information related to the xlator it is interested in.

All the information should be stored in a format easily extensible
that will be kept compatible even if new information is added in the
future (for example doing special mappings of the 32 bits offsets
reserved for the xlator).

For example we can reserve the first megabyte of the xlator area to
have a mapping of attributes with its respective offset.

I think that using a binary format would simplify all this a lot.

Do you think this is a way to explore or should I stop wasting time
here ?


I think this just became a very big feature :-). Shall we just live with
it the way it is now?


I supposed it...

Only thing we need to check is if shard needs to handle this xattr. If 
so, what it should return ? only the UUID's corresponding to the first 
shard or the UUID's of all bricks containing at least one shard ? I 
guess that the first one is enough, but just to be sure...


My proposal was to implement a new xattr, for example glusterfs.layout, 
that contains enough information to be usable in all current use cases.


The idea would be that each xlator that makes a significant change in 
the way or the place where files are stored, should put information in 
this xattr. The information should include:


* Type (basically AFR, EC, DHT, ...)
* Basic configuration (replication and arbiter for AFR, data and 
redundancy for EC, # subvolumes for DHT, shard size for sharding, ...)

* Quorum imposed by the xlator
* UUID data comming from subvolumes (sorted by brick position)
* It should be easily extensible in the future

The last point is very important to avoid the issues we have seen now. 
We must be able to incorporate more information without breaking 
backward compatibility. To do so, we can add tags for each value.


For example, a distribute 2, replica 2 volume with 1 

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-07-05 Thread Pranith Kumar Karampuri
On Tue, Jul 4, 2017 at 2:26 PM, Xavier Hernandez 
wrote:

> Hi Pranith,
>
> On 03/07/17 08:33, Pranith Kumar Karampuri wrote:
>
>> Xavi,
>>   Now that the change has been reverted, we can resume this
>> discussion and decide on the exact format that considers, tier, dht,
>> afr, ec. People working geo-rep/dht/afr/ec had an internal discussion
>> and we all agreed that this proposal would be a good way forward. I
>> think once we agree on the format and decide on the initial
>> encoding/decoding functions of the xattr and this change is merged, we
>> can send patches on afr/ec/dht and geo-rep to take it to closure.
>>
>> Could you propose the new format you have in mind that considers all of
>> the xlators?
>>
>
> My idea was to create a new xattr not bound to any particular function but
> which could give enough information to be used in many places.
>
> Currently we have another attribute called glusterfs.pathinfo that returns
> hierarchical information about the location of a file. Maybe we can extend
> this to unify all these attributes into a single feature that could be used
> for multiple purposes.
>
> Since we have time to discuss it, I would like to design it with more
> information than we already talked.
>
> First of all, the amount of information that this attribute can contain is
> quite big if we expect to have volumes with thousands of bricks. Even in
> the most simple case of returning only an UUID, we can easily go beyond the
> limit of 64KB.
>
> Consider also, for example, what shard should return when pathinfo is
> requested for a file. Probably it should return a list of shards, each one
> with all its associated pathinfo. We are talking about big amounts of data
> here.
>
> I think this kind of information doesn't fit very well in an extended
> attribute. Another think to consider is that most probably the requester of
> the data only needs a fragment of it, so we are generating big amounts of
> data only to be parsed and reduced later, dismissing most of it.
>
> What do you think about using a very special virtual file to manage all
> this information ? it could be easily read using normal read fops, so it
> could manage big amounts of data easily. Also, accessing only to some parts
> of the file we could go directly where we want, avoiding the read of all
> remaining data.
>
> A very basic idea could be this:
>
> Each xlator would have a reserved area of the file. We can reserve up to
> 4GB per xlator (32 bits). The remaining 32 bits of the offset would
> indicate the xlator we want to access.
>
> At offset 0 we have generic information about the volume. One of the the
> things that this information should include is a basic hierarchy of the
> whole volume and the offset for each xlator.
>
> After reading this, the user will seek to the desired offset and read the
> information related to the xlator it is interested in.
>
> All the information should be stored in a format easily extensible that
> will be kept compatible even if new information is added in the future (for
> example doing special mappings of the 32 bits offsets reserved for the
> xlator).
>
> For example we can reserve the first megabyte of the xlator area to have a
> mapping of attributes with its respective offset.
>
> I think that using a binary format would simplify all this a lot.
>
> Do you think this is a way to explore or should I stop wasting time here ?
>

I think this just became a very big feature :-). Shall we just live with it
the way it is now?


>
> Xavi
>
>
>>
>>
>> On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya
>> > wrote:
>>
>>
>>
>> On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez
>> > wrote:
>>
>> That's ok. I'm currently unable to write a patch for this on ec.
>>
>> Sunil is working on this patch.
>>
>> ~Karthik
>>
>> If no one can do it, I can try to do it in 6 - 7 hours...
>>
>> Xavi
>>
>>
>> On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri
>> > wrote:
>>
>>
>>>
>>> On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez
>>> > wrote:
>>>
>>> I'm ok with reverting node-uuid content to the previous
>>> format and create a new xattr for the new format.
>>> Currently, only rebalance will use it.
>>>
>>> Only thing to consider is what can happen if we have a
>>> half upgraded cluster where some clients have this change
>>> and some not. Can rebalance work in this situation ? if
>>> so, could there be any issue ?
>>>
>>>
>>> I think there shouldn't be any problem, because this is
>>> in-memory xattr so layers below afr/ec will only see node-uuid
>>> xattr.
>>> This also 

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-07-04 Thread Xavier Hernandez

Hi Pranith,

On 03/07/17 08:33, Pranith Kumar Karampuri wrote:

Xavi,
  Now that the change has been reverted, we can resume this
discussion and decide on the exact format that considers, tier, dht,
afr, ec. People working geo-rep/dht/afr/ec had an internal discussion
and we all agreed that this proposal would be a good way forward. I
think once we agree on the format and decide on the initial
encoding/decoding functions of the xattr and this change is merged, we
can send patches on afr/ec/dht and geo-rep to take it to closure.

Could you propose the new format you have in mind that considers all of
the xlators?


My idea was to create a new xattr not bound to any particular function 
but which could give enough information to be used in many places.


Currently we have another attribute called glusterfs.pathinfo that 
returns hierarchical information about the location of a file. Maybe we 
can extend this to unify all these attributes into a single feature that 
could be used for multiple purposes.


Since we have time to discuss it, I would like to design it with more 
information than we already talked.


First of all, the amount of information that this attribute can contain 
is quite big if we expect to have volumes with thousands of bricks. Even 
in the most simple case of returning only an UUID, we can easily go 
beyond the limit of 64KB.


Consider also, for example, what shard should return when pathinfo is 
requested for a file. Probably it should return a list of shards, each 
one with all its associated pathinfo. We are talking about big amounts 
of data here.


I think this kind of information doesn't fit very well in an extended 
attribute. Another think to consider is that most probably the requester 
of the data only needs a fragment of it, so we are generating big 
amounts of data only to be parsed and reduced later, dismissing most of it.


What do you think about using a very special virtual file to manage all 
this information ? it could be easily read using normal read fops, so it 
could manage big amounts of data easily. Also, accessing only to some 
parts of the file we could go directly where we want, avoiding the read 
of all remaining data.


A very basic idea could be this:

Each xlator would have a reserved area of the file. We can reserve up to 
4GB per xlator (32 bits). The remaining 32 bits of the offset would 
indicate the xlator we want to access.


At offset 0 we have generic information about the volume. One of the the 
things that this information should include is a basic hierarchy of the 
whole volume and the offset for each xlator.


After reading this, the user will seek to the desired offset and read 
the information related to the xlator it is interested in.


All the information should be stored in a format easily extensible that 
will be kept compatible even if new information is added in the future 
(for example doing special mappings of the 32 bits offsets reserved for 
the xlator).


For example we can reserve the first megabyte of the xlator area to have 
a mapping of attributes with its respective offset.


I think that using a binary format would simplify all this a lot.

Do you think this is a way to explore or should I stop wasting time here ?

Xavi





On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya
> wrote:



On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez
> wrote:

That's ok. I'm currently unable to write a patch for this on ec.

Sunil is working on this patch.

~Karthik

If no one can do it, I can try to do it in 6 - 7 hours...

Xavi


On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri
> wrote:




On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez
> wrote:

I'm ok with reverting node-uuid content to the previous
format and create a new xattr for the new format.
Currently, only rebalance will use it.

Only thing to consider is what can happen if we have a
half upgraded cluster where some clients have this change
and some not. Can rebalance work in this situation ? if
so, could there be any issue ?


I think there shouldn't be any problem, because this is
in-memory xattr so layers below afr/ec will only see node-uuid
xattr.
This also gives us a chance to do whatever we want to do in
future with this xattr without any problems about backward
compatibility.

You can check

https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507


for how karthik implemented this in AFR (this got 

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-07-03 Thread Pranith Kumar Karampuri
Xavi,
  Now that the change has been reverted, we can resume this discussion
and decide on the exact format that considers, tier, dht, afr, ec. People
working geo-rep/dht/afr/ec had an internal discussion and we all agreed
that this proposal would be a good way forward. I think once we agree on
the format and decide on the initial encoding/decoding functions of the
xattr and this change is merged, we can send patches on afr/ec/dht and
geo-rep to take it to closure.

Could you propose the new format you have in mind that considers all of the
xlators?



On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya 
wrote:

>
>
> On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez 
> wrote:
>
>> That's ok. I'm currently unable to write a patch for this on ec.
>
> Sunil is working on this patch.
>
> ~Karthik
>
>> If no one can do it, I can try to do it in 6 - 7 hours...
>>
>> Xavi
>>
>>
>> On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri <
>> pkara...@redhat.com> wrote:
>>
>>
>>
>>
>> On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez 
>> wrote:
>>>
>>> I'm ok with reverting node-uuid content to the previous format and
>>> create a new xattr for the new format. Currently, only rebalance will use
>>> it.
>>>
>>> Only thing to consider is what can happen if we have a half upgraded
>>> cluster where some clients have this change and some not. Can rebalance
>>> work in this situation ? if so, could there be any issue ?
>>
>>
>> I think there shouldn't be any problem, because this is in-memory xattr
>> so layers below afr/ec will only see node-uuid xattr.
>> This also gives us a chance to do whatever we want to do in future with
>> this xattr without any problems about backward compatibility.
>>
>> You can check https://review.gluster.org/#/c
>> /17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507 for how karthik
>> implemented this in AFR (this got merged accidentally yesterday, but looks
>> like this is what we are settling on)
>>
>>
>>>
>>> Xavi
>>>
>>>
>>> On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar Karampuri <
>>> pkara...@redhat.com> wrote:
>>>
>>>
>>>
>>>
>>> On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran <
>>> nbala...@redhat.com> wrote:


 On 20 June 2017 at 20:38, Aravinda  wrote:
>
> On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:
>
> Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to
> go with the format Aravinda suggested for now and in future we wanted some
> more changes for dht to detect which subvolume went down came back up, at
> that time we will revisit the solution suggested by Xavi.
>
> Susanth is doing the dht changes
> Aravinda is doing geo-rep changes
>
> Done. Geo-rep patch sent for review https://review.gluster.org/17582
>
>

 The proposed changes to the node-uuid behaviour (while good) are going
 to break tiering . Tiering changes will take a little more time to be coded
 and tested.

 As this is a regression for 3.11 and a blocker for 3.11.1, I suggest we
 go back to the original node-uuid behaviour for now so as to unblock the
 release and target the proposed changes for the next 3.11 releases.

>>>
>>> Let me see if I understand the changes correctly. We are restoring the
>>> behavior of node-uuid xattr and adding a new xattr for parallel rebalance
>>> for both afr and ec, correct? Otherwise that is one more regression. If
>>> yes, we will also wait for Xavi's inputs. Jeff accidentally merged the afr
>>> patch yesterday which does these changes. If everyone is in agreement, we
>>> will leave it as is and add similar changes in ec as well. If we are not in
>>> agreement, then we will let the discussion progress :-)
>>>
>>>


 Regards,
 Nithya

> --
> Aravinda
>
>
>
> Thanks to all of you guys for the discussions!
>
> On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez <
> xhernan...@datalab.es> wrote:
>>
>> Hi Aravinda,
>>
>> On 20/06/17 12:42, Aravinda wrote:
>>>
>>> I think following format can be easily adopted by all components
>>>
>>> UUIDs of a subvolume are seperated by space and subvolumes are
>>> separated
>>> by comma
>>>
>>> For example, node1 and node2 are replica with U1 and U2 UUIDs
>>> respectively and
>>> node3 and node4 are replica with U3 and U4 UUIDs respectively
>>>
>>> node-uuid can return "U1 U2,U3 U4"
>>
>>
>> While this is ok for current implementation, I think this can be
>> insufficient if there are more layers of xlators that require to indicate
>> some sort of grouping. Some representation that can represent hierarchy
>> would be better. For example: "(U1 U2) (U3 U4)" (we can use spaces or 
>> comma
>> as a separator).
>>
>>>
>>>
>>> Geo-rep can split 

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-21 Thread Karthik Subrahmanya
On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez 
wrote:

> That's ok. I'm currently unable to write a patch for this on ec.

Sunil is working on this patch.

~Karthik

> If no one can do it, I can try to do it in 6 - 7 hours...
>
> Xavi
>
>
> On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
>
>
>
>
> On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez 
> wrote:
>>
>> I'm ok with reverting node-uuid content to the previous format and create
>> a new xattr for the new format. Currently, only rebalance will use it.
>>
>> Only thing to consider is what can happen if we have a half upgraded
>> cluster where some clients have this change and some not. Can rebalance
>> work in this situation ? if so, could there be any issue ?
>
>
> I think there shouldn't be any problem, because this is in-memory xattr so
> layers below afr/ec will only see node-uuid xattr.
> This also gives us a chance to do whatever we want to do in future with
> this xattr without any problems about backward compatibility.
>
> You can check https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/
> src/afr-inode-read.c@1507 for how karthik implemented this in AFR (this
> got merged accidentally yesterday, but looks like this is what we are
> settling on)
>
>
>>
>> Xavi
>>
>>
>> On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar Karampuri <
>> pkara...@redhat.com> wrote:
>>
>>
>>
>>
>> On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran <
>> nbala...@redhat.com> wrote:
>>>
>>>
>>> On 20 June 2017 at 20:38, Aravinda  wrote:

 On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:

 Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to
 go with the format Aravinda suggested for now and in future we wanted some
 more changes for dht to detect which subvolume went down came back up, at
 that time we will revisit the solution suggested by Xavi.

 Susanth is doing the dht changes
 Aravinda is doing geo-rep changes

 Done. Geo-rep patch sent for review https://review.gluster.org/17582


>>>
>>> The proposed changes to the node-uuid behaviour (while good) are going
>>> to break tiering . Tiering changes will take a little more time to be coded
>>> and tested.
>>>
>>> As this is a regression for 3.11 and a blocker for 3.11.1, I suggest we
>>> go back to the original node-uuid behaviour for now so as to unblock the
>>> release and target the proposed changes for the next 3.11 releases.
>>>
>>
>> Let me see if I understand the changes correctly. We are restoring the
>> behavior of node-uuid xattr and adding a new xattr for parallel rebalance
>> for both afr and ec, correct? Otherwise that is one more regression. If
>> yes, we will also wait for Xavi's inputs. Jeff accidentally merged the afr
>> patch yesterday which does these changes. If everyone is in agreement, we
>> will leave it as is and add similar changes in ec as well. If we are not in
>> agreement, then we will let the discussion progress :-)
>>
>>
>>>
>>>
>>> Regards,
>>> Nithya
>>>
 --
 Aravinda



 Thanks to all of you guys for the discussions!

 On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez <
 xhernan...@datalab.es> wrote:
>
> Hi Aravinda,
>
> On 20/06/17 12:42, Aravinda wrote:
>>
>> I think following format can be easily adopted by all components
>>
>> UUIDs of a subvolume are seperated by space and subvolumes are
>> separated
>> by comma
>>
>> For example, node1 and node2 are replica with U1 and U2 UUIDs
>> respectively and
>> node3 and node4 are replica with U3 and U4 UUIDs respectively
>>
>> node-uuid can return "U1 U2,U3 U4"
>
>
> While this is ok for current implementation, I think this can be
> insufficient if there are more layers of xlators that require to indicate
> some sort of grouping. Some representation that can represent hierarchy
> would be better. For example: "(U1 U2) (U3 U4)" (we can use spaces or 
> comma
> as a separator).
>
>>
>>
>> Geo-rep can split by "," and then split by space and take first UUID
>> DHT can split the value by space or comma and get unique UUIDs list
>
>
> This doesn't solve the problem I described in the previous email. Some
> more logic will need to be added to avoid more than one node from each
> replica-set to be active. If we have some explicit hierarchy information 
> in
> the node-uuid value, more decisions can be taken.
>
> An initial proposal I made was this:
>
> DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))
>
> This is harder to parse, but gives a lot of information: DHT with 2
> subvolumes, each subvolume is an AFR with replica 2 and no arbiters. It's
> also easily extensible with any new xlator that changes the 

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-21 Thread Xavier Hernandez

That's ok. I'm currently unable to write a patch for this on ec. If no one can 
do it, I can try to do it in 6 - 7 hours...

Xavi

On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri 
 wrote:
   On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez  
wrote:I'm ok with reverting node-uuid content to the previous format and create 
a new xattr for the new format. Currently, only rebalance will use it.

Only thing to consider is what can happen if we have a half upgraded cluster 
where some clients have this change and some not. Can rebalance work in this 
situation ? if so, could there be any issue ? I think there shouldn't be any 
problem, because this is in-memory xattr so layers below afr/ec will only see 
node-uuid xattr.This also gives us a chance to do whatever we want to do in 
future with this xattr without any problems about backward compatibility.
 You can check 
https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507
 for how karthik implemented this in AFR (this got merged accidentally 
yesterday, but looks like this is what we are settling on) 
Xavi

On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar Karampuri 
 wrote:
   On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran  
wrote: On 20 June 2017 at 20:38, Aravinda  wrote:On 
06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:Xavi, Aravinda and I had a 
discussion on #gluster-dev and we agreed to go with the format Aravinda 
suggested for now and in future we wanted some more changes for dht to detect 
which subvolume went down came back up, at that time we will revisit the 
solution suggested by Xavi.
 Susanth is doing the dht changesAravinda is doing geo-rep changes Done. 
Geo-rep patch sent for review https://review.gluster.org/17582
  The proposed changes to the node-uuid behaviour (while good) are going to 
break tiering . Tiering changes will take a little more time to be coded and 
tested.  As this is a regression for 3.11 and a blocker for 3.11.1, I suggest 
we go back to the original node-uuid behaviour for now so as to unblock the 
release and target the proposed changes for the next 3.11 releases. Let me see 
if I understand the changes correctly. We are restoring the behavior of 
node-uuid xattr and adding a new xattr for parallel rebalance for both afr and 
ec, correct? Otherwise that is one more regression. If yes, we will also wait 
for Xavi's inputs. Jeff accidentally merged the afr patch yesterday which does 
these changes. If everyone is in agreement, we will leave it as is and add 
similar changes in ec as well. If we are not in agreement, then we will let the 
discussion progress :-)   Regards,Nithya--
Aravinda  Thanks to all of you guys for the discussions! On Tue, Jun 20, 2017 
at 5:05 PM, Xavier Hernandez  wrote:Hi Aravinda,

On 20/06/17 12:42, Aravinda wrote:I think following format can be easily 
adopted by all components

UUIDs of a subvolume are seperated by space and subvolumes are separated
by comma

For example, node1 and node2 are replica with U1 and U2 UUIDs
respectively and
node3 and node4 are replica with U3 and U4 UUIDs respectively

node-uuid can return "U1 U2,U3 U4"
While this is ok for current implementation, I think this can be insufficient 
if there are more layers of xlators that require to indicate some sort of 
grouping. Some representation that can represent hierarchy would be better. For 
example: "(U1 U2) (U3 U4)" (we can use spaces or comma as a separator).
 
Geo-rep can split by "," and then split by space and take first UUID
DHT can split the value by space or comma and get unique UUIDs list
This doesn't solve the problem I described in the previous email. Some more 
logic will need to be added to avoid more than one node from each replica-set 
to be active. If we have some explicit hierarchy information in the node-uuid 
value, more decisions can be taken.

An initial proposal I made was this:

DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))

This is harder to parse, but gives a lot of information: DHT with 2 subvolumes, 
each subvolume is an AFR with replica 2 and no arbiters. It's also easily 
extensible with any new xlator that changes the layout.

However maybe this is not the moment to do this, and probably we could 
implement this in a new xattr with a better name.

Xavi 
Another question is about the behavior when a node is down, existing
node-uuid xattr will not return that UUID if a node is down. What is the
behavior with the proposed xattr?

Let me know your thoughts.

regards
Aravinda VK

On 06/20/2017 03:06 PM, Aravinda wrote:Hi Xavi,

On 06/20/2017 02:51 PM, Xavier Hernandez wrote:Hi Aravinda,

On 20/06/17 11:05, Pranith Kumar Karampuri wrote:Adding more people to get a 
consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM, Aravinda > wrote:



Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-21 Thread Pranith Kumar Karampuri
On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez 
wrote:

> I'm ok with reverting node-uuid content to the previous format and create
> a new xattr for the new format. Currently, only rebalance will use it.
>
> Only thing to consider is what can happen if we have a half upgraded
> cluster where some clients have this change and some not. Can rebalance
> work in this situation ? if so, could there be any issue ?
>

I think there shouldn't be any problem, because this is in-memory xattr so
layers below afr/ec will only see node-uuid xattr.
This also gives us a chance to do whatever we want to do in future with
this xattr without any problems about backward compatibility.

You can check
https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507
for how karthik implemented this in AFR (this got merged accidentally
yesterday, but looks like this is what we are settling on)


>
> Xavi
>
>
> On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
>
>
>
>
> On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran  > wrote:
>>
>>
>> On 20 June 2017 at 20:38, Aravinda  wrote:
>>>
>>> On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:
>>>
>>> Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to
>>> go with the format Aravinda suggested for now and in future we wanted some
>>> more changes for dht to detect which subvolume went down came back up, at
>>> that time we will revisit the solution suggested by Xavi.
>>>
>>> Susanth is doing the dht changes
>>> Aravinda is doing geo-rep changes
>>>
>>> Done. Geo-rep patch sent for review https://review.gluster.org/17582
>>>
>>>
>>
>> The proposed changes to the node-uuid behaviour (while good) are going to
>> break tiering . Tiering changes will take a little more time to be coded
>> and tested.
>>
>> As this is a regression for 3.11 and a blocker for 3.11.1, I suggest we
>> go back to the original node-uuid behaviour for now so as to unblock the
>> release and target the proposed changes for the next 3.11 releases.
>>
>
> Let me see if I understand the changes correctly. We are restoring the
> behavior of node-uuid xattr and adding a new xattr for parallel rebalance
> for both afr and ec, correct? Otherwise that is one more regression. If
> yes, we will also wait for Xavi's inputs. Jeff accidentally merged the afr
> patch yesterday which does these changes. If everyone is in agreement, we
> will leave it as is and add similar changes in ec as well. If we are not in
> agreement, then we will let the discussion progress :-)
>
>
>>
>>
>> Regards,
>> Nithya
>>
>>> --
>>> Aravinda
>>>
>>>
>>>
>>> Thanks to all of you guys for the discussions!
>>>
>>> On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez >> > wrote:

 Hi Aravinda,

 On 20/06/17 12:42, Aravinda wrote:
>
> I think following format can be easily adopted by all components
>
> UUIDs of a subvolume are seperated by space and subvolumes are
> separated
> by comma
>
> For example, node1 and node2 are replica with U1 and U2 UUIDs
> respectively and
> node3 and node4 are replica with U3 and U4 UUIDs respectively
>
> node-uuid can return "U1 U2,U3 U4"


 While this is ok for current implementation, I think this can be
 insufficient if there are more layers of xlators that require to indicate
 some sort of grouping. Some representation that can represent hierarchy
 would be better. For example: "(U1 U2) (U3 U4)" (we can use spaces or comma
 as a separator).

>
>
> Geo-rep can split by "," and then split by space and take first UUID
> DHT can split the value by space or comma and get unique UUIDs list


 This doesn't solve the problem I described in the previous email. Some
 more logic will need to be added to avoid more than one node from each
 replica-set to be active. If we have some explicit hierarchy information in
 the node-uuid value, more decisions can be taken.

 An initial proposal I made was this:

 DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))

 This is harder to parse, but gives a lot of information: DHT with 2
 subvolumes, each subvolume is an AFR with replica 2 and no arbiters. It's
 also easily extensible with any new xlator that changes the layout.

 However maybe this is not the moment to do this, and probably we could
 implement this in a new xattr with a better name.

 Xavi

>
>
> Another question is about the behavior when a node is down, existing
> node-uuid xattr will not return that UUID if a node is down. What is
> the
> behavior with the proposed xattr?
>
> Let me know your thoughts.
>
> regards
> Aravinda VK
>
> On 06/20/2017 03:06 PM, Aravinda wrote:
>>
>> Hi 

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-21 Thread Xavier Hernandez

I'm ok with reverting node-uuid content to the previous format and create a new 
xattr for the new format. Currently, only rebalance will use it.

Only thing to consider is what can happen if we have a half upgraded cluster 
where some clients have this change and some not. Can rebalance work in this 
situation ? if so, could there be any issue ?

Xavi

On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar Karampuri 
 wrote:
   On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran  
wrote: On 20 June 2017 at 20:38, Aravinda  wrote:On 
06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:Xavi, Aravinda and I had a 
discussion on #gluster-dev and we agreed to go with the format Aravinda 
suggested for now and in future we wanted some more changes for dht to detect 
which subvolume went down came back up, at that time we will revisit the 
solution suggested by Xavi.
 Susanth is doing the dht changesAravinda is doing geo-rep changes Done. 
Geo-rep patch sent for review https://review.gluster.org/17582
  The proposed changes to the node-uuid behaviour (while good) are going to 
break tiering . Tiering changes will take a little more time to be coded and 
tested.  As this is a regression for 3.11 and a blocker for 3.11.1, I suggest 
we go back to the original node-uuid behaviour for now so as to unblock the 
release and target the proposed changes for the next 3.11 releases. Let me see 
if I understand the changes correctly. We are restoring the behavior of 
node-uuid xattr and adding a new xattr for parallel rebalance for both afr and 
ec, correct? Otherwise that is one more regression. If yes, we will also wait 
for Xavi's inputs. Jeff accidentally merged the afr patch yesterday which does 
these changes. If everyone is in agreement, we will leave it as is and add 
similar changes in ec as well. If we are not in agreement, then we will let the 
discussion progress :-)   Regards,Nithya--
Aravinda  Thanks to all of you guys for the discussions! On Tue, Jun 20, 2017 
at 5:05 PM, Xavier Hernandez  wrote:Hi Aravinda,

On 20/06/17 12:42, Aravinda wrote:I think following format can be easily 
adopted by all components

UUIDs of a subvolume are seperated by space and subvolumes are separated
by comma

For example, node1 and node2 are replica with U1 and U2 UUIDs
respectively and
node3 and node4 are replica with U3 and U4 UUIDs respectively

node-uuid can return "U1 U2,U3 U4"
While this is ok for current implementation, I think this can be insufficient 
if there are more layers of xlators that require to indicate some sort of 
grouping. Some representation that can represent hierarchy would be better. For 
example: "(U1 U2) (U3 U4)" (we can use spaces or comma as a separator).
 
Geo-rep can split by "," and then split by space and take first UUID
DHT can split the value by space or comma and get unique UUIDs list
This doesn't solve the problem I described in the previous email. Some more 
logic will need to be added to avoid more than one node from each replica-set 
to be active. If we have some explicit hierarchy information in the node-uuid 
value, more decisions can be taken.

An initial proposal I made was this:

DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))

This is harder to parse, but gives a lot of information: DHT with 2 subvolumes, 
each subvolume is an AFR with replica 2 and no arbiters. It's also easily 
extensible with any new xlator that changes the layout.

However maybe this is not the moment to do this, and probably we could 
implement this in a new xattr with a better name.

Xavi 
Another question is about the behavior when a node is down, existing
node-uuid xattr will not return that UUID if a node is down. What is the
behavior with the proposed xattr?

Let me know your thoughts.

regards
Aravinda VK

On 06/20/2017 03:06 PM, Aravinda wrote:Hi Xavi,

On 06/20/2017 02:51 PM, Xavier Hernandez wrote:Hi Aravinda,

On 20/06/17 11:05, Pranith Kumar Karampuri wrote:Adding more people to get a 
consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM, Aravinda > wrote:


    regards
    Aravinda VK


    On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

        Hi Pranith,

        adding gluster-devel, Kotresh and Aravinda,

        On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



            On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez
            
            >> wrote:

                On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

                    The way geo-replication works is:
                    On each machine, it does getxattr of node-uuid and
            check if its
                    own uuid
                    is present in the list. If it is present then it
            will consider
                    

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Nithya Balachandran
On 21 June 2017 at 10:26, Pranith Kumar Karampuri 
wrote:

>
>
> On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran  > wrote:
>
>>
>> On 20 June 2017 at 20:38, Aravinda  wrote:
>>
>>> On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:
>>>
>>> Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to
>>> go with the format Aravinda suggested for now and in future we wanted some
>>> more changes for dht to detect which subvolume went down came back up, at
>>> that time we will revisit the solution suggested by Xavi.
>>>
>>> Susanth is doing the dht changes
>>> Aravinda is doing geo-rep changes
>>>
>>> Done. Geo-rep patch sent for review https://review.gluster.org/17582
>>>
>>>
>> The proposed changes to the node-uuid behaviour (while good) are going to
>> break tiering . Tiering changes will take a little more time to be coded
>> and tested.
>>
>> As this is a regression for 3.11 and a blocker for 3.11.1, I suggest we
>> go back to the original node-uuid behaviour for now so as to unblock the
>> release and target the proposed changes for the next 3.11 releases.
>>
>
> Let me see if I understand the changes correctly. We are restoring the
> behavior of node-uuid xattr and adding a new xattr for parallel rebalance
> for both afr and ec, correct?
>

Yes, this is what I understand as well. So geo-rep behaviour does not
change (node-uuid) and rebalance uses the new xattr. :)



> Otherwise that is one more regression. If yes, we will also wait for
> Xavi's inputs. Jeff accidentally merged the afr patch yesterday which does
> these changes. If everyone is in agreement, we will leave it as is and add
> similar changes in ec as well. If we are not in agreement, then we will let
> the discussion progress :-)
>
>
>>
>>
>> Regards,
>> Nithya
>>
>>> --
>>> Aravinda
>>>
>>>
>>>
>>> Thanks to all of you guys for the discussions!
>>>
>>> On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez >> > wrote:
>>>
 Hi Aravinda,

 On 20/06/17 12:42, Aravinda wrote:

> I think following format can be easily adopted by all components
>
> UUIDs of a subvolume are seperated by space and subvolumes are
> separated
> by comma
>
> For example, node1 and node2 are replica with U1 and U2 UUIDs
> respectively and
> node3 and node4 are replica with U3 and U4 UUIDs respectively
>
> node-uuid can return "U1 U2,U3 U4"
>

 While this is ok for current implementation, I think this can be
 insufficient if there are more layers of xlators that require to indicate
 some sort of grouping. Some representation that can represent hierarchy
 would be better. For example: "(U1 U2) (U3 U4)" (we can use spaces or comma
 as a separator).


> Geo-rep can split by "," and then split by space and take first UUID
> DHT can split the value by space or comma and get unique UUIDs list
>

 This doesn't solve the problem I described in the previous email. Some
 more logic will need to be added to avoid more than one node from each
 replica-set to be active. If we have some explicit hierarchy information in
 the node-uuid value, more decisions can be taken.

 An initial proposal I made was this:

 DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))

 This is harder to parse, but gives a lot of information: DHT with 2
 subvolumes, each subvolume is an AFR with replica 2 and no arbiters. It's
 also easily extensible with any new xlator that changes the layout.

 However maybe this is not the moment to do this, and probably we could
 implement this in a new xattr with a better name.

 Xavi



> Another question is about the behavior when a node is down, existing
> node-uuid xattr will not return that UUID if a node is down. What is
> the
> behavior with the proposed xattr?
>
> Let me know your thoughts.
>
> regards
> Aravinda VK
>
> On 06/20/2017 03:06 PM, Aravinda wrote:
>
>> Hi Xavi,
>>
>> On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>>
>>> Hi Aravinda,
>>>
>>> On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
>>>
 Adding more people to get a consensus about this.

 On Tue, Jun 20, 2017 at 1:49 PM, Aravinda > wrote:


 regards
 Aravinda VK


 On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

 Hi Pranith,

 adding gluster-devel, Kotresh and Aravinda,

 On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



 On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez
 

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Pranith Kumar Karampuri
On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran 
wrote:

>
> On 20 June 2017 at 20:38, Aravinda  wrote:
>
>> On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:
>>
>> Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to go
>> with the format Aravinda suggested for now and in future we wanted some
>> more changes for dht to detect which subvolume went down came back up, at
>> that time we will revisit the solution suggested by Xavi.
>>
>> Susanth is doing the dht changes
>> Aravinda is doing geo-rep changes
>>
>> Done. Geo-rep patch sent for review https://review.gluster.org/17582
>>
>>
> The proposed changes to the node-uuid behaviour (while good) are going to
> break tiering . Tiering changes will take a little more time to be coded
> and tested.
>
> As this is a regression for 3.11 and a blocker for 3.11.1, I suggest we go
> back to the original node-uuid behaviour for now so as to unblock the
> release and target the proposed changes for the next 3.11 releases.
>

Let me see if I understand the changes correctly. We are restoring the
behavior of node-uuid xattr and adding a new xattr for parallel rebalance
for both afr and ec, correct? Otherwise that is one more regression. If
yes, we will also wait for Xavi's inputs. Jeff accidentally merged the afr
patch yesterday which does these changes. If everyone is in agreement, we
will leave it as is and add similar changes in ec as well. If we are not in
agreement, then we will let the discussion progress :-)


>
>
> Regards,
> Nithya
>
>> --
>> Aravinda
>>
>>
>>
>> Thanks to all of you guys for the discussions!
>>
>> On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez 
>> wrote:
>>
>>> Hi Aravinda,
>>>
>>> On 20/06/17 12:42, Aravinda wrote:
>>>
 I think following format can be easily adopted by all components

 UUIDs of a subvolume are seperated by space and subvolumes are separated
 by comma

 For example, node1 and node2 are replica with U1 and U2 UUIDs
 respectively and
 node3 and node4 are replica with U3 and U4 UUIDs respectively

 node-uuid can return "U1 U2,U3 U4"

>>>
>>> While this is ok for current implementation, I think this can be
>>> insufficient if there are more layers of xlators that require to indicate
>>> some sort of grouping. Some representation that can represent hierarchy
>>> would be better. For example: "(U1 U2) (U3 U4)" (we can use spaces or comma
>>> as a separator).
>>>
>>>
 Geo-rep can split by "," and then split by space and take first UUID
 DHT can split the value by space or comma and get unique UUIDs list

>>>
>>> This doesn't solve the problem I described in the previous email. Some
>>> more logic will need to be added to avoid more than one node from each
>>> replica-set to be active. If we have some explicit hierarchy information in
>>> the node-uuid value, more decisions can be taken.
>>>
>>> An initial proposal I made was this:
>>>
>>> DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))
>>>
>>> This is harder to parse, but gives a lot of information: DHT with 2
>>> subvolumes, each subvolume is an AFR with replica 2 and no arbiters. It's
>>> also easily extensible with any new xlator that changes the layout.
>>>
>>> However maybe this is not the moment to do this, and probably we could
>>> implement this in a new xattr with a better name.
>>>
>>> Xavi
>>>
>>>
>>>
 Another question is about the behavior when a node is down, existing
 node-uuid xattr will not return that UUID if a node is down. What is the
 behavior with the proposed xattr?

 Let me know your thoughts.

 regards
 Aravinda VK

 On 06/20/2017 03:06 PM, Aravinda wrote:

> Hi Xavi,
>
> On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>
>> Hi Aravinda,
>>
>> On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
>>
>>> Adding more people to get a consensus about this.
>>>
>>> On Tue, Jun 20, 2017 at 1:49 PM, Aravinda >> > wrote:
>>>
>>>
>>> regards
>>> Aravinda VK
>>>
>>>
>>> On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
>>>
>>> Hi Pranith,
>>>
>>> adding gluster-devel, Kotresh and Aravinda,
>>>
>>> On 20/06/17 09:45, Pranith Kumar Karampuri wrote:
>>>
>>>
>>>
>>> On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez
>>> 
>>> >> >> wrote:
>>>
>>> On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
>>>
>>> The way geo-replication works is:
>>> On each machine, it does getxattr of node-uuid
>>> and
>>> check 

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Nithya Balachandran
On 20 June 2017 at 20:38, Aravinda  wrote:

> On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:
>
> Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to go
> with the format Aravinda suggested for now and in future we wanted some
> more changes for dht to detect which subvolume went down came back up, at
> that time we will revisit the solution suggested by Xavi.
>
> Susanth is doing the dht changes
> Aravinda is doing geo-rep changes
>
> Done. Geo-rep patch sent for review https://review.gluster.org/17582
>
>
The proposed changes to the node-uuid behaviour (while good) are going to
break tiering . Tiering changes will take a little more time to be coded
and tested.

As this is a regression for 3.11 and a blocker for 3.11.1, I suggest we go
back to the original node-uuid behaviour for now so as to unblock the
release and target the proposed changes for the next 3.11 releases.


Regards,
Nithya

> --
> Aravinda
>
>
>
> Thanks to all of you guys for the discussions!
>
> On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez 
> wrote:
>
>> Hi Aravinda,
>>
>> On 20/06/17 12:42, Aravinda wrote:
>>
>>> I think following format can be easily adopted by all components
>>>
>>> UUIDs of a subvolume are seperated by space and subvolumes are separated
>>> by comma
>>>
>>> For example, node1 and node2 are replica with U1 and U2 UUIDs
>>> respectively and
>>> node3 and node4 are replica with U3 and U4 UUIDs respectively
>>>
>>> node-uuid can return "U1 U2,U3 U4"
>>>
>>
>> While this is ok for current implementation, I think this can be
>> insufficient if there are more layers of xlators that require to indicate
>> some sort of grouping. Some representation that can represent hierarchy
>> would be better. For example: "(U1 U2) (U3 U4)" (we can use spaces or comma
>> as a separator).
>>
>>
>>> Geo-rep can split by "," and then split by space and take first UUID
>>> DHT can split the value by space or comma and get unique UUIDs list
>>>
>>
>> This doesn't solve the problem I described in the previous email. Some
>> more logic will need to be added to avoid more than one node from each
>> replica-set to be active. If we have some explicit hierarchy information in
>> the node-uuid value, more decisions can be taken.
>>
>> An initial proposal I made was this:
>>
>> DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))
>>
>> This is harder to parse, but gives a lot of information: DHT with 2
>> subvolumes, each subvolume is an AFR with replica 2 and no arbiters. It's
>> also easily extensible with any new xlator that changes the layout.
>>
>> However maybe this is not the moment to do this, and probably we could
>> implement this in a new xattr with a better name.
>>
>> Xavi
>>
>>
>>
>>> Another question is about the behavior when a node is down, existing
>>> node-uuid xattr will not return that UUID if a node is down. What is the
>>> behavior with the proposed xattr?
>>>
>>> Let me know your thoughts.
>>>
>>> regards
>>> Aravinda VK
>>>
>>> On 06/20/2017 03:06 PM, Aravinda wrote:
>>>
 Hi Xavi,

 On 06/20/2017 02:51 PM, Xavier Hernandez wrote:

> Hi Aravinda,
>
> On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
>
>> Adding more people to get a consensus about this.
>>
>> On Tue, Jun 20, 2017 at 1:49 PM, Aravinda > > wrote:
>>
>>
>> regards
>> Aravinda VK
>>
>>
>> On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
>>
>> Hi Pranith,
>>
>> adding gluster-devel, Kotresh and Aravinda,
>>
>> On 20/06/17 09:45, Pranith Kumar Karampuri wrote:
>>
>>
>>
>> On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez
>> 
>> > >> wrote:
>>
>> On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
>>
>> The way geo-replication works is:
>> On each machine, it does getxattr of node-uuid and
>> check if its
>> own uuid
>> is present in the list. If it is present then it
>> will consider
>> it active
>> otherwise it will be considered passive. With this
>> change we are
>> giving
>> all uuids instead of first-up subvolume. So all
>> machines think
>> they are
>> ACTIVE which is bad apparently. So that is the
>> reason. Even I
>> felt bad
>> that we are doing this change.
>>
>>
>> And what about 

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Aravinda

On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:
Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to 
go with the format Aravinda suggested for now and in future we wanted 
some more changes for dht to detect which subvolume went down came 
back up, at that time we will revisit the solution suggested by Xavi.


Susanth is doing the dht changes
Aravinda is doing geo-rep changes

Done. Geo-rep patch sent for review https://review.gluster.org/17582

--
Aravinda



Thanks to all of you guys for the discussions!

On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez 
> wrote:


Hi Aravinda,

On 20/06/17 12:42, Aravinda wrote:

I think following format can be easily adopted by all components

UUIDs of a subvolume are seperated by space and subvolumes are
separated
by comma

For example, node1 and node2 are replica with U1 and U2 UUIDs
respectively and
node3 and node4 are replica with U3 and U4 UUIDs respectively

node-uuid can return "U1 U2,U3 U4"


While this is ok for current implementation, I think this can be
insufficient if there are more layers of xlators that require to
indicate some sort of grouping. Some representation that can
represent hierarchy would be better. For example: "(U1 U2) (U3
U4)" (we can use spaces or comma as a separator).


Geo-rep can split by "," and then split by space and take
first UUID
DHT can split the value by space or comma and get unique UUIDs
list


This doesn't solve the problem I described in the previous email.
Some more logic will need to be added to avoid more than one node
from each replica-set to be active. If we have some explicit
hierarchy information in the node-uuid value, more decisions can
be taken.

An initial proposal I made was this:

DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))

This is harder to parse, but gives a lot of information: DHT with
2 subvolumes, each subvolume is an AFR with replica 2 and no
arbiters. It's also easily extensible with any new xlator that
changes the layout.

However maybe this is not the moment to do this, and probably we
could implement this in a new xattr with a better name.

Xavi



Another question is about the behavior when a node is down,
existing
node-uuid xattr will not return that UUID if a node is down.
What is the
behavior with the proposed xattr?

Let me know your thoughts.

regards
Aravinda VK

On 06/20/2017 03:06 PM, Aravinda wrote:

Hi Xavi,

On 06/20/2017 02:51 PM, Xavier Hernandez wrote:

Hi Aravinda,

On 20/06/17 11:05, Pranith Kumar Karampuri wrote:

Adding more people to get a consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM, Aravinda

>> wrote:


regards
Aravinda VK


On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

Hi Pranith,

adding gluster-devel, Kotresh and Aravinda,

On 20/06/17 09:45, Pranith Kumar Karampuri
wrote:



On Tue, Jun 20, 2017 at 1:12 PM,
Xavier Hernandez

>



Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Pranith Kumar Karampuri
Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to go
with the format Aravinda suggested for now and in future we wanted some
more changes for dht to detect which subvolume went down came back up, at
that time we will revisit the solution suggested by Xavi.

Susanth is doing the dht changes
Aravinda is doing geo-rep changes

Thanks to all of you guys for the discussions!

On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez 
wrote:

> Hi Aravinda,
>
> On 20/06/17 12:42, Aravinda wrote:
>
>> I think following format can be easily adopted by all components
>>
>> UUIDs of a subvolume are seperated by space and subvolumes are separated
>> by comma
>>
>> For example, node1 and node2 are replica with U1 and U2 UUIDs
>> respectively and
>> node3 and node4 are replica with U3 and U4 UUIDs respectively
>>
>> node-uuid can return "U1 U2,U3 U4"
>>
>
> While this is ok for current implementation, I think this can be
> insufficient if there are more layers of xlators that require to indicate
> some sort of grouping. Some representation that can represent hierarchy
> would be better. For example: "(U1 U2) (U3 U4)" (we can use spaces or comma
> as a separator).
>
>
>> Geo-rep can split by "," and then split by space and take first UUID
>> DHT can split the value by space or comma and get unique UUIDs list
>>
>
> This doesn't solve the problem I described in the previous email. Some
> more logic will need to be added to avoid more than one node from each
> replica-set to be active. If we have some explicit hierarchy information in
> the node-uuid value, more decisions can be taken.
>
> An initial proposal I made was this:
>
> DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))
>
> This is harder to parse, but gives a lot of information: DHT with 2
> subvolumes, each subvolume is an AFR with replica 2 and no arbiters. It's
> also easily extensible with any new xlator that changes the layout.
>
> However maybe this is not the moment to do this, and probably we could
> implement this in a new xattr with a better name.
>
> Xavi
>
>
>
>> Another question is about the behavior when a node is down, existing
>> node-uuid xattr will not return that UUID if a node is down. What is the
>> behavior with the proposed xattr?
>>
>> Let me know your thoughts.
>>
>> regards
>> Aravinda VK
>>
>> On 06/20/2017 03:06 PM, Aravinda wrote:
>>
>>> Hi Xavi,
>>>
>>> On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>>>
 Hi Aravinda,

 On 20/06/17 11:05, Pranith Kumar Karampuri wrote:

> Adding more people to get a consensus about this.
>
> On Tue, Jun 20, 2017 at 1:49 PM, Aravinda  > wrote:
>
>
> regards
> Aravinda VK
>
>
> On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
>
> Hi Pranith,
>
> adding gluster-devel, Kotresh and Aravinda,
>
> On 20/06/17 09:45, Pranith Kumar Karampuri wrote:
>
>
>
> On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez
> 
>  >> wrote:
>
> On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
>
> The way geo-replication works is:
> On each machine, it does getxattr of node-uuid and
> check if its
> own uuid
> is present in the list. If it is present then it
> will consider
> it active
> otherwise it will be considered passive. With this
> change we are
> giving
> all uuids instead of first-up subvolume. So all
> machines think
> they are
> ACTIVE which is bad apparently. So that is the
> reason. Even I
> felt bad
> that we are doing this change.
>
>
> And what about changing the content of node-uuid to
> include some
> sort of hierarchy ?
>
> for example:
>
> a single brick:
>
> NODE()
>
> AFR/EC:
>
> AFR[2](NODE(), NODE())
> EC[3,1](NODE(), NODE(), NODE())
>
> DHT:
>
> DHT[2](AFR[2](NODE(), NODE()),
> AFR[2](NODE(),
> NODE()))
>
> This gives a lot of information that can be used to
> take the
> appropriate decisions.
>
>
> I guess that is not backward compatible. Shall I CC
> 

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Xavier Hernandez

Hi Aravinda,

On 20/06/17 12:42, Aravinda wrote:

I think following format can be easily adopted by all components

UUIDs of a subvolume are seperated by space and subvolumes are separated
by comma

For example, node1 and node2 are replica with U1 and U2 UUIDs
respectively and
node3 and node4 are replica with U3 and U4 UUIDs respectively

node-uuid can return "U1 U2,U3 U4"


While this is ok for current implementation, I think this can be 
insufficient if there are more layers of xlators that require to 
indicate some sort of grouping. Some representation that can represent 
hierarchy would be better. For example: "(U1 U2) (U3 U4)" (we can use 
spaces or comma as a separator).




Geo-rep can split by "," and then split by space and take first UUID
DHT can split the value by space or comma and get unique UUIDs list


This doesn't solve the problem I described in the previous email. Some 
more logic will need to be added to avoid more than one node from each 
replica-set to be active. If we have some explicit hierarchy information 
in the node-uuid value, more decisions can be taken.


An initial proposal I made was this:

DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))

This is harder to parse, but gives a lot of information: DHT with 2 
subvolumes, each subvolume is an AFR with replica 2 and no arbiters. 
It's also easily extensible with any new xlator that changes the layout.


However maybe this is not the moment to do this, and probably we could 
implement this in a new xattr with a better name.


Xavi



Another question is about the behavior when a node is down, existing
node-uuid xattr will not return that UUID if a node is down. What is the
behavior with the proposed xattr?

Let me know your thoughts.

regards
Aravinda VK

On 06/20/2017 03:06 PM, Aravinda wrote:

Hi Xavi,

On 06/20/2017 02:51 PM, Xavier Hernandez wrote:

Hi Aravinda,

On 20/06/17 11:05, Pranith Kumar Karampuri wrote:

Adding more people to get a consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM, Aravinda > wrote:


regards
Aravinda VK


On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

Hi Pranith,

adding gluster-devel, Kotresh and Aravinda,

On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez

>> wrote:

On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

The way geo-replication works is:
On each machine, it does getxattr of node-uuid and
check if its
own uuid
is present in the list. If it is present then it
will consider
it active
otherwise it will be considered passive. With this
change we are
giving
all uuids instead of first-up subvolume. So all
machines think
they are
ACTIVE which is bad apparently. So that is the
reason. Even I
felt bad
that we are doing this change.


And what about changing the content of node-uuid to
include some
sort of hierarchy ?

for example:

a single brick:

NODE()

AFR/EC:

AFR[2](NODE(), NODE())
EC[3,1](NODE(), NODE(), NODE())

DHT:

DHT[2](AFR[2](NODE(), NODE()),
AFR[2](NODE(),
NODE()))

This gives a lot of information that can be used to
take the
appropriate decisions.


I guess that is not backward compatible. Shall I CC
gluster-devel and
Kotresh/Aravinda?


Is the change we did backward compatible ? if we only require
the first field to be a GUID to support backward compatibility,
we can use something like this:

No. But the necessary change can be made to Geo-rep code as well if
format is changed, Since all these are built/shipped together.

Geo-rep uses node-id as follows,

list = listxattr(node-uuid)
active_node_uuids = list.split(SPACE)
active_node_flag = True if self.node_id exists in active_node_uuids
else False


How was this case solved ?

suppose we have three servers and 2 bricks in each server. A
replicated volume is created using the following command:

gluster volume create test replica 2 server1:/brick1 server2:/brick1
server2:/brick2 server3:/brick1 server3:/brick1 server1:/brick2

In this case we have three replica-sets:

* server1:/brick1 server2:/brick1
* server2:/brick2 server3:/brick1
* server3:/brick2 server2:/brick2

Old AFR 

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Sunil Kumar Heggodu Gopala Acharya
EC also sends all zeros if the node is down.

Regards,

Sunil kumar Acharya

Senior Software Engineer

Red Hat



T: +91-8067935170 


TRIED. TESTED. TRUSTED. 
On Tue, Jun 20, 2017 at 4:27 PM, Karthik Subrahmanya 
wrote:

>
>
> On Tue, Jun 20, 2017 at 4:12 PM, Aravinda  wrote:
>
>> I think following format can be easily adopted by all components
>>
>> UUIDs of a subvolume are seperated by space and subvolumes are separated
>> by comma
>>
>> For example, node1 and node2 are replica with U1 and U2 UUIDs
>> respectively and
>> node3 and node4 are replica with U3 and U4 UUIDs respectively
>>
>> node-uuid can return "U1 U2,U3 U4"
>>
>> Geo-rep can split by "," and then split by space and take first UUID
>> DHT can split the value by space or comma and get unique UUIDs list
>>
>> Another question is about the behavior when a node is down, existing
>> node-uuid xattr will not return that UUID if a node is down.
>
> After the change [1], if a node is down we send all zeros as the uuid for
> that node, in the list of node uuids.
>
> [1] https://review.gluster.org/#/c/17084/
>
> Regards,
> Karthik
>
>> What is the behavior with the proposed xattr?
>>
>> Let me know your thoughts.
>>
>> regards
>> Aravinda VK
>>
>>
>> On 06/20/2017 03:06 PM, Aravinda wrote:
>>
>>> Hi Xavi,
>>>
>>> On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>>>
 Hi Aravinda,

 On 20/06/17 11:05, Pranith Kumar Karampuri wrote:

> Adding more people to get a consensus about this.
>
> On Tue, Jun 20, 2017 at 1:49 PM, Aravinda  > wrote:
>
>
> regards
> Aravinda VK
>
>
> On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
>
> Hi Pranith,
>
> adding gluster-devel, Kotresh and Aravinda,
>
> On 20/06/17 09:45, Pranith Kumar Karampuri wrote:
>
>
>
> On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez
> 
>  >> wrote:
>
> On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
>
> The way geo-replication works is:
> On each machine, it does getxattr of node-uuid and
> check if its
> own uuid
> is present in the list. If it is present then it
> will consider
> it active
> otherwise it will be considered passive. With this
> change we are
> giving
> all uuids instead of first-up subvolume. So all
> machines think
> they are
> ACTIVE which is bad apparently. So that is the
> reason. Even I
> felt bad
> that we are doing this change.
>
>
> And what about changing the content of node-uuid to
> include some
> sort of hierarchy ?
>
> for example:
>
> a single brick:
>
> NODE()
>
> AFR/EC:
>
> AFR[2](NODE(), NODE())
> EC[3,1](NODE(), NODE(), NODE())
>
> DHT:
>
> DHT[2](AFR[2](NODE(), NODE()),
> AFR[2](NODE(),
> NODE()))
>
> This gives a lot of information that can be used to
> take the
> appropriate decisions.
>
>
> I guess that is not backward compatible. Shall I CC
> gluster-devel and
> Kotresh/Aravinda?
>
>
> Is the change we did backward compatible ? if we only require
> the first field to be a GUID to support backward compatibility,
> we can use something like this:
>
> No. But the necessary change can be made to Geo-rep code as well if
> format is changed, Since all these are built/shipped together.
>
> Geo-rep uses node-id as follows,
>
> list = listxattr(node-uuid)
> active_node_uuids = list.split(SPACE)
> active_node_flag = True if self.node_id exists in active_node_uuids
> else False
>

 How was this case solved ?

 suppose we have three servers and 2 bricks in each server. A replicated
 volume is created using the following command:

 gluster volume create test replica 2 server1:/brick1 server2:/brick1
 server2:/brick2 

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Karthik Subrahmanya
On Tue, Jun 20, 2017 at 4:12 PM, Aravinda  wrote:

> I think following format can be easily adopted by all components
>
> UUIDs of a subvolume are seperated by space and subvolumes are separated
> by comma
>
> For example, node1 and node2 are replica with U1 and U2 UUIDs respectively
> and
> node3 and node4 are replica with U3 and U4 UUIDs respectively
>
> node-uuid can return "U1 U2,U3 U4"
>
> Geo-rep can split by "," and then split by space and take first UUID
> DHT can split the value by space or comma and get unique UUIDs list
>
> Another question is about the behavior when a node is down, existing
> node-uuid xattr will not return that UUID if a node is down.

After the change [1], if a node is down we send all zeros as the uuid for
that node, in the list of node uuids.

[1] https://review.gluster.org/#/c/17084/

Regards,
Karthik

> What is the behavior with the proposed xattr?
>
> Let me know your thoughts.
>
> regards
> Aravinda VK
>
>
> On 06/20/2017 03:06 PM, Aravinda wrote:
>
>> Hi Xavi,
>>
>> On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>>
>>> Hi Aravinda,
>>>
>>> On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
>>>
 Adding more people to get a consensus about this.

 On Tue, Jun 20, 2017 at 1:49 PM, Aravinda > wrote:


 regards
 Aravinda VK


 On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

 Hi Pranith,

 adding gluster-devel, Kotresh and Aravinda,

 On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



 On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez
 
 >> wrote:

 On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

 The way geo-replication works is:
 On each machine, it does getxattr of node-uuid and
 check if its
 own uuid
 is present in the list. If it is present then it
 will consider
 it active
 otherwise it will be considered passive. With this
 change we are
 giving
 all uuids instead of first-up subvolume. So all
 machines think
 they are
 ACTIVE which is bad apparently. So that is the
 reason. Even I
 felt bad
 that we are doing this change.


 And what about changing the content of node-uuid to
 include some
 sort of hierarchy ?

 for example:

 a single brick:

 NODE()

 AFR/EC:

 AFR[2](NODE(), NODE())
 EC[3,1](NODE(), NODE(), NODE())

 DHT:

 DHT[2](AFR[2](NODE(), NODE()),
 AFR[2](NODE(),
 NODE()))

 This gives a lot of information that can be used to
 take the
 appropriate decisions.


 I guess that is not backward compatible. Shall I CC
 gluster-devel and
 Kotresh/Aravinda?


 Is the change we did backward compatible ? if we only require
 the first field to be a GUID to support backward compatibility,
 we can use something like this:

 No. But the necessary change can be made to Geo-rep code as well if
 format is changed, Since all these are built/shipped together.

 Geo-rep uses node-id as follows,

 list = listxattr(node-uuid)
 active_node_uuids = list.split(SPACE)
 active_node_flag = True if self.node_id exists in active_node_uuids
 else False

>>>
>>> How was this case solved ?
>>>
>>> suppose we have three servers and 2 bricks in each server. A replicated
>>> volume is created using the following command:
>>>
>>> gluster volume create test replica 2 server1:/brick1 server2:/brick1
>>> server2:/brick2 server3:/brick1 server3:/brick1 server1:/brick2
>>>
>>> In this case we have three replica-sets:
>>>
>>> * server1:/brick1 server2:/brick1
>>> * server2:/brick2 server3:/brick1
>>> * server3:/brick2 server2:/brick2
>>>
>>> Old AFR implementation for node-uuid always returned the uuid of the
>>> node of the first brick, so in this case we will get the uuid of the three
>>> nodes because all of them are the first brick of a replica-set.
>>>
>>> Does this mean that with this configuration all nodes are active ? Is
>>> this a 

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Aravinda

I think following format can be easily adopted by all components

UUIDs of a subvolume are seperated by space and subvolumes are separated 
by comma


For example, node1 and node2 are replica with U1 and U2 UUIDs 
respectively and

node3 and node4 are replica with U3 and U4 UUIDs respectively

node-uuid can return "U1 U2,U3 U4"

Geo-rep can split by "," and then split by space and take first UUID
DHT can split the value by space or comma and get unique UUIDs list

Another question is about the behavior when a node is down, existing 
node-uuid xattr will not return that UUID if a node is down. What is the 
behavior with the proposed xattr?


Let me know your thoughts.

regards
Aravinda VK

On 06/20/2017 03:06 PM, Aravinda wrote:

Hi Xavi,

On 06/20/2017 02:51 PM, Xavier Hernandez wrote:

Hi Aravinda,

On 20/06/17 11:05, Pranith Kumar Karampuri wrote:

Adding more people to get a consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM, Aravinda > wrote:


regards
Aravinda VK


On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

Hi Pranith,

adding gluster-devel, Kotresh and Aravinda,

On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez

>> wrote:

On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

The way geo-replication works is:
On each machine, it does getxattr of node-uuid and
check if its
own uuid
is present in the list. If it is present then it
will consider
it active
otherwise it will be considered passive. With this
change we are
giving
all uuids instead of first-up subvolume. So all
machines think
they are
ACTIVE which is bad apparently. So that is the
reason. Even I
felt bad
that we are doing this change.


And what about changing the content of node-uuid to
include some
sort of hierarchy ?

for example:

a single brick:

NODE()

AFR/EC:

AFR[2](NODE(), NODE())
EC[3,1](NODE(), NODE(), NODE())

DHT:

DHT[2](AFR[2](NODE(), NODE()),
AFR[2](NODE(),
NODE()))

This gives a lot of information that can be used to 
take the

appropriate decisions.


I guess that is not backward compatible. Shall I CC
gluster-devel and
Kotresh/Aravinda?


Is the change we did backward compatible ? if we only require
the first field to be a GUID to support backward compatibility,
we can use something like this:

No. But the necessary change can be made to Geo-rep code as well if
format is changed, Since all these are built/shipped together.

Geo-rep uses node-id as follows,

list = listxattr(node-uuid)
active_node_uuids = list.split(SPACE)
active_node_flag = True if self.node_id exists in active_node_uuids
else False


How was this case solved ?

suppose we have three servers and 2 bricks in each server. A 
replicated volume is created using the following command:


gluster volume create test replica 2 server1:/brick1 server2:/brick1 
server2:/brick2 server3:/brick1 server3:/brick1 server1:/brick2


In this case we have three replica-sets:

* server1:/brick1 server2:/brick1
* server2:/brick2 server3:/brick1
* server3:/brick2 server2:/brick2

Old AFR implementation for node-uuid always returned the uuid of the 
node of the first brick, so in this case we will get the uuid of the 
three nodes because all of them are the first brick of a replica-set.


Does this mean that with this configuration all nodes are active ? Is 
this a problem ? Is there any other check to avoid this situation if 
it's not good ?
Yes all Geo-rep workers will become Active and participate in syncing. 
Since changelogs will have the same information in replica bricks this 
will lead to duplicate syncing and consuming network bandwidth.


Node-uuid based Active worker is the default configuration in Geo-rep 
till now, Geo-rep also has Meta Volume based syncronization for Active 
worker using lock files.(Can be opted using Geo-rep configuration, 
with this config node-uuid will not be used)


Kotresh proposed a solution to configure which worker to become 
Active. This will give more control to Admin to choose Active workers, 
This will become default configuration from 3.12

https://github.com/gluster/glusterfs/issues/244

--
Aravinda



Xavi

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Aravinda

Hi Xavi,

On 06/20/2017 02:51 PM, Xavier Hernandez wrote:

Hi Aravinda,

On 20/06/17 11:05, Pranith Kumar Karampuri wrote:

Adding more people to get a consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM, Aravinda > wrote:


regards
Aravinda VK


On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

Hi Pranith,

adding gluster-devel, Kotresh and Aravinda,

On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez

>> wrote:

On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

The way geo-replication works is:
On each machine, it does getxattr of node-uuid and
check if its
own uuid
is present in the list. If it is present then it
will consider
it active
otherwise it will be considered passive. With this
change we are
giving
all uuids instead of first-up subvolume. So all
machines think
they are
ACTIVE which is bad apparently. So that is the
reason. Even I
felt bad
that we are doing this change.


And what about changing the content of node-uuid to
include some
sort of hierarchy ?

for example:

a single brick:

NODE()

AFR/EC:

AFR[2](NODE(), NODE())
EC[3,1](NODE(), NODE(), NODE())

DHT:

DHT[2](AFR[2](NODE(), NODE()),
AFR[2](NODE(),
NODE()))

This gives a lot of information that can be used to 
take the

appropriate decisions.


I guess that is not backward compatible. Shall I CC
gluster-devel and
Kotresh/Aravinda?


Is the change we did backward compatible ? if we only require
the first field to be a GUID to support backward compatibility,
we can use something like this:

No. But the necessary change can be made to Geo-rep code as well if
format is changed, Since all these are built/shipped together.

Geo-rep uses node-id as follows,

list = listxattr(node-uuid)
active_node_uuids = list.split(SPACE)
active_node_flag = True if self.node_id exists in active_node_uuids
else False


How was this case solved ?

suppose we have three servers and 2 bricks in each server. A 
replicated volume is created using the following command:


gluster volume create test replica 2 server1:/brick1 server2:/brick1 
server2:/brick2 server3:/brick1 server3:/brick1 server1:/brick2


In this case we have three replica-sets:

* server1:/brick1 server2:/brick1
* server2:/brick2 server3:/brick1
* server3:/brick2 server2:/brick2

Old AFR implementation for node-uuid always returned the uuid of the 
node of the first brick, so in this case we will get the uuid of the 
three nodes because all of them are the first brick of a replica-set.


Does this mean that with this configuration all nodes are active ? Is 
this a problem ? Is there any other check to avoid this situation if 
it's not good ?
Yes all Geo-rep workers will become Active and participate in syncing. 
Since changelogs will have the same information in replica bricks this 
will lead to duplicate syncing and consuming network bandwidth.


Node-uuid based Active worker is the default configuration in Geo-rep 
till now, Geo-rep also has Meta Volume based syncronization for Active 
worker using lock files.(Can be opted using Geo-rep configuration, with 
this config node-uuid will not be used)


Kotresh proposed a solution to configure which worker to become Active. 
This will give more control to Admin to choose Active workers, This will 
become default configuration from 3.12

https://github.com/gluster/glusterfs/issues/244

--
Aravinda



Xavi





Bricks:



AFR/EC:
(, )

DHT:
((, ...), (, ...))

In this case, AFR and EC would return the same  they
returned before the patch, but between '(' and ')' they put the
full list of guid's of all nodes. The first  can be used
by geo-replication. The list after the first  can be used
for rebalance.

Not sure if there's any user of node-uuid above DHT.

Xavi




Xavi


On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez
 >
   

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Xavier Hernandez

Hi Aravinda,

On 20/06/17 11:05, Pranith Kumar Karampuri wrote:

Adding more people to get a consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM, Aravinda > wrote:


regards
Aravinda VK


On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

Hi Pranith,

adding gluster-devel, Kotresh and Aravinda,

On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez

>> wrote:

On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

The way geo-replication works is:
On each machine, it does getxattr of node-uuid and
check if its
own uuid
is present in the list. If it is present then it
will consider
it active
otherwise it will be considered passive. With this
change we are
giving
all uuids instead of first-up subvolume. So all
machines think
they are
ACTIVE which is bad apparently. So that is the
reason. Even I
felt bad
that we are doing this change.


And what about changing the content of node-uuid to
include some
sort of hierarchy ?

for example:

a single brick:

NODE()

AFR/EC:

AFR[2](NODE(), NODE())
EC[3,1](NODE(), NODE(), NODE())

DHT:

DHT[2](AFR[2](NODE(), NODE()),
AFR[2](NODE(),
NODE()))

This gives a lot of information that can be used to take the
appropriate decisions.


I guess that is not backward compatible. Shall I CC
gluster-devel and
Kotresh/Aravinda?


Is the change we did backward compatible ? if we only require
the first field to be a GUID to support backward compatibility,
we can use something like this:

No. But the necessary change can be made to Geo-rep code as well if
format is changed, Since all these are built/shipped together.

Geo-rep uses node-id as follows,

list = listxattr(node-uuid)
active_node_uuids = list.split(SPACE)
active_node_flag = True if self.node_id exists in active_node_uuids
else False


How was this case solved ?

suppose we have three servers and 2 bricks in each server. A replicated 
volume is created using the following command:


gluster volume create test replica 2 server1:/brick1 server2:/brick1 
server2:/brick2 server3:/brick1 server3:/brick1 server1:/brick2


In this case we have three replica-sets:

* server1:/brick1 server2:/brick1
* server2:/brick2 server3:/brick1
* server3:/brick2 server2:/brick2

Old AFR implementation for node-uuid always returned the uuid of the 
node of the first brick, so in this case we will get the uuid of the 
three nodes because all of them are the first brick of a replica-set.


Does this mean that with this configuration all nodes are active ? Is 
this a problem ? Is there any other check to avoid this situation if 
it's not good ?


Xavi





Bricks:



AFR/EC:
(, )

DHT:
((, ...), (, ...))

In this case, AFR and EC would return the same  they
returned before the patch, but between '(' and ')' they put the
full list of guid's of all nodes. The first  can be used
by geo-replication. The list after the first  can be used
for rebalance.

Not sure if there's any user of node-uuid above DHT.

Xavi




Xavi


On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez
 >
 

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Pranith Kumar Karampuri
Adding more people to get a consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM, Aravinda  wrote:

>
> regards
> Aravinda VK
>
>
> On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
>
>> Hi Pranith,
>>
>> adding gluster-devel, Kotresh and Aravinda,
>>
>> On 20/06/17 09:45, Pranith Kumar Karampuri wrote:
>>
>>>
>>>
>>> On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez >> > wrote:
>>>
>>> On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
>>>
>>> The way geo-replication works is:
>>> On each machine, it does getxattr of node-uuid and check if its
>>> own uuid
>>> is present in the list. If it is present then it will consider
>>> it active
>>> otherwise it will be considered passive. With this change we are
>>> giving
>>> all uuids instead of first-up subvolume. So all machines think
>>> they are
>>> ACTIVE which is bad apparently. So that is the reason. Even I
>>> felt bad
>>> that we are doing this change.
>>>
>>>
>>> And what about changing the content of node-uuid to include some
>>> sort of hierarchy ?
>>>
>>> for example:
>>>
>>> a single brick:
>>>
>>> NODE()
>>>
>>> AFR/EC:
>>>
>>> AFR[2](NODE(), NODE())
>>> EC[3,1](NODE(), NODE(), NODE())
>>>
>>> DHT:
>>>
>>> DHT[2](AFR[2](NODE(), NODE()), AFR[2](NODE(),
>>> NODE()))
>>>
>>> This gives a lot of information that can be used to take the
>>> appropriate decisions.
>>>
>>>
>>> I guess that is not backward compatible. Shall I CC gluster-devel and
>>> Kotresh/Aravinda?
>>>
>>
>> Is the change we did backward compatible ? if we only require the first
>> field to be a GUID to support backward compatibility, we can use something
>> like this:
>>
> No. But the necessary change can be made to Geo-rep code as well if format
> is changed, Since all these are built/shipped together.
>
> Geo-rep uses node-id as follows,
>
> list = listxattr(node-uuid)
> active_node_uuids = list.split(SPACE)
> active_node_flag = True if self.node_id exists in active_node_uuids else
> False
>
>
>
>> Bricks:
>>
>> 
>>
>> AFR/EC:
>> (, )
>>
>> DHT:
>> ((, ...), (, ...))
>>
>> In this case, AFR and EC would return the same  they returned
>> before the patch, but between '(' and ')' they put the full list of guid's
>> of all nodes. The first  can be used by geo-replication. The list
>> after the first  can be used for rebalance.
>>
>> Not sure if there's any user of node-uuid above DHT.
>>
>> Xavi
>>
>>
>>>
>>>
>>> Xavi
>>>
>>>
>>> On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez
>>> 
>>> >>
>>> wrote:
>>>
>>> Hi Pranith,
>>>
>>> On 20/06/17 07:53, Pranith Kumar Karampuri wrote:
>>>
>>> hi Xavi,
>>>We all made the mistake of not sending about
>>> changing
>>> behavior of
>>> node-uuid xattr so that rebalance can use multiple nodes
>>> for doing
>>> rebalance. Because of this on geo-rep all the workers
>>> are becoming
>>> active instead of one per EC/AFR subvolume. So we are
>>> frantically trying
>>> to restore the functionality of node-uuid and introduce
>>> a new
>>> xattr for
>>> the new behavior. Sunil will be sending out a patch for
>>> this.
>>>
>>>
>>> Wouldn't it be better to change geo-rep behavior to use the
>>> new data
>>> ? I think it's better as it's now, since it gives more
>>> information
>>> to upper layers so that they can take more accurate
>>> decisions.
>>>
>>> Xavi
>>>
>>>
>>> --
>>> Pranith
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Pranith
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Pranith
>>>
>>
>>
>


-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Aravinda


regards
Aravinda VK

On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

Hi Pranith,

adding gluster-devel, Kotresh and Aravinda,

On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez > wrote:

On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

The way geo-replication works is:
On each machine, it does getxattr of node-uuid and check if its
own uuid
is present in the list. If it is present then it will consider
it active
otherwise it will be considered passive. With this change we are
giving
all uuids instead of first-up subvolume. So all machines think
they are
ACTIVE which is bad apparently. So that is the reason. Even I
felt bad
that we are doing this change.


And what about changing the content of node-uuid to include some
sort of hierarchy ?

for example:

a single brick:

NODE()

AFR/EC:

AFR[2](NODE(), NODE())
EC[3,1](NODE(), NODE(), NODE())

DHT:

DHT[2](AFR[2](NODE(), NODE()), AFR[2](NODE(),
NODE()))

This gives a lot of information that can be used to take the
appropriate decisions.


I guess that is not backward compatible. Shall I CC gluster-devel and
Kotresh/Aravinda?


Is the change we did backward compatible ? if we only require the 
first field to be a GUID to support backward compatibility, we can use 
something like this:
No. But the necessary change can be made to Geo-rep code as well if 
format is changed, Since all these are built/shipped together.


Geo-rep uses node-id as follows,

list = listxattr(node-uuid)
active_node_uuids = list.split(SPACE)
active_node_flag = True if self.node_id exists in active_node_uuids else 
False




Bricks:



AFR/EC:
(, )

DHT:
((, ...), (, ...))

In this case, AFR and EC would return the same  they returned 
before the patch, but between '(' and ')' they put the full list of 
guid's of all nodes. The first  can be used by geo-replication. 
The list after the first  can be used for rebalance.


Not sure if there's any user of node-uuid above DHT.

Xavi





Xavi


On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez

>>
wrote:

Hi Pranith,

On 20/06/17 07:53, Pranith Kumar Karampuri wrote:

hi Xavi,
   We all made the mistake of not sending about 
changing

behavior of
node-uuid xattr so that rebalance can use multiple nodes
for doing
rebalance. Because of this on geo-rep all the workers
are becoming
active instead of one per EC/AFR subvolume. So we are
frantically trying
to restore the functionality of node-uuid and introduce
a new
xattr for
the new behavior. Sunil will be sending out a patch for
this.


Wouldn't it be better to change geo-rep behavior to use the
new data
? I think it's better as it's now, since it gives more
information
to upper layers so that they can take more accurate 
decisions.


Xavi


--
Pranith





--
Pranith





--
Pranith




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Xavier Hernandez

Hi Pranith,

adding gluster-devel, Kotresh and Aravinda,

On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez > wrote:

On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

The way geo-replication works is:
On each machine, it does getxattr of node-uuid and check if its
own uuid
is present in the list. If it is present then it will consider
it active
otherwise it will be considered passive. With this change we are
giving
all uuids instead of first-up subvolume. So all machines think
they are
ACTIVE which is bad apparently. So that is the reason. Even I
felt bad
that we are doing this change.


And what about changing the content of node-uuid to include some
sort of hierarchy ?

for example:

a single brick:

NODE()

AFR/EC:

AFR[2](NODE(), NODE())
EC[3,1](NODE(), NODE(), NODE())

DHT:

DHT[2](AFR[2](NODE(), NODE()), AFR[2](NODE(),
NODE()))

This gives a lot of information that can be used to take the
appropriate decisions.


I guess that is not backward compatible. Shall I CC gluster-devel and
Kotresh/Aravinda?


Is the change we did backward compatible ? if we only require the first 
field to be a GUID to support backward compatibility, we can use 
something like this:


Bricks:



AFR/EC:
(, )

DHT:
((, ...), (, ...))

In this case, AFR and EC would return the same  they returned 
before the patch, but between '(' and ')' they put the full list of 
guid's of all nodes. The first  can be used by geo-replication. 
The list after the first  can be used for rebalance.


Not sure if there's any user of node-uuid above DHT.

Xavi





Xavi


On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez

>>
wrote:

Hi Pranith,

On 20/06/17 07:53, Pranith Kumar Karampuri wrote:

hi Xavi,
   We all made the mistake of not sending about changing
behavior of
node-uuid xattr so that rebalance can use multiple nodes
for doing
rebalance. Because of this on geo-rep all the workers
are becoming
active instead of one per EC/AFR subvolume. So we are
frantically trying
to restore the functionality of node-uuid and introduce
a new
xattr for
the new behavior. Sunil will be sending out a patch for
this.


Wouldn't it be better to change geo-rep behavior to use the
new data
? I think it's better as it's now, since it gives more
information
to upper layers so that they can take more accurate decisions.

Xavi


--
Pranith





--
Pranith





--
Pranith


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel