On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:
> I'm ok with reverting node-uuid content to the previous format and creating
> a new xattr for the new format. Currently, only rebalance will use it.
>
> The only thing to consider is what can happen if we have a half-upgraded
> cluster where some clients have this change and some not. Can rebalance
> work in this situation? If so, could there be any issue?

I think there shouldn't be any problem, because this is an in-memory xattr, so
layers below afr/ec will only see the node-uuid xattr. This also gives us a
chance to do whatever we want with this xattr in future without any
backward-compatibility problems. You can check
https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507
for how Karthik implemented this in AFR (this got merged accidentally
yesterday, but it looks like this is what we are settling on).

> Xavi
>
> On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar Karampuri <pkara...@redhat.com> wrote:
>
> On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran <nbala...@redhat.com> wrote:
>>
>> On 20 June 2017 at 20:38, Aravinda <avish...@redhat.com> wrote:
>>>
>>> On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:
>>>
>>> Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to
>>> go with the format Aravinda suggested for now. In future we want some
>>> more changes for dht to detect which subvolume went down and came back
>>> up; at that time we will revisit the solution suggested by Xavi.
>>>
>>> Susanth is doing the dht changes
>>> Aravinda is doing geo-rep changes
>>>
>>> Done. Geo-rep patch sent for review: https://review.gluster.org/17582
>>
>> The proposed changes to the node-uuid behaviour (while good) are going to
>> break tiering. Tiering changes will take a little more time to be coded
>> and tested.
>>
>> As this is a regression for 3.11 and a blocker for 3.11.1, I suggest we
>> go back to the original node-uuid behaviour for now so as to unblock the
>> release and target the proposed changes for the next 3.11 releases.
>
> Let me see if I understand the changes correctly. We are restoring the
> behavior of the node-uuid xattr and adding a new xattr for parallel
> rebalance, for both afr and ec, correct? Otherwise that is one more
> regression. If yes, we will also wait for Xavi's inputs. Jeff accidentally
> merged the afr patch yesterday which does these changes. If everyone is in
> agreement, we will leave it as is and add similar changes in ec as well. If
> we are not in agreement, then we will let the discussion progress :-)
>
>> Regards,
>> Nithya
>>
>>> --
>>> Aravinda
>>>
>>> Thanks to all of you guys for the discussions!
>>>
>>> On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:
>>>>
>>>> Hi Aravinda,
>>>>
>>>> On 20/06/17 12:42, Aravinda wrote:
>>>>>
>>>>> I think the following format can be easily adopted by all components.
>>>>>
>>>>> UUIDs of a subvolume are separated by spaces and subvolumes are
>>>>> separated by commas.
>>>>>
>>>>> For example, if node1 and node2 are a replica pair with UUIDs U1 and U2
>>>>> respectively, and node3 and node4 are a replica pair with UUIDs U3 and
>>>>> U4 respectively, then node-uuid can return "U1 U2,U3 U4".
>>>>
>>>> While this is ok for the current implementation, I think this can be
>>>> insufficient if there are more layers of xlators that need to indicate
>>>> some sort of grouping. Some representation that can represent hierarchy
>>>> would be better.
For example: "(U1 U2) (U3 U4)" (we can use spaces or comma >>>> as a separator). >>>> >>>>> >>>>> >>>>> Geo-rep can split by "," and then split by space and take first UUID >>>>> DHT can split the value by space or comma and get unique UUIDs list >>>> >>>> >>>> This doesn't solve the problem I described in the previous email. Some >>>> more logic will need to be added to avoid more than one node from each >>>> replica-set to be active. If we have some explicit hierarchy information in >>>> the node-uuid value, more decisions can be taken. >>>> >>>> An initial proposal I made was this: >>>> >>>> DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2))) >>>> >>>> This is harder to parse, but gives a lot of information: DHT with 2 >>>> subvolumes, each subvolume is an AFR with replica 2 and no arbiters. It's >>>> also easily extensible with any new xlator that changes the layout. >>>> >>>> However maybe this is not the moment to do this, and probably we could >>>> implement this in a new xattr with a better name. >>>> >>>> Xavi >>>> >>>>> >>>>> >>>>> Another question is about the behavior when a node is down, existing >>>>> node-uuid xattr will not return that UUID if a node is down. What is >>>>> the >>>>> behavior with the proposed xattr? >>>>> >>>>> Let me know your thoughts. >>>>> >>>>> regards >>>>> Aravinda VK >>>>> >>>>> On 06/20/2017 03:06 PM, Aravinda wrote: >>>>>> >>>>>> Hi Xavi, >>>>>> >>>>>> On 06/20/2017 02:51 PM, Xavier Hernandez wrote: >>>>>>> >>>>>>> Hi Aravinda, >>>>>>> >>>>>>> On 20/06/17 11:05, Pranith Kumar Karampuri wrote: >>>>>>>> >>>>>>>> Adding more people to get a consensus about this. >>>>>>>> >>>>>>>> On Tue, Jun 20, 2017 at 1:49 PM, Aravinda <avish...@redhat.com >>>>>>>> <mailto:avish...@redhat.com>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> regards >>>>>>>> Aravinda VK >>>>>>>> >>>>>>>> >>>>>>>> On 06/20/2017 01:26 PM, Xavier Hernandez wrote: >>>>>>>> >>>>>>>> Hi Pranith, >>>>>>>> >>>>>>>> adding gluster-devel, Kotresh and Aravinda, >>>>>>>> >>>>>>>> On 20/06/17 09:45, Pranith Kumar Karampuri wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez >>>>>>>> <xhernan...@datalab.es <mailto:xhernan...@datalab.es> >>>>>>>> <mailto:xhernan...@datalab.es >>>>>>>> <mailto:xhernan...@datalab.es>>> wrote: >>>>>>>> >>>>>>>> On 20/06/17 09:31, Pranith Kumar Karampuri wrote: >>>>>>>> >>>>>>>> The way geo-replication works is: >>>>>>>> On each machine, it does getxattr of node-uuid >>>>>>>> and >>>>>>>> check if its >>>>>>>> own uuid >>>>>>>> is present in the list. If it is present then it >>>>>>>> will consider >>>>>>>> it active >>>>>>>> otherwise it will be considered passive. With >>>>>>>> this >>>>>>>> change we are >>>>>>>> giving >>>>>>>> all uuids instead of first-up subvolume. So all >>>>>>>> machines think >>>>>>>> they are >>>>>>>> ACTIVE which is bad apparently. So that is the >>>>>>>> reason. Even I >>>>>>>> felt bad >>>>>>>> that we are doing this change. >>>>>>>> >>>>>>>> >>>>>>>> And what about changing the content of node-uuid to >>>>>>>> include some >>>>>>>> sort of hierarchy ? 
>>>>>>>>
>>>>>>>> For example:
>>>>>>>>
>>>>>>>> a single brick:
>>>>>>>>
>>>>>>>>   NODE(<guid>)
>>>>>>>>
>>>>>>>> AFR/EC:
>>>>>>>>
>>>>>>>>   AFR[2](NODE(<guid>), NODE(<guid>))
>>>>>>>>   EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))
>>>>>>>>
>>>>>>>> DHT:
>>>>>>>>
>>>>>>>>   DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)),
>>>>>>>>          AFR[2](NODE(<guid>), NODE(<guid>)))
>>>>>>>>
>>>>>>>> This gives a lot of information that can be used to take the
>>>>>>>> appropriate decisions.
>>>>>>>>
>>>>>>>> I guess that is not backward compatible. Shall I CC gluster-devel
>>>>>>>> and Kotresh/Aravinda?
>>>>>>>>
>>>>>>>> Is the change we did backward compatible? If we only require the
>>>>>>>> first field to be a GUID to support backward compatibility, we can
>>>>>>>> use something like this:
>>>>>>>>
>>>>>>>> No. But the necessary change can be made to the Geo-rep code as well
>>>>>>>> if the format is changed, since all of these are built/shipped
>>>>>>>> together.
>>>>>>>>
>>>>>>>> Geo-rep uses node-uuid as follows:
>>>>>>>>
>>>>>>>>   list = getxattr(node-uuid)
>>>>>>>>   active_node_uuids = list.split(SPACE)
>>>>>>>>   active_node_flag = True if self.node_id in active_node_uuids
>>>>>>>>                      else False
>>>>>>>
>>>>>>> How was this case solved?
>>>>>>>
>>>>>>> Suppose we have three servers and 2 bricks on each server. A
>>>>>>> replicated volume is created using the following command:
>>>>>>>
>>>>>>>   gluster volume create test replica 2 server1:/brick1 server2:/brick1
>>>>>>>     server2:/brick2 server3:/brick1 server3:/brick2 server1:/brick2
>>>>>>>
>>>>>>> In this case we have three replica-sets:
>>>>>>>
>>>>>>> * server1:/brick1 server2:/brick1
>>>>>>> * server2:/brick2 server3:/brick1
>>>>>>> * server3:/brick2 server1:/brick2
>>>>>>>
>>>>>>> The old AFR implementation for node-uuid always returned the uuid of
>>>>>>> the node of the first brick, so in this case we will get the uuids of
>>>>>>> all three nodes because each of them is the first brick of some
>>>>>>> replica-set.
>>>>>>>
>>>>>>> Does this mean that with this configuration all nodes are active? Is
>>>>>>> this a problem? Is there any other check to avoid this situation if
>>>>>>> it's not good?
>>>>>>
>>>>>> Yes, all Geo-rep workers will become Active and participate in
>>>>>> syncing. Since changelogs will have the same information on replica
>>>>>> bricks, this will lead to duplicate syncing and consume network
>>>>>> bandwidth.
>>>>>>
>>>>>> Node-uuid based Active worker selection has been the default
>>>>>> configuration in Geo-rep till now. Geo-rep also has Meta Volume based
>>>>>> synchronization for Active workers using lock files (this can be opted
>>>>>> into via Geo-rep configuration; with this config node-uuid will not be
>>>>>> used).
>>>>>>
>>>>>> Kotresh proposed a solution to configure which worker becomes Active.
>>>>>> This will give the admin more control over choosing Active workers,
>>>>>> and it will become the default configuration from 3.12:
>>>>>> https://github.com/gluster/glusterfs/issues/244
>>>>>>
>>>>>> --
>>>>>> Aravinda
>>>>>>
>>>>>>> Xavi
>>>>>>>>
>>>>>>>> Bricks:
>>>>>>>>
>>>>>>>>   <guid>
>>>>>>>>
>>>>>>>> AFR/EC:
>>>>>>>>
>>>>>>>>   <guid>(<guid>, <guid>)
>>>>>>>>
>>>>>>>> DHT:
>>>>>>>>
>>>>>>>>   <guid>(<guid>(<guid>, ...), <guid>(<guid>, ...))
>>>>>>>>
>>>>>>>> In this case, AFR and EC would return the same <guid> they returned
>>>>>>>> before the patch, but between '(' and ')' they put the full list of
>>>>>>>> guids of all nodes.
>>>>>>>> The first <guid> can be used by geo-replication. The list after the
>>>>>>>> first <guid> can be used for rebalance.
>>>>>>>>
>>>>>>>> Not sure if there's any user of node-uuid above DHT.
>>>>>>>>
>>>>>>>> Xavi
>>>>>>>>
>>>>>>>> On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez
>>>>>>>> <xhernan...@datalab.es> wrote:
>>>>>>>>
>>>>>>>> Hi Pranith,
>>>>>>>>
>>>>>>>> On 20/06/17 07:53, Pranith Kumar Karampuri wrote:
>>>>>>>>
>>>>>>>> hi Xavi,
>>>>>>>> We all made the mistake of not sending a mail about changing the
>>>>>>>> behavior of the node-uuid xattr so that rebalance can use multiple
>>>>>>>> nodes. Because of this, on geo-rep all the workers are becoming
>>>>>>>> active instead of one per EC/AFR subvolume. So we are frantically
>>>>>>>> trying to restore the functionality of node-uuid and introduce a new
>>>>>>>> xattr for the new behavior. Sunil will be sending out a patch for
>>>>>>>> this.
>>>>>>>>
>>>>>>>> Wouldn't it be better to change the geo-rep behavior to use the new
>>>>>>>> data? I think it's better as it is now, since it gives more
>>>>>>>> information to upper layers so that they can take more accurate
>>>>>>>> decisions.
>>>>>>>>
>>>>>>>> Xavi
>>>>>>>>
>>>>>>>> --
>>>>>>>> Pranith
>>>
>>> --
>>> Pranith
>
> --
> Pranith

--
Pranith
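
To make the parsing rules above concrete, here is a minimal sketch in Python.
It is illustrative only, not the actual glusterfs or geo-rep code; it assumes
the "U1 U2,U3 U4" format Aravinda proposed, and the helper names are made up.

# Sketch of the proposed parsing rules (assumption: replica sets are
# separated by commas, UUIDs inside a set by spaces).

def parse_node_uuid(value):
    """Split the xattr value into a list of replica sets (lists of UUIDs)."""
    return [part.split() for part in value.split(",") if part.strip()]

def georep_is_active(value, own_uuid):
    """Geo-rep rule: split by ",", then by space, and take the first UUID of
    each set; a worker is ACTIVE only if its own UUID is one of them."""
    first_uuids = [uuids[0] for uuids in parse_node_uuid(value) if uuids]
    return own_uuid in first_uuids

def rebalance_uuids(value):
    """DHT/rebalance rule: collect the unique UUIDs across all sets so every
    node can take part in rebalance."""
    seen = []
    for uuids in parse_node_uuid(value):
        for u in uuids:
            if u not in seen:
                seen.append(u)
    return seen

if __name__ == "__main__":
    value = "U1 U2,U3 U4"                  # two replica-2 subvolumes
    print(georep_is_active(value, "U1"))   # True  -> active worker
    print(georep_is_active(value, "U2"))   # False -> passive worker
    print(rebalance_uuids(value))          # ['U1', 'U2', 'U3', 'U4']

With the hierarchical format Xavi proposed (e.g. "DHT[2](AFR[2,0](...), ...)"),
the same two rules would instead need a small parser for the nested notation.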
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel