Re: [Gluster-devel] dht mkdir preop check, afr and (non-)readable afr subvols

2016-06-06 Thread Xavier Hernandez

Hi Raghavendra,

On 06/06/16 10:54, Raghavendra G wrote:



On Wed, Jun 1, 2016 at 12:50 PM, Xavier Hernandez <xhernan...@datalab.es
<mailto:xhernan...@datalab.es>> wrote:

Hi,

On 01/06/16 08:53, Raghavendra Gowdappa wrote:



- Original Message -

From: "Xavier Hernandez" <xhernan...@datalab.es
<mailto:xhernan...@datalab.es>>
To: "Pranith Kumar Karampuri" <pkara...@redhat.com
<mailto:pkara...@redhat.com>>, "Raghavendra G"
<raghaven...@gluster.com <mailto:raghaven...@gluster.com>>
Cc: "Gluster Devel" <gluster-devel@gluster.org
<mailto:gluster-devel@gluster.org>>
Sent: Wednesday, June 1, 2016 11:57:12 AM
Subject: Re: [Gluster-devel] dht mkdir preop check, afr and
(non-)readable afr subvols

Oops, you are right. For entry operations the current
version of the
parent directory is not checked, just to avoid this problem.

This means that mkdir will be sent to all alive subvolumes.
However it
still selects the group of answers that have a minimum
quorum equal or
greater than #bricks - redundancy. So it should be still valid.


What if the quorum is met on "bad" subvolumes? and mkdir was
successful on bad subvolumes? Do we consider mkdir as
successful? If yes, even EC suffers from the problem described
in bz https://bugzilla.redhat.com/show_bug.cgi?id=1341429.


I don't understand the real problem. How could a subvolume of EC be
in a bad state from the point of view of DHT?

If you use xattrs to configure something in the parent directories,
you should have needed to use setxattr or xattrop to do that. These
operations do consider good/bad bricks because they touch inode
metadata. This will only succeed if enough (quorum) bricks have
successfully processed it. If quorum is met but for an error answer,
an error will be reported to DHT and the majority of bricks will be
left in the old state (these should be considered the good
subvolumes). If some brick has succeeded, it will be considered bad
and will be healed. If no quorum is met (even for an error answer),
EIO will be returned and the state of the directory should be
considered unknown/damaged.


Yes. Ideally, dht should use a getxattr for the layout xattr. But, for
performance reasons we thought of overloading mkdir by introducing
pre-operations (done by bricks). With plain dht it is a simple
comparison of xattrs passed as argument and xattrs stored on disk. But,
I failed to include afr and EC in the picture.


I'm still missing something. Looking at the patch that implements this
(http://review.gluster.org/13885), it seems that mkdir fails if the
parent xattr is not correctly set, so it's not possible to create a
directory on a "bad" brick.


If the majority of the subvolumes of ec fail, the whole request will
fail and this failure will be reported to DHT. If the majority succeed,
success will be reported to DHT, even if some of the subvolumes have failed.


Maybe if you give me a specific example I may see the real problem.

Xavi


Hence this issue. How
difficult would it be for EC and AFR to add this kind of check? Is it even
possible for afr and EC to implement this kind of pre-op check with
reasonable complexity?


If a later mkdir checks this value in storage/posix and succeeds in
enough bricks, it necessarily means that it has succeeded in good
bricks, because there cannot be enough bricks with the bad xattr value.

Note that quorum is always > #bricks/2 so we cannot have a quorum
with good and bad bricks at the same time.
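
The counting argument above can be checked with a small sketch. This is
only an illustration, not Gluster code: the function names are made up,
and the quorum size (#bricks - redundancy) is taken from this thread.

```python
from itertools import combinations


def min_quorum(bricks: int, redundancy: int) -> int:
    """Smallest answer group accepted, as stated in this thread."""
    return bricks - redundancy


def any_disjoint_pair(bricks: int, quorum: int) -> bool:
    """Brute force: can two groups of `quorum` bricks share no brick?"""
    groups = [set(c) for c in combinations(range(bricks), quorum)]
    return any(a.isdisjoint(b) for a, b in combinations(groups, 2))


def disjoint_quorums_possible(bricks: int, redundancy: int) -> bool:
    """Closed form of the same question: two groups fit side by side
    only if twice the quorum does not exceed the number of bricks."""
    return 2 * min_quorum(bricks, redundancy) <= bricks
```

With 6 bricks and redundancy 2 the quorum is 4, and any two groups of 4
out of 6 bricks must overlap, so a quorum made only of bad bricks cannot
coexist with a quorum of good ones.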

Xavi




Xavi

On 01/06/16 06:51, Pranith Kumar Karampuri wrote:

Xavi,
But if we keep winding only to good subvolumes,
there is a case
where bad subvolumes will never catch up right? i.e. if
we keep creating
files in same directory and everytime self-heal
completes there are more
entries mounts would have created on the good subvolumes
alone. I think
I must have missed this in the reviews if this is the
current behavior.
It was not in the earlier releases. Right?

Pranith

On Tue, May 31, 2016 at 2:17 PM, Raghavendra G
<raghaven...@gluster.com <mailto:raghaven...@gluster.com>
<mailto:raghaven...@gluster.com
<mailto:raghaven...@gluster.com>>> wrote:



On Tue, May 31, 2016 at 12:37 PM, Xavier Hernandez
<xhernan...

Re: [Gluster-devel] dht mkdir preop check, afr and (non-)readable afr subvols

2016-06-06 Thread Raghavendra G
On Wed, Jun 1, 2016 at 12:50 PM, Xavier Hernandez <xhernan...@datalab.es>
wrote:

> Hi,
>
> On 01/06/16 08:53, Raghavendra Gowdappa wrote:
>
>>
>>
>> - Original Message -
>>
>>> From: "Xavier Hernandez" <xhernan...@datalab.es>
>>> To: "Pranith Kumar Karampuri" <pkara...@redhat.com>, "Raghavendra G" <
>>> raghaven...@gluster.com>
>>> Cc: "Gluster Devel" <gluster-devel@gluster.org>
>>> Sent: Wednesday, June 1, 2016 11:57:12 AM
>>> Subject: Re: [Gluster-devel] dht mkdir preop check, afr and
>>> (non-)readable afr subvols
>>>
>>> Oops, you are right. For entry operations the current version of the
>>> parent directory is not checked, just to avoid this problem.
>>>
>>> This means that mkdir will be sent to all alive subvolumes. However it
>>> still selects the group of answers that have a minimum quorum equal or
>>> greater than #bricks - redundancy. So it should be still valid.
>>>
>>
>> What if the quorum is met on "bad" subvolumes? and mkdir was successful
>> on bad subvolumes? Do we consider mkdir as successful? If yes, even EC
>> suffers from the problem described in bz
>> https://bugzilla.redhat.com/show_bug.cgi?id=1341429.
>>
>
> I don't understand the real problem. How could a subvolume of EC be in a
> bad state from the point of view of DHT?
>
> If you use xattrs to configure something in the parent directories, you
> should have needed to use setxattr or xattrop to do that. These operations
> do consider good/bad bricks because they touch inode metadata. This will
> only succeed if enough (quorum) bricks have successfully processed it. If
> quorum is met but for an error answer, an error will be reported to DHT and
> the majority of bricks will be left in the old state (these should be
> considered the good subvolumes). If some brick has succeeded, it will be
> considered bad and will be healed. If no quorum is met (even for an error
> answer), EIO will be returned and the state of the directory should be
> considered unknown/damaged.
>

Yes. Ideally, dht should use a getxattr for the layout xattr. But, for
performance reasons we thought of overloading mkdir by introducing
pre-operations (done by bricks). With plain dht it is a simple comparison
of xattrs passed as argument and xattrs stored on disk. But, I failed to
include afr and EC in the picture. Hence this issue. How difficult would it
be for EC and AFR to add this kind of check? Is it even possible for afr and
EC to implement this kind of pre-op check with reasonable complexity?
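
For reference, the "simple comparison" in the plain dht case can be
sketched as below. This is a toy model under stated assumptions, not the
code from the patch: the brick is modelled as a dict, brick_mkdir is a
hypothetical name, and returning EIO on a failed check is an assumption
made only for the example.

```python
import errno


def brick_mkdir(brick: dict, name: str, expected_xattrs: dict) -> int:
    """Hypothetical brick-side mkdir with a pre-op check.

    `brick` is a toy model: {"xattrs": {...}, "entries": set()}.
    Returns 0 on success or a positive errno value on failure.
    """
    # Pre-op check: every xattr passed by the caller must match what is
    # stored on disk for the parent directory.
    for key, expected in expected_xattrs.items():
        if brick["xattrs"].get(key) != expected:
            return errno.EIO  # assumption: error code chosen for the sketch
    # Only after the check passes is the actual operation performed.
    if name in brick["entries"]:
        return errno.EEXIST
    brick["entries"].add(name)
    return 0
```

A brick whose stored layout xattr differs from the one sent by the client
refuses the mkdir and creates nothing, which is the behaviour the
discussion about afr and EC builds on.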


> If a later mkdir checks this value in storage/posix and succeeds in enough
> bricks, it necessarily means that it has succeeded in good bricks, because
> there cannot be enough bricks with the bad xattr value.
>
> Note that quorum is always > #bricks/2 so we cannot have a quorum with
> good and bad bricks at the same time.
>
> Xavi
>
>
>
>>
>>> Xavi
>>>
>>> On 01/06/16 06:51, Pranith Kumar Karampuri wrote:
>>>
>>>> Xavi,
>>>> But if we keep winding only to good subvolumes, there is a case
>>>> where bad subvolumes will never catch up right? i.e. if we keep creating
>>>> files in same directory and everytime self-heal completes there are more
>>>> entries mounts would have created on the good subvolumes alone. I think
>>>> I must have missed this in the reviews if this is the current behavior.
>>>> It was not in the earlier releases. Right?
>>>>
>>>> Pranith
>>>>
>>>> On Tue, May 31, 2016 at 2:17 PM, Raghavendra G <raghaven...@gluster.com
>>>> <mailto:raghaven...@gluster.com>> wrote:
>>>>
>>>>
>>>>
>>>> On Tue, May 31, 2016 at 12:37 PM, Xavier Hernandez
>>>> <xhernan...@datalab.es <mailto:xhernan...@datalab.es>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 31/05/16 07:05, Raghavendra Gowdappa wrote:
>>>>
>>>> +gluster-devel, +Xavi
>>>>
>>>> Hi all,
>>>>
>>>> The context is [1], where bricks do pre-operation checks
>>>> before doing a fop and proceed with fop only if pre-op check
>>>> is successful.
>>>>
>>>> @Xavi,
>>>>
>>>> We need your inputs on behavior of EC subvolumes as well.
>>>>
>>>>
>>>> If 

Re: [Gluster-devel] dht mkdir preop check, afr and (non-)readable afr subvols

2016-06-01 Thread Raghavendra Gowdappa


- Original Message -
> From: "Xavier Hernandez" <xhernan...@datalab.es>
> To: "Pranith Kumar Karampuri" <pkara...@redhat.com>, "Raghavendra G" 
> <raghaven...@gluster.com>
> Cc: "Gluster Devel" <gluster-devel@gluster.org>
> Sent: Wednesday, June 1, 2016 11:57:12 AM
> Subject: Re: [Gluster-devel] dht mkdir preop check, afr and (non-)readable 
> afr subvols
> 
> Oops, you are right. For entry operations the current version of the
> parent directory is not checked, just to avoid this problem.
> 
> This means that mkdir will be sent to all alive subvolumes. However it
> still selects the group of answers that have a minimum quorum equal or
> greater than #bricks - redundancy. So it should be still valid.

What if the quorum is met on "bad" subvolumes and mkdir is successful on
them? Do we consider the mkdir successful? If yes, even EC suffers from
the problem described in bz https://bugzilla.redhat.com/show_bug.cgi?id=1341429.

> 
> Xavi
> 
> On 01/06/16 06:51, Pranith Kumar Karampuri wrote:
> > Xavi,
> > But if we keep winding only to good subvolumes, there is a case
> > where bad subvolumes will never catch up right? i.e. if we keep creating
> > files in same directory and everytime self-heal completes there are more
> > entries mounts would have created on the good subvolumes alone. I think
> > I must have missed this in the reviews if this is the current behavior.
> > It was not in the earlier releases. Right?
> >
> > Pranith
> >
> > On Tue, May 31, 2016 at 2:17 PM, Raghavendra G <raghaven...@gluster.com
> > <mailto:raghaven...@gluster.com>> wrote:
> >
> >
> >
> > On Tue, May 31, 2016 at 12:37 PM, Xavier Hernandez
> > <xhernan...@datalab.es <mailto:xhernan...@datalab.es>> wrote:
> >
> > Hi,
> >
> > On 31/05/16 07:05, Raghavendra Gowdappa wrote:
> >
> > +gluster-devel, +Xavi
> >
> > Hi all,
> >
> > The context is [1], where bricks do pre-operation checks
> > before doing a fop and proceed with fop only if pre-op check
> > is successful.
> >
> > @Xavi,
> >
> > We need your inputs on behavior of EC subvolumes as well.
> >
> >
> > If I understand correctly, EC shouldn't have any problems here.
> >
> > EC sends the mkdir request to all subvolumes that are currently
> > considered "good" and tries to combine the answers. Answers that
> > match in return code, errno (if necessary) and xdata contents
> > (except for some special xattrs that are ignored for combination
> > purposes), are grouped.
> >
> > Then it takes the group with more members/answers. If that group
> > has a minimum size of #bricks - redundancy, it is considered the
> > good answer. Otherwise EIO is returned because bricks are in an
> > inconsistent state.
> >
> > If there's any answer in another group, it's considered bad and
> > gets marked so that self-heal will repair it using the good
> > information from the majority of bricks.
> >
> > xdata is combined and returned even if return code is -1.
> >
> > Is that enough to cover the needed behavior ?
> >
> >
> > Thanks Xavi. That's sufficient for the feature in question. One of
> > the main cases I was interested in was what would be the behaviour
> > if mkdir succeeds on "bad" subvolume and fails on "good" subvolume.
> > Since you never wind mkdir to "bad" subvolume(s), this situation
> > never arises.
> >
> >
> >
> >
> > Xavi
> >
> >
> >
> > [1] http://review.gluster.org/13885
> >
> > regards,
> > Raghavendra
> >
> > - Original Message -
> >
> > From: "Pranith Kumar Karampuri" <pkara...@redhat.com
> > <mailto:pkara...@redhat.com>>
> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com
> > <mailto:rgowd...@redhat.com>>
> > Cc: "team-quine-afr" <team-quine-...@redhat.com
> > <mailto:team-quine-...@redhat.com>>, "rhs-zteam"
> > <rhs-zt...@redhat.com <mailto:rhs-zt..

Re: [Gluster-devel] dht mkdir preop check, afr and (non-)readable afr subvols

2016-06-01 Thread Xavier Hernandez
Oops, you are right. For entry operations the current version of the 
parent directory is not checked, just to avoid this problem.


This means that mkdir will be sent to all alive subvolumes. However, it
still selects the group of answers that has a minimum quorum equal to or
greater than #bricks - redundancy. So it should still be valid.


Xavi

On 01/06/16 06:51, Pranith Kumar Karampuri wrote:

Xavi,
But if we keep winding only to good subvolumes, there is a case
where bad subvolumes will never catch up, right? i.e. if we keep creating
files in the same directory, every time self-heal completes there are more
entries that mounts would have created on the good subvolumes alone. I think
I must have missed this in the reviews if this is the current behavior.
It was not in the earlier releases. Right?

Pranith

On Tue, May 31, 2016 at 2:17 PM, Raghavendra G wrote:



On Tue, May 31, 2016 at 12:37 PM, Xavier Hernandez wrote:

Hi,

On 31/05/16 07:05, Raghavendra Gowdappa wrote:

+gluster-devel, +Xavi

Hi all,

The context is [1], where bricks do pre-operation checks
before doing a fop and proceed with fop only if pre-op check
is successful.

@Xavi,

We need your inputs on behavior of EC subvolumes as well.


If I understand correctly, EC shouldn't have any problems here.

EC sends the mkdir request to all subvolumes that are currently
considered "good" and tries to combine the answers. Answers that
match in return code, errno (if necessary) and xdata contents
(except for some special xattrs that are ignored for combination
purposes), are grouped.

Then it takes the group with more members/answers. If that group
has a minimum size of #bricks - redundancy, it is considered the
good answer. Otherwise EIO is returned because bricks are in an
inconsistent state.

If there's any answer in another group, it's considered bad and
gets marked so that self-heal will repair it using the good
information from the majority of bricks.

xdata is combined and returned even if return code is -1.

Is that enough to cover the needed behavior ?


Thanks Xavi. That's sufficient for the feature in question. One of
the main cases I was interested in was what would be the behaviour
if mkdir succeeds on "bad" subvolume and fails on "good" subvolume.
Since you never wind mkdir to "bad" subvolume(s), this situation
never arises.




Xavi



[1] http://review.gluster.org/13885

regards,
Raghavendra

- Original Message -

From: "Pranith Kumar Karampuri"
To: "Raghavendra Gowdappa"
Cc: "team-quine-afr", "rhs-zteam"
Sent: Tuesday, May 31, 2016 10:22:49 AM
Subject: Re: dht mkdir preop check, afr and
(non-)readable afr subvols

I think you should start a discussion on gluster-devel so that Xavi gets a
chance to respond on the mails as well.

On Tue, May 31, 2016 at 10:21 AM, Raghavendra Gowdappa wrote:

Also note that we have plans to extend this pre-op check to all dentry
operations which also depend on the parent layout. So, the discussion needs
to cover all dentry operations like:

1. create
2. mkdir
3. rmdir
4. mknod
5. symlink
6. unlink
7. rename

We also plan to have similar checks in lock codepath for directories too
(planning to use hashed-subvolume as lock-subvolume for directories). So,
more fops :)
8. lk (posix locks)
9. inodelk
10. entrylk

regards,
Raghavendra

- Original Message -

From: "Raghavendra Gowdappa"
To: "team-quine-afr"

Re: [Gluster-devel] dht mkdir preop check, afr and (non-)readable afr subvols

2016-05-31 Thread Raghavendra G
I've filed a bug at [1] to track issue in afr.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1341429

On Tue, May 31, 2016 at 2:17 PM, Raghavendra G 
wrote:

>
>
> On Tue, May 31, 2016 at 12:37 PM, Xavier Hernandez 
> wrote:
>
>> Hi,
>>
>> On 31/05/16 07:05, Raghavendra Gowdappa wrote:
>>
>>> +gluster-devel, +Xavi
>>>
>>> Hi all,
>>>
>>> The context is [1], where bricks do pre-operation checks before doing a
>>> fop and proceed with fop only if pre-op check is successful.
>>>
>>> @Xavi,
>>>
>>> We need your inputs on behavior of EC subvolumes as well.
>>>
>>
>> If I understand correctly, EC shouldn't have any problems here.
>>
>> EC sends the mkdir request to all subvolumes that are currently
>> considered "good" and tries to combine the answers. Answers that match in
>> return code, errno (if necessary) and xdata contents (except for some
>> special xattrs that are ignored for combination purposes), are grouped.
>>
>> Then it takes the group with more members/answers. If that group has a
>> minimum size of #bricks - redundancy, it is considered the good answer.
>> Otherwise EIO is returned because bricks are in an inconsistent state.
>>
>> If there's any answer in another group, it's considered bad and gets
>> marked so that self-heal will repair it using the good information from the
>> majority of bricks.
>>
>> xdata is combined and returned even if return code is -1.
>>
>> Is that enough to cover the needed behavior ?
>>
>
> Thanks Xavi. That's sufficient for the feature in question. One of the
> main cases I was interested in was what would be the behaviour if mkdir
> succeeds on "bad" subvolume and fails on "good" subvolume. Since you never
> wind mkdir to "bad" subvolume(s), this situation never arises.
>
>
>
>>
>> Xavi
>>
>>
>>
>>> [1] http://review.gluster.org/13885
>>>
>>> regards,
>>> Raghavendra
>>>
>>> - Original Message -
>>>
 From: "Pranith Kumar Karampuri" 
 To: "Raghavendra Gowdappa" 
 Cc: "team-quine-afr" , "rhs-zteam" <
 rhs-zt...@redhat.com>
 Sent: Tuesday, May 31, 2016 10:22:49 AM
 Subject: Re: dht mkdir preop check, afr and (non-)readable afr subvols

 I think you should start a discussion on gluster-devel so that Xavi
 gets a
 chance to respond on the mails as well.

 On Tue, May 31, 2016 at 10:21 AM, Raghavendra Gowdappa <
 rgowd...@redhat.com>
 wrote:

> Also note that we have plans to extend this pre-op check to all dentry
> operations which also depend on the parent layout. So, the discussion needs
> to cover all dentry operations like:
>
> 1. create
> 2. mkdir
> 3. rmdir
> 4. mknod
> 5. symlink
> 6. unlink
> 7. rename
>
> We also plan to have similar checks in lock codepath for directories too
> (planning to use hashed-subvolume as lock-subvolume for directories). So,
> more fops :)
> 8. lk (posix locks)
> 9. inodelk
> 10. entrylk
>
> regards,
> Raghavendra
>
> - Original Message -
>
>> From: "Raghavendra Gowdappa" 
>> To: "team-quine-afr" 
>> Cc: "rhs-zteam" 
>> Sent: Tuesday, May 31, 2016 10:15:04 AM
>> Subject: dht mkdir preop check, afr and (non-)readable afr subvols
>>
>> Hi all,
>>
>> I have some queries related to the behavior of afr_mkdir with respect to
>> readable subvols.
>>
>> 1. While winding mkdir to subvols does afr check whether the subvolume is
>> good/readable? Or does it wind to all subvols irrespective of whether a
>> subvol is good/bad? In the latter case, what if
>>    a. mkdir succeeds on non-readable subvolume
>>    b. fails on readable subvolume
>>
>>   What is the result reported to higher layers in the above scenario? If
>>   mkdir is failed, is it cleaned up on non-readable subvolume where it
>>   failed?
>>
>> I am interested in this case as dht-preop check relies on layout xattrs
>> and I assume layout xattrs in particular (and all xattrs in general) are
>> guaranteed to be correct only on a readable subvolume of afr. So, in
>> essence we shouldn't be winding down mkdir on non-readable subvols as
>> whatever the decision brick makes as part of pre-op check is inherently
>> flawed.
>>
>> regards,
>> Raghavendra
>>
> --
 Pranith

 ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
>
> --
> Raghavendra G
>



-- 
Raghavendra G

Re: [Gluster-devel] dht mkdir preop check, afr and (non-)readable afr subvols

2016-05-31 Thread Pranith Kumar Karampuri
Just checked ec code. Looks okay. All entry fops are also updating metadata
and data part of the xattr.

On Tue, May 31, 2016 at 12:37 PM, Xavier Hernandez 
wrote:

> Hi,
>
> On 31/05/16 07:05, Raghavendra Gowdappa wrote:
>
>> +gluster-devel, +Xavi
>>
>> Hi all,
>>
>> The context is [1], where bricks do pre-operation checks before doing a
>> fop and proceed with fop only if pre-op check is successful.
>>
>> @Xavi,
>>
>> We need your inputs on behavior of EC subvolumes as well.
>>
>
> If I understand correctly, EC shouldn't have any problems here.
>
> EC sends the mkdir request to all subvolumes that are currently considered
> "good" and tries to combine the answers. Answers that match in return code,
> errno (if necessary) and xdata contents (except for some special xattrs
> that are ignored for combination purposes), are grouped.
>
> Then it takes the group with more members/answers. If that group has a
> minimum size of #bricks - redundancy, it is considered the good answer.
> Otherwise EIO is returned because bricks are in an inconsistent state.
>
> If there's any answer in another group, it's considered bad and gets
> marked so that self-heal will repair it using the good information from the
> majority of bricks.
>
> xdata is combined and returned even if return code is -1.
>
> Is that enough to cover the needed behavior ?
>
> Xavi
>
>
>
>> [1] http://review.gluster.org/13885
>>
>> regards,
>> Raghavendra
>>
>> - Original Message -
>>
>>> From: "Pranith Kumar Karampuri" 
>>> To: "Raghavendra Gowdappa" 
>>> Cc: "team-quine-afr" , "rhs-zteam" <
>>> rhs-zt...@redhat.com>
>>> Sent: Tuesday, May 31, 2016 10:22:49 AM
>>> Subject: Re: dht mkdir preop check, afr and (non-)readable afr subvols
>>>
>>> I think you should start a discussion on gluster-devel so that Xavi gets
>>> a
>>> chance to respond on the mails as well.
>>>
>>> On Tue, May 31, 2016 at 10:21 AM, Raghavendra Gowdappa <
>>> rgowd...@redhat.com>
>>> wrote:
>>>
>>> Also note that we have plans to extend this pre-op check to all dentry
 operations which also depend on the parent layout. So, the discussion needs
 to cover all dentry operations like:

 1. create
 2. mkdir
 3. rmdir
 4. mknod
 5. symlink
 6. unlink
 7. rename

 We also plan to have similar checks in lock codepath for directories too
 (planning to use hashed-subvolume as lock-subvolume for directories). So,
 more fops :)
 8. lk (posix locks)
 9. inodelk
 10. entrylk

 regards,
 Raghavendra

 - Original Message -

> From: "Raghavendra Gowdappa" 
> To: "team-quine-afr" 
> Cc: "rhs-zteam" 
> Sent: Tuesday, May 31, 2016 10:15:04 AM
> Subject: dht mkdir preop check, afr and (non-)readable afr subvols
>
> Hi all,
>
> I have some queries related to the behavior of afr_mkdir with respect to
> readable subvols.
>
> 1. While winding mkdir to subvols does afr check whether the subvolume is
> good/readable? Or does it wind to all subvols irrespective of whether a
> subvol is good/bad? In the latter case, what if
>    a. mkdir succeeds on non-readable subvolume
>    b. fails on readable subvolume
>
>   What is the result reported to higher layers in the above scenario? If
>   mkdir is failed, is it cleaned up on non-readable subvolume where it
>   failed?
>
> I am interested in this case as dht-preop check relies on layout xattrs
> and I assume layout xattrs in particular (and all xattrs in general) are
> guaranteed to be correct only on a readable subvolume of afr. So, in
> essence we shouldn't be winding down mkdir on non-readable subvols as
> whatever the decision brick makes as part of pre-op check is inherently
> flawed.
>
> regards,
> Raghavendra
>
 --
>>> Pranith
>>>
>>>


-- 
Pranith

Re: [Gluster-devel] dht mkdir preop check, afr and (non-)readable afr subvols

2016-05-31 Thread Xavier Hernandez

Hi,

On 31/05/16 07:05, Raghavendra Gowdappa wrote:

+gluster-devel, +Xavi

Hi all,

The context is [1], where bricks do pre-operation checks before doing a fop and 
proceed with fop only if pre-op check is successful.

@Xavi,

We need your inputs on behavior of EC subvolumes as well.


If I understand correctly, EC shouldn't have any problems here.

EC sends the mkdir request to all subvolumes that are currently 
considered "good" and tries to combine the answers. Answers that match 
in return code, errno (if necessary) and xdata contents (except for some 
special xattrs that are ignored for combination purposes), are grouped.


Then it takes the group with more members/answers. If that group has a 
minimum size of #bricks - redundancy, it is considered the good answer. 
Otherwise EIO is returned because bricks are in an inconsistent state.


If there's any answer in another group, it's considered bad and gets 
marked so that self-heal will repair it using the good information from 
the majority of bricks.


xdata is combined and returned even if return code is -1.
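
The grouping described above can be sketched roughly as follows. This is
a simplified illustration, not the actual EC code: combine_answers, the
shape of the answers and the ignored_keys parameter are all hypothetical
names chosen for the example.

```python
import errno
from collections import defaultdict


def combine_answers(answers, bricks, redundancy, ignored_keys=frozenset()):
    """Group per-brick answers and pick the majority group.

    answers: {brick_id: (ret, err, xdata_dict)}
    Returns (ret, err, bad_bricks). If no group reaches
    bricks - redundancy members, (-1, EIO, []) is returned.
    """
    groups = defaultdict(list)
    for brick_id, (ret, err, xdata) in answers.items():
        # errno only matters for failed answers; some xattrs are ignored.
        key = (
            ret,
            err if ret < 0 else 0,
            tuple(sorted((k, v) for k, v in xdata.items()
                         if k not in ignored_keys)),
        )
        groups[key].append(brick_id)

    best_key, best = max(groups.items(), key=lambda kv: len(kv[1]))
    if len(best) < bricks - redundancy:
        return -1, errno.EIO, []  # inconsistent state, nothing to trust

    # Bricks that answered differently from the majority need healing.
    bad = sorted(b for b in answers if b not in best)
    return best_key[0], best_key[1], bad
```

With 6 bricks and redundancy 2, five matching successes and one failure
give a successful answer with the failing brick marked for heal, while a
3/3 split gives EIO.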

Is that enough to cover the needed behavior ?

Xavi



[1] http://review.gluster.org/13885

regards,
Raghavendra

- Original Message -

From: "Pranith Kumar Karampuri" 
To: "Raghavendra Gowdappa" 
Cc: "team-quine-afr" , "rhs-zteam" 

Sent: Tuesday, May 31, 2016 10:22:49 AM
Subject: Re: dht mkdir preop check, afr and (non-)readable afr subvols

I think you should start a discussion on gluster-devel so that Xavi gets a
chance to respond on the mails as well.

On Tue, May 31, 2016 at 10:21 AM, Raghavendra Gowdappa 
wrote:


Also note that we have plans to extend this pre-op check to all dentry
operations which also depend on the parent layout. So, the discussion needs
to cover all dentry operations like:

1. create
2. mkdir
3. rmdir
4. mknod
5. symlink
6. unlink
7. rename

We also plan to have similar checks in lock codepath for directories too
(planning to use hashed-subvolume as lock-subvolume for directories). So,
more fops :)
8. lk (posix locks)
9. inodelk
10. entrylk

regards,
Raghavendra

- Original Message -

From: "Raghavendra Gowdappa" 
To: "team-quine-afr" 
Cc: "rhs-zteam" 
Sent: Tuesday, May 31, 2016 10:15:04 AM
Subject: dht mkdir preop check, afr and (non-)readable afr subvols

Hi all,

I have some queries related to the behavior of afr_mkdir with respect to
readable subvols.

1. While winding mkdir to subvols does afr check whether the subvolume is
good/readable? Or does it wind to all subvols irrespective of whether a
subvol is good/bad? In the latter case, what if
   a. mkdir succeeds on non-readable subvolume
   b. fails on readable subvolume

  What is the result reported to higher layers in the above scenario? If
  mkdir is failed, is it cleaned up on non-readable subvolume where it
  failed?

I am interested in this case as dht-preop check relies on layout xattrs
and I assume layout xattrs in particular (and all xattrs in general) are
guaranteed to be correct only on a readable subvolume of afr. So, in
essence we shouldn't be winding down mkdir on non-readable subvols as
whatever the decision brick makes as part of pre-op check is inherently
flawed.

regards,
Raghavendra

--
Pranith




Re: [Gluster-devel] dht mkdir preop check, afr and (non-)readable afr subvols

2016-05-30 Thread Raghavendra Gowdappa
+gluster-devel, +Xavi

Hi all,

The context is [1], where bricks do pre-operation checks before doing a fop and 
proceed with fop only if pre-op check is successful.

@Xavi,

We need your inputs on behavior of EC subvolumes as well.

[1] http://review.gluster.org/13885

regards,
Raghavendra

- Original Message -
> From: "Pranith Kumar Karampuri" 
> To: "Raghavendra Gowdappa" 
> Cc: "team-quine-afr" , "rhs-zteam" 
> 
> Sent: Tuesday, May 31, 2016 10:22:49 AM
> Subject: Re: dht mkdir preop check, afr and (non-)readable afr subvols
> 
> I think you should start a discussion on gluster-devel so that Xavi gets a
> chance to respond on the mails as well.
> 
> On Tue, May 31, 2016 at 10:21 AM, Raghavendra Gowdappa 
> wrote:
> 
> > Also note that we have plans to extend this pre-op check to all dentry
> > operations which also depend on the parent layout. So, the discussion
> > needs to cover all dentry operations like:
> >
> > 1. create
> > 2. mkdir
> > 3. rmdir
> > 4. mknod
> > 5. symlink
> > 6. unlink
> > 7. rename
> >
> > We also plan to have similar checks in lock codepath for directories too
> > (planning to use hashed-subvolume as lock-subvolume for directories). So,
> > more fops :)
> > 8. lk (posix locks)
> > 9. inodelk
> > 10. entrylk
> >
> > regards,
> > Raghavendra
> >
> > - Original Message -
> > > From: "Raghavendra Gowdappa" 
> > > To: "team-quine-afr" 
> > > Cc: "rhs-zteam" 
> > > Sent: Tuesday, May 31, 2016 10:15:04 AM
> > > Subject: dht mkdir preop check, afr and (non-)readable afr subvols
> > >
> > > Hi all,
> > >
> > > I have some queries related to the behavior of afr_mkdir with respect to
> > > readable subvols.
> > >
> > > 1. While winding mkdir to subvols does afr check whether the subvolume is
> > > good/readable? Or does it wind to all subvols irrespective of whether a
> > > subvol is good/bad? In the latter case, what if
> > >    a. mkdir succeeds on non-readable subvolume
> > >    b. fails on readable subvolume
> > >
> > >   What is the result reported to higher layers in the above scenario? If
> > >   mkdir is failed, is it cleaned up on non-readable subvolume where it
> > >   failed?
> > >
> > > I am interested in this case as dht-preop check relies on layout xattrs
> > > and I assume layout xattrs in particular (and all xattrs in general) are
> > > guaranteed to be correct only on a readable subvolume of afr. So, in
> > > essence we shouldn't be winding down mkdir on non-readable subvols as
> > > whatever the decision brick makes as part of pre-op check is inherently
> > > flawed.
> > >
> > > regards,
> > > Raghavendra
> --
> Pranith
> 