Re: [Gluster-devel] Issue with posix locks

2019-04-01 Thread Soumya Koduri



On 4/1/19 2:23 PM, Xavi Hernandez wrote:
On Mon, Apr 1, 2019 at 10:15 AM Soumya Koduri > wrote:




On 4/1/19 10:02 AM, Pranith Kumar Karampuri wrote:
 >
 >
 > On Sun, Mar 31, 2019 at 11:29 PM Soumya Koduri
mailto:skod...@redhat.com>
 > >> wrote:
 >
 >
 >
 >     On 3/29/19 11:55 PM, Xavi Hernandez wrote:
 >      > Hi all,
 >      >
 >      > there is one potential problem with posix locks when used in a
 >      > replicated or dispersed volume.
 >      >
 >      > Some background:
 >      >
 >      > Posix locks allow any process to lock a region of a file
multiple
 >     times,
 >      > but a single unlock on a given region will release all
previous
 >     locks.
 >      > Locked regions can be different for each lock request and
they can
 >      > overlap. The resulting lock will cover the union of all locked
 >     regions.
 >      > A single unlock (the region doesn't necessarily need to
match any
 >     of the
 >      > ranges used for locking) will create a "hole" in the currently
 >     locked
 >      > region, independently of how many times a lock request covered
 >     that region.
 >      >
 >      > For this reason, the locks xlator simply combines the
locked regions
 >      > that are requested, but it doesn't track each individual
lock range.
 >      >
 >      > Under normal circumstances this works fine. But there are
some cases
 >      > where this behavior is not sufficient. For example, suppose we
 >     have a
 >      > replica 3 volume with quorum = 2. Given the special nature
of posix
 >      > locks, AFR sends the lock request sequentially to each one
of the
 >      > bricks, to avoid that conflicting lock requests from other
 >     clients could
 >      > require to unlock an already locked region on the client
that has
 >     not
 >      > got enough successful locks (i.e. quorum). An unlock here not
 >     only would
 >      > cancel the current lock request. It would also cancel any
previously
 >      > acquired lock.
 >      >
 >
 >     I may not have fully understood, please correct me. AFAIU, lk
xlator
 >     merges locks only if both the lk-owner and the client opaque
matches.
 >
 >     In the case which you have mentioned above, considering clientA
 >     acquired
 >     locks on majority of quorum (say nodeA and nodeB) and clientB
on nodeC
 >     alone- clientB now has to unlock/cancel the lock it acquired
on nodeC.
 >
 >     You are saying the it could pose a problem if there were already
 >     successful locks taken by clientB for the same region which
would get
 >     unlocked by this particular unlock request..right?
 >
 >     Assuming the previous locks acquired by clientB are shared
(otherwise
 >     clientA wouldn't have got granted lock for the same region on
nodeA &
 >     nodeB), they would still hold true on nodeA & nodeB  as the
unlock
 >     request was sent to only nodeC. Since the majority of quorum
nodes
 >     still
 >     hold the locks by clientB, this isn't serious issue IMO.
 >
 >     I haven't looked into heal part but would like to understand
if this is
 >     really an issue in normal scenarios as well.
 >
 >
 > This is how I understood the code. Consider the following case:
 > Nodes A, B, C have locks with start and end offsets: 5-15 from
mount-1
 > and lock-range 2-3 from mount-2.
 > If mount-1 requests nonblocking lock with lock-range 1-7 and in
parallel
 > lets say mount-2 issued unlock of 2-3 as well.
 >
 > nodeA got unlock from mount-2 with range 2-3 then lock from
mount-1 with
 > range 1-7, so the lock is granted and merged to give 1-15
 > nodeB got lock from mount-1 with range 1-7 before unlock of 2-3
which
 > leads to EAGAIN which will trigger unlocks on granted lock in
mount-1
 > which will end up doing unlock of 1-7 on nodeA leading to lock-range
 > 8-15 instead of the original 5-15 on nodeA. Whereas nodeB and
nodeC will
 > have range 5-15.
 >
 > Let me know if my understanding is wrong.

Both of us mentioned the same points. So in the example you gave ,
mount-1 lost its previous lock on nodeA but majority of the quorum
(nodeB and nodeC) still have the previous lock  (range: 5-15)
intact. So
this shouldn't ideally lead to any issues as other conflicting locks
are
blocked or failed by majority of the nodes (provided there are no brick
dis/re-connects).


But brick disconnects will happen (upgrades, disk failures, server 
maintenance, ...). Anyway, even without brick 

Re: [Gluster-devel] Issue with posix locks

2019-04-01 Thread Xavi Hernandez
On Mon, Apr 1, 2019 at 10:15 AM Soumya Koduri  wrote:

>
>
> On 4/1/19 10:02 AM, Pranith Kumar Karampuri wrote:
> >
> >
> > On Sun, Mar 31, 2019 at 11:29 PM Soumya Koduri  > > wrote:
> >
> >
> >
> > On 3/29/19 11:55 PM, Xavi Hernandez wrote:
> >  > Hi all,
> >  >
> >  > there is one potential problem with posix locks when used in a
> >  > replicated or dispersed volume.
> >  >
> >  > Some background:
> >  >
> >  > Posix locks allow any process to lock a region of a file multiple
> > times,
> >  > but a single unlock on a given region will release all previous
> > locks.
> >  > Locked regions can be different for each lock request and they can
> >  > overlap. The resulting lock will cover the union of all locked
> > regions.
> >  > A single unlock (the region doesn't necessarily need to match any
> > of the
> >  > ranges used for locking) will create a "hole" in the currently
> > locked
> >  > region, independently of how many times a lock request covered
> > that region.
> >  >
> >  > For this reason, the locks xlator simply combines the locked
> regions
> >  > that are requested, but it doesn't track each individual lock
> range.
> >  >
> >  > Under normal circumstances this works fine. But there are some
> cases
> >  > where this behavior is not sufficient. For example, suppose we
> > have a
> >  > replica 3 volume with quorum = 2. Given the special nature of
> posix
> >  > locks, AFR sends the lock request sequentially to each one of the
> >  > bricks, to avoid that conflicting lock requests from other
> > clients could
> >  > require to unlock an already locked region on the client that has
> > not
> >  > got enough successful locks (i.e. quorum). An unlock here not
> > only would
> >  > cancel the current lock request. It would also cancel any
> previously
> >  > acquired lock.
> >  >
> >
> > I may not have fully understood, please correct me. AFAIU, lk xlator
> > merges locks only if both the lk-owner and the client opaque matches.
> >
> > In the case which you have mentioned above, considering clientA
> > acquired
> > locks on majority of quorum (say nodeA and nodeB) and clientB on
> nodeC
> > alone- clientB now has to unlock/cancel the lock it acquired on
> nodeC.
> >
> > You are saying the it could pose a problem if there were already
> > successful locks taken by clientB for the same region which would get
> > unlocked by this particular unlock request..right?
> >
> > Assuming the previous locks acquired by clientB are shared (otherwise
> > clientA wouldn't have got granted lock for the same region on nodeA &
> > nodeB), they would still hold true on nodeA & nodeB  as the unlock
> > request was sent to only nodeC. Since the majority of quorum nodes
> > still
> > hold the locks by clientB, this isn't serious issue IMO.
> >
> > I haven't looked into heal part but would like to understand if this
> is
> > really an issue in normal scenarios as well.
> >
> >
> > This is how I understood the code. Consider the following case:
> > Nodes A, B, C have locks with start and end offsets: 5-15 from mount-1
> > and lock-range 2-3 from mount-2.
> > If mount-1 requests nonblocking lock with lock-range 1-7 and in parallel
> > lets say mount-2 issued unlock of 2-3 as well.
> >
> > nodeA got unlock from mount-2 with range 2-3 then lock from mount-1 with
> > range 1-7, so the lock is granted and merged to give 1-15
> > nodeB got lock from mount-1 with range 1-7 before unlock of 2-3 which
> > leads to EAGAIN which will trigger unlocks on granted lock in mount-1
> > which will end up doing unlock of 1-7 on nodeA leading to lock-range
> > 8-15 instead of the original 5-15 on nodeA. Whereas nodeB and nodeC will
> > have range 5-15.
> >
> > Let me know if my understanding is wrong.
>
> Both of us mentioned the same points. So in the example you gave ,
> mount-1 lost its previous lock on nodeA but majority of the quorum
> (nodeB and nodeC) still have the previous lock  (range: 5-15) intact. So
> this shouldn't ideally lead to any issues as other conflicting locks are
> blocked or failed by majority of the nodes (provided there are no brick
> dis/re-connects).
>

But brick disconnects will happen (upgrades, disk failures, server
maintenance, ...). Anyway, even without brick disconnects, in the previous
example we have nodeA with range 8-15, and nodes B and C with range 5-15.
If another lock from mount-2 comes for range 5-7, it will succeed on nodeA,
but it will block on nodeB. At this point, mount-1 could attempt a lock on
same range. It will block on nodeA, so we have a deadlock.

In general, having discrepancies between bricks is not good because sooner
or later it will cause some bad inconsistency.


> Wrt to brick disconnects/re-connects, if we can get in general lock
> 

Re: [Gluster-devel] Issue with posix locks

2019-04-01 Thread Soumya Koduri



On 4/1/19 10:02 AM, Pranith Kumar Karampuri wrote:



On Sun, Mar 31, 2019 at 11:29 PM Soumya Koduri > wrote:




On 3/29/19 11:55 PM, Xavi Hernandez wrote:
 > Hi all,
 >
 > there is one potential problem with posix locks when used in a
 > replicated or dispersed volume.
 >
 > Some background:
 >
 > Posix locks allow any process to lock a region of a file multiple
times,
 > but a single unlock on a given region will release all previous
locks.
 > Locked regions can be different for each lock request and they can
 > overlap. The resulting lock will cover the union of all locked
regions.
 > A single unlock (the region doesn't necessarily need to match any
of the
 > ranges used for locking) will create a "hole" in the currently
locked
 > region, independently of how many times a lock request covered
that region.
 >
 > For this reason, the locks xlator simply combines the locked regions
 > that are requested, but it doesn't track each individual lock range.
 >
 > Under normal circumstances this works fine. But there are some cases
 > where this behavior is not sufficient. For example, suppose we
have a
 > replica 3 volume with quorum = 2. Given the special nature of posix
 > locks, AFR sends the lock request sequentially to each one of the
 > bricks, to avoid that conflicting lock requests from other
clients could
 > require to unlock an already locked region on the client that has
not
 > got enough successful locks (i.e. quorum). An unlock here not
only would
 > cancel the current lock request. It would also cancel any previously
 > acquired lock.
 >

I may not have fully understood, please correct me. AFAIU, lk xlator
merges locks only if both the lk-owner and the client opaque matches.

In the case which you have mentioned above, considering clientA
acquired
locks on majority of quorum (say nodeA and nodeB) and clientB on nodeC
alone- clientB now has to unlock/cancel the lock it acquired on nodeC.

You are saying the it could pose a problem if there were already
successful locks taken by clientB for the same region which would get
unlocked by this particular unlock request..right?

Assuming the previous locks acquired by clientB are shared (otherwise
clientA wouldn't have got granted lock for the same region on nodeA &
nodeB), they would still hold true on nodeA & nodeB  as the unlock
request was sent to only nodeC. Since the majority of quorum nodes
still
hold the locks by clientB, this isn't serious issue IMO.

I haven't looked into heal part but would like to understand if this is
really an issue in normal scenarios as well.


This is how I understood the code. Consider the following case:
Nodes A, B, C have locks with start and end offsets: 5-15 from mount-1 
and lock-range 2-3 from mount-2.
If mount-1 requests nonblocking lock with lock-range 1-7 and in parallel 
lets say mount-2 issued unlock of 2-3 as well.


nodeA got unlock from mount-2 with range 2-3 then lock from mount-1 with 
range 1-7, so the lock is granted and merged to give 1-15
nodeB got lock from mount-1 with range 1-7 before unlock of 2-3 which 
leads to EAGAIN which will trigger unlocks on granted lock in mount-1 
which will end up doing unlock of 1-7 on nodeA leading to lock-range 
8-15 instead of the original 5-15 on nodeA. Whereas nodeB and nodeC will 
have range 5-15.


Let me know if my understanding is wrong.


Both of us mentioned the same points. So in the example you gave , 
mount-1 lost its previous lock on nodeA but majority of the quorum 
(nodeB and nodeC) still have the previous lock  (range: 5-15) intact. So 
this shouldn't ideally lead to any issues as other conflicting locks are 
blocked or failed by majority of the nodes (provided there are no brick 
dis/re-connects).


Wrt to brick disconnects/re-connects, if we can get in general lock 
healing (not getting into implementation details atm) support, that 
should take care of correcting lock range on nodeA as well right?


That said I am not suggesting that we should stick to existing behavior, 
just trying to get clarification to check if we can avoid any 
overhead/side-effects with maintaining multiple locks.


Thanks,
Soumya





Thanks,
Soumya

 > However, when something goes wrong (a brick dies during a lock
request,
 > or there's a network partition or some other weird situation), it
could
 > happen that even using sequential locking, only one brick
succeeds the
 > lock request. In this case, AFR should cancel the previous lock
(and it
 > does), but this also cancels any previously acquired lock on that
 > region, which is not good.
 >
 > A similar thing can happen if we try to recover (heal) posix
locks that
 > were active after a brick has been 

Re: [Gluster-devel] Issue with posix locks

2019-04-01 Thread Xavi Hernandez
On Sun, Mar 31, 2019 at 7:59 PM Soumya Koduri  wrote:

>
>
> On 3/29/19 11:55 PM, Xavi Hernandez wrote:
> > Hi all,
> >
> > there is one potential problem with posix locks when used in a
> > replicated or dispersed volume.
> >
> > Some background:
> >
> > Posix locks allow any process to lock a region of a file multiple times,
> > but a single unlock on a given region will release all previous locks.
> > Locked regions can be different for each lock request and they can
> > overlap. The resulting lock will cover the union of all locked regions.
> > A single unlock (the region doesn't necessarily need to match any of the
> > ranges used for locking) will create a "hole" in the currently locked
> > region, independently of how many times a lock request covered that
> region.
> >
> > For this reason, the locks xlator simply combines the locked regions
> > that are requested, but it doesn't track each individual lock range.
> >
> > Under normal circumstances this works fine. But there are some cases
> > where this behavior is not sufficient. For example, suppose we have a
> > replica 3 volume with quorum = 2. Given the special nature of posix
> > locks, AFR sends the lock request sequentially to each one of the
> > bricks, to avoid that conflicting lock requests from other clients could
> > require to unlock an already locked region on the client that has not
> > got enough successful locks (i.e. quorum). An unlock here not only would
> > cancel the current lock request. It would also cancel any previously
> > acquired lock.
> >
>
> I may not have fully understood, please correct me. AFAIU, lk xlator
> merges locks only if both the lk-owner and the client opaque matches.
>
> In the case which you have mentioned above, considering clientA acquired
> locks on majority of quorum (say nodeA and nodeB) and clientB on nodeC
> alone- clientB now has to unlock/cancel the lock it acquired on nodeC.
>
> You are saying the it could pose a problem if there were already
> successful locks taken by clientB for the same region which would get
> unlocked by this particular unlock request..right?
>

Yes


>
> Assuming the previous locks acquired by clientB are shared (otherwise
> clientA wouldn't have got granted lock for the same region on nodeA &
> nodeB), they would still hold true on nodeA & nodeB  as the unlock
> request was sent to only nodeC. Since the majority of quorum nodes still
> hold the locks by clientB, this isn't serious issue IMO.
>

Partially true. But if one of nodeA or nodeB dies or gets disconnected,
there won't be any majority of bricks with correct locks, even though there
are still 2 alive bricks. At this point, another client could successfully
acquire a lock that, in theory, is already acquired by another client.


> I haven't looked into heal part but would like to understand if this is
> really an issue in normal scenarios as well.
>

If we consider that a brick disconnection is a normal scenario (which I
think it should be on a large scale distributed file system), then this
issue exists. But even without brick disconnections we can get incorrect
results, as Pranith has just explained.

Xavi


>
> Thanks,
> Soumya
>
> > However, when something goes wrong (a brick dies during a lock request,
> > or there's a network partition or some other weird situation), it could
> > happen that even using sequential locking, only one brick succeeds the
> > lock request. In this case, AFR should cancel the previous lock (and it
> > does), but this also cancels any previously acquired lock on that
> > region, which is not good.
> >
> > A similar thing can happen if we try to recover (heal) posix locks that
> > were active after a brick has been disconnected (for any reason) and
> > then reconnected.
> >
> > To fix all these situations we need to change the way posix locks are
> > managed by locks xlator. One possibility would be to embed the lock
> > request inside an inode transaction using inodelk. Since inodelks do not
> > suffer this problem, the follwing posix lock could be sent safely.
> > However this implies an additional network request, which could cause
> > some performance impact. Eager-locking could minimize the impact in some
> > cases. However this approach won't work for lock recovery after a
> > disconnect.
> >
> > Another possibility is to send a special partial posix lock request
> > which won't be immediately merged with already existing locks once
> > granted. An additional confirmation request of the partial posix lock
> > will be required to fully grant the current lock and merge it with the
> > existing ones. This requires a new network request, which will add
> > latency, and makes everything more complex since there would be more
> > combinations of states in which something could fail.
> >
> > So I think one possible solution would be the following:
> >
> > 1. Keep each posix lock as an independent object in locks xlator. This
> > will make it possible to "invalidate" any already 

Re: [Gluster-devel] Issue with posix locks

2019-03-31 Thread Pranith Kumar Karampuri
On Sun, Mar 31, 2019 at 11:29 PM Soumya Koduri  wrote:

>
>
> On 3/29/19 11:55 PM, Xavi Hernandez wrote:
> > Hi all,
> >
> > there is one potential problem with posix locks when used in a
> > replicated or dispersed volume.
> >
> > Some background:
> >
> > Posix locks allow any process to lock a region of a file multiple times,
> > but a single unlock on a given region will release all previous locks.
> > Locked regions can be different for each lock request and they can
> > overlap. The resulting lock will cover the union of all locked regions.
> > A single unlock (the region doesn't necessarily need to match any of the
> > ranges used for locking) will create a "hole" in the currently locked
> > region, independently of how many times a lock request covered that
> region.
> >
> > For this reason, the locks xlator simply combines the locked regions
> > that are requested, but it doesn't track each individual lock range.
> >
> > Under normal circumstances this works fine. But there are some cases
> > where this behavior is not sufficient. For example, suppose we have a
> > replica 3 volume with quorum = 2. Given the special nature of posix
> > locks, AFR sends the lock request sequentially to each one of the
> > bricks, to avoid that conflicting lock requests from other clients could
> > require to unlock an already locked region on the client that has not
> > got enough successful locks (i.e. quorum). An unlock here not only would
> > cancel the current lock request. It would also cancel any previously
> > acquired lock.
> >
>
> I may not have fully understood, please correct me. AFAIU, lk xlator
> merges locks only if both the lk-owner and the client opaque matches.
>
> In the case which you have mentioned above, considering clientA acquired
> locks on majority of quorum (say nodeA and nodeB) and clientB on nodeC
> alone- clientB now has to unlock/cancel the lock it acquired on nodeC.
>
> You are saying the it could pose a problem if there were already
> successful locks taken by clientB for the same region which would get
> unlocked by this particular unlock request..right?
>
> Assuming the previous locks acquired by clientB are shared (otherwise
> clientA wouldn't have got granted lock for the same region on nodeA &
> nodeB), they would still hold true on nodeA & nodeB  as the unlock
> request was sent to only nodeC. Since the majority of quorum nodes still
> hold the locks by clientB, this isn't serious issue IMO.
>
> I haven't looked into heal part but would like to understand if this is
> really an issue in normal scenarios as well.
>

This is how I understood the code. Consider the following case:
Nodes A, B, C have locks with start and end offsets: 5-15 from mount-1 and
lock-range 2-3 from mount-2.
If mount-1 requests nonblocking lock with lock-range 1-7 and in parallel
lets say mount-2 issued unlock of 2-3 as well.

nodeA got unlock from mount-2 with range 2-3 then lock from mount-1 with
range 1-7, so the lock is granted and merged to give 1-15
nodeB got lock from mount-1 with range 1-7 before unlock of 2-3 which leads
to EAGAIN which will trigger unlocks on granted lock in mount-1 which will
end up doing unlock of 1-7 on nodeA leading to lock-range 8-15 instead of
the original 5-15 on nodeA. Whereas nodeB and nodeC will have range 5-15.

Let me know if my understanding is wrong.


> Thanks,
> Soumya
>
> > However, when something goes wrong (a brick dies during a lock request,
> > or there's a network partition or some other weird situation), it could
> > happen that even using sequential locking, only one brick succeeds the
> > lock request. In this case, AFR should cancel the previous lock (and it
> > does), but this also cancels any previously acquired lock on that
> > region, which is not good.
> >
> > A similar thing can happen if we try to recover (heal) posix locks that
> > were active after a brick has been disconnected (for any reason) and
> > then reconnected.
> >
> > To fix all these situations we need to change the way posix locks are
> > managed by locks xlator. One possibility would be to embed the lock
> > request inside an inode transaction using inodelk. Since inodelks do not
> > suffer this problem, the follwing posix lock could be sent safely.
> > However this implies an additional network request, which could cause
> > some performance impact. Eager-locking could minimize the impact in some
> > cases. However this approach won't work for lock recovery after a
> > disconnect.
> >
> > Another possibility is to send a special partial posix lock request
> > which won't be immediately merged with already existing locks once
> > granted. An additional confirmation request of the partial posix lock
> > will be required to fully grant the current lock and merge it with the
> > existing ones. This requires a new network request, which will add
> > latency, and makes everything more complex since there would be more
> > combinations of states in which something could fail.
> >
> > So I 

Re: [Gluster-devel] Issue with posix locks

2019-03-31 Thread Soumya Koduri




On 3/29/19 11:55 PM, Xavi Hernandez wrote:

Hi all,

there is one potential problem with posix locks when used in a 
replicated or dispersed volume.


Some background:

Posix locks allow any process to lock a region of a file multiple times, 
but a single unlock on a given region will release all previous locks. 
Locked regions can be different for each lock request and they can 
overlap. The resulting lock will cover the union of all locked regions. 
A single unlock (the region doesn't necessarily need to match any of the 
ranges used for locking) will create a "hole" in the currently locked 
region, independently of how many times a lock request covered that region.


For this reason, the locks xlator simply combines the locked regions 
that are requested, but it doesn't track each individual lock range.


Under normal circumstances this works fine. But there are some cases 
where this behavior is not sufficient. For example, suppose we have a 
replica 3 volume with quorum = 2. Given the special nature of posix 
locks, AFR sends the lock request sequentially to each one of the 
bricks, to avoid that conflicting lock requests from other clients could 
require to unlock an already locked region on the client that has not 
got enough successful locks (i.e. quorum). An unlock here not only would 
cancel the current lock request. It would also cancel any previously 
acquired lock.




I may not have fully understood, please correct me. AFAIU, lk xlator 
merges locks only if both the lk-owner and the client opaque matches.


In the case which you have mentioned above, considering clientA acquired 
locks on majority of quorum (say nodeA and nodeB) and clientB on nodeC 
alone- clientB now has to unlock/cancel the lock it acquired on nodeC.


You are saying the it could pose a problem if there were already 
successful locks taken by clientB for the same region which would get 
unlocked by this particular unlock request..right?


Assuming the previous locks acquired by clientB are shared (otherwise 
clientA wouldn't have got granted lock for the same region on nodeA & 
nodeB), they would still hold true on nodeA & nodeB  as the unlock 
request was sent to only nodeC. Since the majority of quorum nodes still 
hold the locks by clientB, this isn't serious issue IMO.


I haven't looked into heal part but would like to understand if this is 
really an issue in normal scenarios as well.


Thanks,
Soumya

However, when something goes wrong (a brick dies during a lock request, 
or there's a network partition or some other weird situation), it could 
happen that even using sequential locking, only one brick succeeds the 
lock request. In this case, AFR should cancel the previous lock (and it 
does), but this also cancels any previously acquired lock on that 
region, which is not good.


A similar thing can happen if we try to recover (heal) posix locks that 
were active after a brick has been disconnected (for any reason) and 
then reconnected.


To fix all these situations we need to change the way posix locks are 
managed by locks xlator. One possibility would be to embed the lock 
request inside an inode transaction using inodelk. Since inodelks do not 
suffer this problem, the follwing posix lock could be sent safely. 
However this implies an additional network request, which could cause 
some performance impact. Eager-locking could minimize the impact in some 
cases. However this approach won't work for lock recovery after a 
disconnect.


Another possibility is to send a special partial posix lock request 
which won't be immediately merged with already existing locks once 
granted. An additional confirmation request of the partial posix lock 
will be required to fully grant the current lock and merge it with the 
existing ones. This requires a new network request, which will add 
latency, and makes everything more complex since there would be more 
combinations of states in which something could fail.


So I think one possible solution would be the following:

1. Keep each posix lock as an independent object in locks xlator. This 
will make it possible to "invalidate" any already granted lock without 
affecting already established locks.


2. Additionally, we'll keep a sorted list of non-overlapping segments of 
locked regions. And we'll count, for each region, how many locks are 
referencing it. One lock can reference multiple segments, and each 
segment can be referenced by multiple locks.


3. An additional lock request that overlaps with an existing segment, 
can cause this segment to be split to satisfy the non-overlapping property.


4. When an unlock request is received, all segments intersecting with 
the region are eliminated (it may require some segment splits on the 
edges), and the unlocked region is subtracted from each lock associated 
to the segment. If a lock gets an empty region, it's removed.


5. We'll create a special "remove lock" request that doesn't unlock a 
region but removes an already 

[Gluster-devel] Issue with posix locks

2019-03-29 Thread Xavi Hernandez
Hi all,

there is one potential problem with posix locks when used in a replicated
or dispersed volume.

Some background:

Posix locks allow any process to lock a region of a file multiple times,
but a single unlock on a given region will release all previous locks.
Locked regions can be different for each lock request and they can overlap.
The resulting lock will cover the union of all locked regions. A single
unlock (the region doesn't necessarily need to match any of the ranges used
for locking) will create a "hole" in the currently locked region,
independently of how many times a lock request covered that region.

For this reason, the locks xlator simply combines the locked regions that
are requested, but it doesn't track each individual lock range.

Under normal circumstances this works fine. But there are some cases where
this behavior is not sufficient. For example, suppose we have a replica 3
volume with quorum = 2. Given the special nature of posix locks, AFR sends
the lock request sequentially to each one of the bricks, to avoid that
conflicting lock requests from other clients could require to unlock an
already locked region on the client that has not got enough successful
locks (i.e. quorum). An unlock here not only would cancel the current lock
request. It would also cancel any previously acquired lock.

However, when something goes wrong (a brick dies during a lock request, or
there's a network partition or some other weird situation), it could happen
that even using sequential locking, only one brick succeeds the lock
request. In this case, AFR should cancel the previous lock (and it does),
but this also cancels any previously acquired lock on that region, which is
not good.

A similar thing can happen if we try to recover (heal) posix locks that
were active after a brick has been disconnected (for any reason) and then
reconnected.

To fix all these situations we need to change the way posix locks are
managed by locks xlator. One possibility would be to embed the lock request
inside an inode transaction using inodelk. Since inodelks do not suffer
this problem, the follwing posix lock could be sent safely. However this
implies an additional network request, which could cause some performance
impact. Eager-locking could minimize the impact in some cases. However this
approach won't work for lock recovery after a disconnect.

Another possibility is to send a special partial posix lock request which
won't be immediately merged with already existing locks once granted. An
additional confirmation request of the partial posix lock will be required
to fully grant the current lock and merge it with the existing ones. This
requires a new network request, which will add latency, and makes
everything more complex since there would be more combinations of states in
which something could fail.

So I think one possible solution would be the following:

1. Keep each posix lock as an independent object in locks xlator. This will
make it possible to "invalidate" any already granted lock without affecting
already established locks.

2. Additionally, we'll keep a sorted list of non-overlapping segments of
locked regions. And we'll count, for each region, how many locks are
referencing it. One lock can reference multiple segments, and each segment
can be referenced by multiple locks.

3. An additional lock request that overlaps with an existing segment, can
cause this segment to be split to satisfy the non-overlapping property.

4. When an unlock request is received, all segments intersecting with the
region are eliminated (it may require some segment splits on the edges),
and the unlocked region is subtracted from each lock associated to the
segment. If a lock gets an empty region, it's removed.

5. We'll create a special "remove lock" request that doesn't unlock a
region but removes an already granted lock. This will decrease the number
of references to each of the segments this lock was covering. If some
segment reaches 0, it's removed. Otherwise it remains there. This special
request will only be used internally to cancel already acquired locks that
cannot be fully granted due to quorum issues or any other problem.

In some weird cases, the list of segments can be huge (many locks
overlapping only on a single byte, so each segment represents only one
byte). We can try to find some smarter structure that minimizes this
problem or limit the number of segments (for example returning ENOLCK when
there are too many).

What do you think ?

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel