On Sun, Mar 31, 2019 at 11:29 PM Soumya Koduri <skod...@redhat.com> wrote:
> On 3/29/19 11:55 PM, Xavi Hernandez wrote:
> > Hi all,
> >
> > there is one potential problem with posix locks when used in a
> > replicated or dispersed volume.
> >
> > Some background:
> >
> > Posix locks allow any process to lock a region of a file multiple
> > times, but a single unlock on a given region will release all
> > previous locks. Locked regions can be different for each lock request
> > and they can overlap. The resulting lock will cover the union of all
> > locked regions. A single unlock (the region doesn't necessarily need
> > to match any of the ranges used for locking) will create a "hole" in
> > the currently locked region, independently of how many times a lock
> > request covered that region.
> >
> > For this reason, the locks xlator simply combines the locked regions
> > that are requested, but it doesn't track each individual lock range.
> >
> > Under normal circumstances this works fine. But there are some cases
> > where this behavior is not sufficient. For example, suppose we have a
> > replica 3 volume with quorum = 2. Given the special nature of posix
> > locks, AFR sends the lock request sequentially to each one of the
> > bricks, to avoid a situation where conflicting lock requests from
> > other clients force an unlock of an already locked region on the
> > client that has not got enough successful locks (i.e. quorum). An
> > unlock here would not only cancel the current lock request; it would
> > also cancel any previously acquired lock.
>
> I may not have fully understood, please correct me. AFAIU, the lk
> xlator merges locks only if both the lk-owner and the client opaque
> match.
>
> In the case you have mentioned above, considering clientA acquired
> locks on a majority of the quorum (say nodeA and nodeB) and clientB on
> nodeC alone, clientB now has to unlock/cancel the lock it acquired on
> nodeC.
>
> You are saying that it could pose a problem if there were already
> successful locks taken by clientB for the same region, which would get
> unlocked by this particular unlock request, right?
>
> Assuming the previous locks acquired by clientB are shared (otherwise
> clientA wouldn't have been granted a lock for the same region on nodeA
> & nodeB), they would still hold true on nodeA & nodeB, as the unlock
> request was sent only to nodeC. Since the majority of quorum nodes
> still hold clientB's locks, this isn't a serious issue IMO.
>
> I haven't looked into the heal part, but I would like to understand if
> this is really an issue in normal scenarios as well.

This is how I understood the code. Consider the following case:

Nodes A, B and C have locks with start and end offsets 5-15 from
mount-1 and lock-range 2-3 from mount-2. Now mount-1 requests a
non-blocking lock with lock-range 1-7, and in parallel mount-2 issues
an unlock of 2-3.

nodeA gets the unlock from mount-2 with range 2-3 first, then the lock
from mount-1 with range 1-7, so the lock is granted and merged to give
1-15.

nodeB gets the lock from mount-1 with range 1-7 before the unlock of
2-3, which leads to EAGAIN. That triggers unlocks of the locks mount-1
already got granted, which ends up doing an unlock of 1-7 on nodeA,
leaving lock-range 8-15 instead of the original 5-15 on nodeA, whereas
nodeB and nodeC will have range 5-15.

Let me know if my understanding is wrong.
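To make the sequence concrete, here is a small Python simulation of the
merge/unlock semantics described above, tracking only mount-1's stored
region on nodeA (a sketch of the behavior, not actual gluster code):

    # Toy model: a brick stores only the merged union of granted
    # regions for one owner, not the individual lock requests.

    def lock(ranges, start, end):
        """Grant a lock: absorb every overlapping/adjacent region."""
        s, e, kept = start, end, []
        for a, b in sorted(ranges):
            if b + 1 < s or a > e + 1:      # fully disjoint region
                kept.append((a, b))
            else:                           # overlaps/touches: merge
                s, e = min(s, a), max(e, b)
        return sorted(kept + [(s, e)])

    def unlock(ranges, start, end):
        """Unlock carves a hole, however many locks covered it."""
        out = []
        for a, b in ranges:
            if a < start:                   # piece left of the hole
                out.append((a, min(b, start - 1)))
            if b > end:                     # piece right of the hole
                out.append((max(a, end + 1), b))
        return out

    node_a = [(5, 15)]                # mount-1's original lock
    node_a = lock(node_a, 1, 7)       # granted on nodeA -> [(1, 15)]
    node_a = unlock(node_a, 1, 7)     # rollback after EAGAIN on nodeB
    print(node_a)                     # [(8, 15)] -- 5-7 was lost

nodeB and nodeC never applied the 1-7 lock, so they keep 5-15 and the
three bricks now disagree.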
> Thanks,
> Soumya
>
> > However, when something goes wrong (a brick dies during a lock
> > request, or there's a network partition or some other weird
> > situation), it could happen that, even using sequential locking, only
> > one brick succeeds the lock request. In this case, AFR should cancel
> > the previous lock (and it does), but this also cancels any previously
> > acquired lock on that region, which is not good.
> >
> > A similar thing can happen if we try to recover (heal) posix locks
> > that were active after a brick has been disconnected (for any reason)
> > and then reconnected.
> >
> > To fix all these situations we need to change the way posix locks are
> > managed by the locks xlator. One possibility would be to embed the
> > lock request inside an inode transaction using inodelk. Since
> > inodelks do not suffer this problem, the following posix lock could
> > be sent safely. However this implies an additional network request,
> > which could cause some performance impact. Eager-locking could
> > minimize the impact in some cases. However this approach won't work
> > for lock recovery after a disconnect.
> >
> > Another possibility is to send a special partial posix lock request
> > which won't be immediately merged with already existing locks once
> > granted. An additional confirmation request of the partial posix lock
> > will be required to fully grant the current lock and merge it with
> > the existing ones. This requires a new network request, which will
> > add latency, and makes everything more complex since there would be
> > more combinations of states in which something could fail.
> >
> > So I think one possible solution would be the following:
> >
> > 1. Keep each posix lock as an independent object in the locks xlator.
> > This will make it possible to "invalidate" any already granted lock
> > without affecting already established locks.
> >
> > 2. Additionally, we'll keep a sorted list of non-overlapping segments
> > of locked regions. And we'll count, for each region, how many locks
> > are referencing it. One lock can reference multiple segments, and
> > each segment can be referenced by multiple locks.
> >
> > 3. An additional lock request that overlaps with an existing segment
> > can cause this segment to be split to satisfy the non-overlapping
> > property.
> >
> > 4. When an unlock request is received, all segments intersecting with
> > the region are eliminated (it may require some segment splits on the
> > edges), and the unlocked region is subtracted from each lock
> > associated to the segment. If a lock gets an empty region, it's
> > removed.
> >
> > 5. We'll create a special "remove lock" request that doesn't unlock a
> > region but removes an already granted lock. This will decrease the
> > number of references to each of the segments this lock was covering.
> > If some segment reaches 0, it's removed. Otherwise it remains there.
> > This special request will only be used internally to cancel already
> > acquired locks that cannot be fully granted due to quorum issues or
> > any other problem.
> >
> > In some weird cases, the list of segments can be huge (many locks
> > overlapping only on a single byte, so each segment represents only
> > one byte). We can try to find some smarter structure that minimizes
> > this problem or limit the number of segments (for example returning
> > ENOLCK when there are too many).
> >
> > What do you think ?
> >
> > Xavi
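For illustration, here is a minimal Python sketch of how points 1-5
could behave. Everything below (names, structure) is invented for the
example and is not the locks xlator's actual code or API:

    # Sorted, non-overlapping segments; each carries the set of lock
    # ids that reference it (points 1 and 2).

    class SegmentLocks:
        def __init__(self):
            self.segments = []   # [start, end, {lock ids}]
            self.next_id = 1

        def _split(self, start, end):
            # Point 3: cut segments so [start, end] aligns with edges.
            out = []
            for a, b, refs in self.segments:
                pts = sorted({a} | {p for p in (start, end + 1)
                                    if a < p <= b})
                pts.append(b + 1)
                for i in range(len(pts) - 1):
                    out.append([pts[i], pts[i + 1] - 1, set(refs)])
            self.segments = out

        def lock(self, start, end):
            # A new independent lock referencing every segment
            # (existing or newly created) inside its range.
            lk, self.next_id = self.next_id, self.next_id + 1
            self._split(start, end)
            pos, new_segs = start, []
            for a, b, refs in sorted(self.segments):
                if b < start or a > end:
                    continue
                if a > pos:                    # uncovered gap
                    new_segs.append([pos, a - 1, {lk}])
                refs.add(lk)
                pos = b + 1
            if pos <= end:
                new_segs.append([pos, end, {lk}])
            self.segments = sorted(self.segments + new_segs)
            return lk

        def unlock(self, start, end):
            # Point 4: drop every segment inside the region,
            # regardless of its reference count.
            self._split(start, end)
            self.segments = [s for s in self.segments
                             if s[1] < start or s[0] > end]

        def remove_lock(self, lk):
            # Point 5: cancel one granted lock without punching a
            # hole in regions other locks still cover.
            for s in self.segments:
                s[2].discard(lk)
            self.segments = [s for s in self.segments if s[2]]

Replaying the nodeA sequence from the earlier example with this scheme:

    st = SegmentLocks()
    st.lock(5, 15)         # mount-1's pre-existing lock
    lk = st.lock(1, 7)     # granted on nodeA, fails quorum on nodeB
    st.remove_lock(lk)     # rollback via "remove lock", not unlock
    print(st.segments)     # [[5, 7, {1}], [8, 15, {1}]] -> 5-15 intact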
--
Pranith