On Tue, Feb 19, 2013 at 3:59 AM, Pranith Kumar K <pkara...@redhat.com> wrote:
> On 02/19/2013 11:26 AM, Anand Avati wrote:
>
> Thinking this over, it looks like there is a problem!
>
> Write-behind guarantees: a second write request arriving after the acknowledgement of a first, overlapping request (whether written-behind or otherwise) is guaranteed to be fulfilled in the backend in the same order (i.e., the second overlapping request is "serialized" behind the first one in the fulfillment process).
>
> Eager-lock requirement: write-behind never sends two write requests on an overlapping region at the same time.
>
> The requirement-set and the guarantee-set overlap heavily, but the requirement-set is not a subset of the guarantee-set.
>
> This is because of O_SYNC writes. Write-behind serializes writes at fulfillment only for written-behind requests (which are covered by the conflict-detection code during liability fulfillment). However, if two threads (or applications) issue overlapping O_SYNC writes to the same region at approximately the same time, write-behind lets both of them pass into eager-lock without any kind of serialization, violating eager-lock's assumption!
>
> I'm wondering whether it is safer to implement overlap checks within the eager-lock code itself rather than depend on write-behind :|
>
> Avati
>
> On Mon, Feb 11, 2013 at 10:07 PM, Anand Avati <anand.av...@gmail.com> wrote:
>
>>
>> On Mon, Feb 11, 2013 at 9:32 PM, Pranith Kumar K <pkara...@redhat.com> wrote:
>>
>>> hi,
>>> Please note that this is a theoretical case and I have not run into such a situation, but I feel it is important to address it.
>>> A configuration with "eager-lock on" and "write-behind off" should not be allowed, as it leads to lock-synchronization problems that cause data inconsistency among replicas over NFS.
>>> Let's say bricks b1 and b2 are in replication.
>>> The Gluster NFS server uses one anonymous fd to perform all write-fops.
>>> If eager-lock is enabled in AFR, the fd's address is used as the lock-owner, which will be the same for all write-fops, so there will never be any inodelk contention. If write-behind is disabled, there can be writes that overlap. (Does NFS make sure that the ranges don't overlap?)
>>>
>>> Now imagine the following scenario:
>>> Let's say w1 and w2 are two write fops on the same offset and length, w1 with all '0's and w2 with all '1's. If these two write fops are executed in two different threads, the order of arrival on b1 can be w1, w2 whereas on b2 it is w2, w1, leading to data inconsistency between the two replicas. No lock contention happens, because both the lk-owner and the transport are the same for these two fops.
>>>
>>
>> Write-behind has two functions: a) performing operations in the background and b) serializing overlapping operations.
>>
>> While the problem does exist, the specifics are different from what you describe. Since all writes coming in from NFS always use the same anonymous FD, two near-in-time/overlapping writes will never contend on inodelk(); instead, the second write will inherit the lock and changelog from the first. Either way, it is a problem.
>>
>>
>>> We can add a check in glusterd, at volume set, to disallow such a configuration, BUT by default write-behind is off in the NFS graph and eager-lock is on. So we should either turn on write-behind for NFS or turn off eager-lock by default.
>>>
>>> Could you please suggest how to proceed, if you agree that I have not missed an important detail that invalidates this theory?
>>>
>>
>> Loading the write-behind xlator in the NFS graph looks like the simpler solution. Eager-locking is crucial for replicated NFS write performance.
>>
>> Avati
>>
>
> Shall we disable eager-lock for files opened with O_SYNC, for now?
>

Bad news: the problem is slightly worse than just this.
Even with non-O_SYNC writes, there is a window in write-behind: if a second overlapping write request arrives so close to the first that wb_enqueue() of the second happens after wb_enqueue() of the first write but before any unwind() following the first wb_enqueue() (i.e., wb_inode->gen has not yet been bumped), the two write requests can be wound down to eager-lock together.

Avati
_______________________________________________
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel