On Wed, Sep 14, 2016 at 12:08 PM, Xavier Hernandez <[email protected]> wrote:
> On 13/09/16 21:00, Pranith Kumar Karampuri wrote:
>>
>> On Tue, Sep 13, 2016 at 1:39 PM, Xavier Hernandez
>> <[email protected]> wrote:
>>
>>     Hi Sanoj,
>>
>>     On 13/09/16 09:41, Sanoj Unnikrishnan wrote:
>>
>>         Hi Xavi,
>>
>>         That explains a lot.
>>         I see a couple of other scenarios which can lead to a similar
>>         inconsistency:
>>         1) simultaneous node/brick crash of 3 bricks.
>>
>>     Although this is a real problem, the 3 bricks would need to crash at
>>     exactly the same moment, just after having successfully locked the
>>     inode being modified and queried some information, but before
>>     sending the write fop or any down notification. The probability of
>>     suffering this problem is really small.
>>
>>         2) if the disk space of the underlying filesystem on which the
>>         brick is hosted is exhausted on 3 bricks.
>>
>>     Yes. This is the same cause that makes quota fail.
>>
>>         I don't think we can address all these scenarios unless we have
>>         a log/journal mechanism like raid-5.
>>
>>     I completely agree. I don't see any solution valid for all cases.
>>     BTW, RAID-5 *is not* a solution: it doesn't have any log/journal.
>>     Maybe something based on the fdl xlator would work.
>>
>>         Should we look at a quota-specific fix or let it get fixed
>>         whenever we introduce a log?
>>
>>     Not sure how to fix this in a way that doesn't seem too hacky.
>>
>>     One possibility is to request permission to write some data before
>>     actually writing it (specifying offset and size), and then be sure
>>     that the write will succeed if all bricks (or at least the minimum
>>     number of data bricks) have acknowledged the previous write
>>     permission request.
>>
>>     Another approach would be to queue writes in a server-side xlator
>>     until a commit message is received, but sending back an answer
>>     saying whether there's enough space to do the write (this is, in
>>     some way, a very primitive log/journal approach).
>>
>>     However, both approaches will have a big performance impact if they
>>     cannot be executed in the background.
>>
>>     Maybe it would be worth investing in fdl instead of trying to find
>>     a custom solution to this.
>>
>> There are some things we should do irrespective of this change:
>> 1) When the file is in a state where all 512 bytes of the fragment
>> represent data, we shouldn't increase the file size at all, which
>> discards the write without any problems, i.e. this case is recoverable.
>
> I don't understand this case. You mean a write at an offset smaller than
> the total size of the file that doesn't increase it? If that's the case,
> a sparse file could need to allocate new blocks even if the file size
> doesn't change.

I think you got it in the second point. Anyway, I was saying the same
thing: if the failed write doesn't touch the previous contents, we can
keep the size as is and not increment the version and size. Example: if
the file is 128KB to begin with and we append 4KB and the write fails,
then we keep the file size as 128KB.

>> 2) When we append data to a partially filled chunk and it fails on 3/6
>> bricks, the rest could be recovered by adjusting the file size to the
>> size represented by (previous block - 1) * k. We should probably
>> provide an option to do so?
>
> We could do that, but this only represents a single case that later will
> also be covered by the journal.
>
> In any case, the solution here would be to restore the previous file
> size instead of (previous block - 1) * k, since the latter can cause the
> file size to decrease. This works as long as we can assume that a failed
> write doesn't touch previous contents.
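To make the size arithmetic concrete: a minimal sketch (assuming a 4+2
layout with 512-byte fragments, so one stripe holds 2048 bytes; the file
sizes are hypothetical and this is not GlusterFS code):

/* sketch.c - illustration only, not GlusterFS code.
 * 4+2 disperse layout: each brick stores a 512-byte fragment per
 * stripe, so one stripe holds K * 512 = 2048 bytes of user data. */
#include <stdio.h>
#include <inttypes.h>
#include <stdint.h>

#define K           4
#define FRAG_SIZE   512
#define STRIPE_SIZE (K * FRAG_SIZE)

int main(void)
{
    uint64_t prev_size = 10000;                   /* size before the failed append */
    uint64_t partial   = prev_size / STRIPE_SIZE; /* stripe that was partially filled */
    uint64_t rounded   = partial * STRIPE_SIZE;   /* 8192: previous block boundary */

    /* Truncating to the block boundary drops 1808 valid bytes;
     * restoring prev_size drops none, provided the failed write did
     * not overwrite previous contents. */
    printf("previous size : %" PRIu64 "\n", prev_size);
    printf("rounded down  : %" PRIu64 " (loses %" PRIu64 " bytes)\n",
           rounded, prev_size - rounded);
    return 0;
}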
>> 3) Provide some utility/setfattr to perform recovery based on data
>> rather than versions, i.e. it needs to detect and tell which part of
>> the data is not recoverable and which is. Based on that, the user
>> should be able to recover.
>
> This is not possible with the current implementation, at least not in an
> efficient way. The only way to detect inconsistencies right now would be
> to create all possible combinations of k bricks and compute the decoded
> data for each of them, then check whether everything matches. If so, the
> block is healthy; otherwise there is at least one damaged fragment. Then
> it would be necessary to find a relation between the fragments used for
> each reconstruction and the obtained data to determine a probable
> candidate to be damaged.
>
> Anyway, if there are 3 bad fragments in a 4+2 configuration, it won't be
> possible to recover the data for that block. Of course, in this
> particular case, this would mean that we would be able to recover all of
> the file except the last written block.
>
> With syndrome decoding (not currently implemented) and using a new
> encoding matrix, this could be done in an efficient way.
>
> Xavi
>
>> What do you guys think?
>>
>>     Xavi
>>
>>         Thanks and Regards,
>>         Sanoj
>>
>>         ----- Original Message -----
>>         From: "Xavier Hernandez" <[email protected]>
>>         To: "Raghavendra Gowdappa" <[email protected]>,
>>         "Sanoj Unnikrishnan" <[email protected]>
>>         Cc: "Pranith Kumar Karampuri" <[email protected]>,
>>         "Ashish Pandey" <[email protected]>,
>>         "Gluster Devel" <[email protected]>
>>         Sent: Tuesday, September 13, 2016 11:50:27 AM
>>         Subject: Re: Need help with
>>         https://bugzilla.redhat.com/show_bug.cgi?id=1224180
>>
>>         Hi Sanoj,
>>
>>         I'm unable to see bug 1224180. Access is restricted.
>>
>>         Not sure what the problem is exactly, but I see that quota is
>>         involved. Currently disperse doesn't play well with quota when
>>         the limit is near.
>>
>>         The reason is that not all bricks fail at the same time with
>>         EDQUOT, due to small differences in computed space. This causes
>>         a valid write to succeed on some bricks and fail on others. If
>>         it fails simultaneously on more than 'redundancy' bricks but on
>>         fewer than the number of data bricks, there's no way to roll
>>         back the changes on the bricks that have succeeded, so the
>>         operation is inconsistent and an I/O error is returned.
>>
>>         For example, on a 6:2 configuration (4 data bricks and 2
>>         redundancy), if 3 bricks succeed and 3 fail, there are not
>>         enough bricks with the updated version, but there aren't enough
>>         bricks with the old version either.
>>
>>         If you force 2 bricks to be down, the problem can appear more
>>         frequently, as only a single failure causes this problem.
>>
>>         Xavi
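Reduced to its core, the failure in that example is a quorum check: a
block is decodable only if at least k bricks agree on one version. A
minimal sketch of the three possible outcomes (hypothetical helper, not
the actual ec-common.c logic):

/* Why a write that succeeds on 3 of 6 bricks in a 4+2 configuration
 * leaves the block undecodable. Illustration only. */
#include <stdio.h>

#define DATA_BRICKS 4   /* k: fragments needed to decode a block */

/* At least k bricks must agree on one version for the block to be
 * readable. */
static const char *classify(int with_new, int with_old)
{
    if (with_new >= DATA_BRICKS)
        return "ok: new data decodable, lagging bricks can be healed";
    if (with_old >= DATA_BRICKS)
        return "ok: write never took effect, heal back to old data";
    return "inconsistent: neither version decodable -> EIO";
}

int main(void)
{
    printf("4 new / 2 old: %s\n", classify(4, 2)); /* tolerable failure */
    printf("3 new / 3 old: %s\n", classify(3, 3)); /* the quota case    */
    return 0;
}

With 4+2, any split where neither side reaches 4 bricks, such as 3/3, is
unrecoverable, which is exactly the quota case described above.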
>>         On 13/09/16 06:09, Raghavendra Gowdappa wrote:
>>
>>             +gluster-devel
>>
>>             ----- Original Message -----
>>             From: "Sanoj Unnikrishnan" <[email protected]>
>>             To: "Pranith Kumar Karampuri" <[email protected]>,
>>             "Ashish Pandey" <[email protected]>,
>>             [email protected],
>>             "Raghavendra Gowdappa" <[email protected]>
>>             Sent: Monday, September 12, 2016 7:06:59 PM
>>             Subject: Need help with
>>             https://bugzilla.redhat.com/show_bug.cgi?id=1224180
>>
>>             Hello Xavi/Pranith,
>>
>>             I have been able to reproduce the BZ with the following
>>             steps:
>>
>>             gluster volume create v_disp disperse 6 redundancy 2
>>               $tm1:/export/sdb/br1 $tm2:/export/sdb/b2
>>               $tm3:/export/sdb/br3 $tm1:/export/sdb/b4
>>               $tm2:/export/sdb/b5 $tm3:/export/sdb/b6 force
>>             # (used only 3 nodes, should not matter here)
>>             gluster volume start v_disp
>>             mount -t glusterfs $tm1:v_disp /gluster_vols/v_disp
>>             mkdir /gluster_vols/v_disp/dir1
>>             dd if=/dev/zero of=/gluster_vols/v_disp/dir1/x bs=10k count=90000 &
>>             gluster v quota v_disp enable
>>             gluster v quota v_disp limit-usage /dir1 200MB
>>             gluster v quota v_disp soft-timeout 0
>>             gluster v quota v_disp hard-timeout 0
>>             # optional: remove 2 bricks (reproduces more often with this)
>>             # pgrep glusterfsd | xargs kill -9
>>
>>             An I/O error appears on stdout when the quota is exceeded,
>>             followed by "Disk quota exceeded". Also note that the issue
>>             occurs when a flush happens simultaneously with the quota
>>             limit being hit; hence it is seen only on some runs.
>>
>>             The following are the errors in the logs:
>>
>>             [2016-09-12 10:40:02.431568] E [MSGID: 122034]
>>             [ec-common.c:488:ec_child_select] 0-v_disp-disperse-0:
>>             Insufficient available childs for this request (have 0,
>>             need 4)
>>             [2016-09-12 10:40:02.431627] E [MSGID: 122037]
>>             [ec-common.c:1830:ec_update_size_version_done] 0-Disperse:
>>             sku-debug: pre-version=0/0, size=0post-version=1865/1865,
>>             size=209571840
>>             [2016-09-12 10:40:02.431637] E [MSGID: 122037]
>>             [ec-common.c:1835:ec_update_size_version_done]
>>             0-v_disp-disperse-0: Failed to update version and size
>>             [Input/output error]
>>             [2016-09-12 10:40:02.431664] E [MSGID: 122034]
>>             [ec-common.c:417:ec_child_select] 0-v_disp-disperse-0:
>>             sku-debug: mask: 36, ec->xl_up 36, ec->node_mask 3f,
>>             parent->mask:36, fop->parent->healing:0, id:29
>>             [2016-09-12 10:40:02.431673] E [MSGID: 122034]
>>             [ec-common.c:480:ec_child_select] 0-v_disp-disperse-0:
>>             sku-debug: mask: 36, remaining: 36, healing: 0, ec->xl_up
>>             36, ec->node_mask 3f, parent->mask:36, num:4, minimum: 1,
>>             id:29
>>             ...
>>             [2016-09-12 10:40:02.487302] W
>>             [fuse-bridge.c:2311:fuse_writev_cbk] 0-glusterfs-fuse:
>>             41159: WRITE => -1
>>             gfid=ee0b4aa1-1f44-486a-883c-acddc13ee318
>>             fd=0x7f1d9c003edc (Input/output error)
>>             [2016-09-12 10:40:02.500151] W [MSGID: 122006]
>>             [ec-combine.c:206:ec_iatt_combine] 0-v_disp-disperse-0:
>>             Failed to combine iatt (inode:
>>             9816911356190712600-9816911356190712600, links: 1-1, uid:
>>             0-0, gid: 0-0, rdev: 0-0, size: 52423680-52413440, mode:
>>             100644-100644)
>>             [2016-09-12 10:40:02.500188] N [MSGID: 122029]
>>             [ec-combine.c:93:ec_combine_write] 0-v_disp-disperse-0:
>>             Mismatching iatt in answers of 'WRITE'
>>             [2016-09-12 10:40:02.504551] W [MSGID: 122006]
>>             [ec-combine.c:206:ec_iatt_combine] 0-v_disp-disperse-0:
>>             Failed to combine iatt (inode:
>>             9816911356190712600-9816911356190712600, links: 1-1, uid:
>>             0-0, gid: 0-0, rdev: 0-0, size: 52423680-52413440, mode:
>>             100644-100644)
>>             ...
>>             [2016-09-12 10:40:02.571272] N [MSGID: 122029]
>>             [ec-combine.c:93:ec_combine_write] 0-v_disp-disperse-0:
>>             Mismatching iatt in answers of 'WRITE'
>>             [2016-09-12 10:40:02.571510] W [MSGID: 122006]
>>             [ec-combine.c:206:ec_iatt_combine] 0-v_disp-disperse-0:
>>             Failed to combine iatt (inode:
>>             9816911356190712600-9816911356190712600, links: 1-1, uid:
>>             0-0, gid: 0-0, rdev: 0-0, size: 52423680-52413440, mode:
>>             100644-100644)
>>             [2016-09-12 10:40:02.571544] N [MSGID: 122029]
>>             [ec-combine.c:93:ec_combine_write] 0-v_disp-disperse-0:
>>             Mismatching iatt in answers of 'WRITE'
>>             [2016-09-12 10:40:02.571772] W
>>             [fuse-bridge.c:1290:fuse_err_cbk] 0-glusterfs-fuse: 41160:
>>             FLUSH() ERR => -1 (Input/output error)
>>
>>             Also, for some fops before the write I noticed the
>>             fop->mask field was 0; it's not clear why this happens:
>>
>>             [2016-09-12 10:40:02.431561] E [MSGID: 122034]
>>             [ec-common.c:480:ec_child_select] 0-v_disp-disperse-0:
>>             sku-debug: mask: 0, remaining: 0, healing: 0, ec->xl_up 36,
>>             ec->node_mask 3f, parent->mask:36, num:0, minimum: 4,
>>             fop->id:34
>>             [2016-09-12 10:40:02.431568] E [MSGID: 122034]
>>             [ec-common.c:488:ec_child_select] 0-v_disp-disperse-0:
>>             Insufficient available childs for this request (have 0,
>>             need 4)
>>             [2016-09-12 10:40:02.431637] E [MSGID: 122037]
>>             [ec-common.c:1835:ec_update_size_version_done]
>>             0-v_disp-disperse-0: Failed to update version and size
>>             [Input/output error]
>>
>>             Is the zero value of fop->mask related to the mismatch in
>>             iatt? Is there any scenario of a race between the write and
>>             flush fops? Please suggest how to proceed.
>>
>>             Thanks and Regards,
>>             Sanoj
>>
>> --
>> Pranith
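One more note on the masks in those log lines: they are bitmaps with one
bit per brick (ec->node_mask 3f covers all six bricks of the set), so
they can be decoded by hand. A small illustrative helper (hypothetical
code, only the mask values are taken from the logs above):

/* Decode the brick bitmasks quoted in the logs. Illustration only. */
#include <stdio.h>

static void print_mask(const char *name, unsigned mask, int bricks)
{
    printf("%-13s = 0x%02x -> bricks up:", name, mask);
    for (int i = 0; i < bricks; i++)
        if (mask & (1u << i))
            printf(" %d", i);
    printf("\n");
}

int main(void)
{
    print_mask("ec->node_mask", 0x3f, 6); /* all six bricks of the set */
    print_mask("ec->xl_up",     0x36, 6); /* bricks 1, 2, 4, 5 only    */
    return 0;
}

Decoded this way, xl_up = 36 matches the two killed bricks (only bricks
1, 2, 4 and 5 are up), while a mask of 0 means the fop had no bricks
left to run on.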
--
Pranith

_______________________________________________
Gluster-devel mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-devel
