On Wed, Sep 14, 2016 at 12:08 PM, Xavier Hernandez <[email protected]> wrote:
> On 13/09/16 21:00, Pranith Kumar Karampuri wrote:
>>
>> On Tue, Sep 13, 2016 at 1:39 PM, Xavier Hernandez
>> <[email protected]> wrote:
>>
>>     Hi Sanoj,
>>
>>     On 13/09/16 09:41, Sanoj Unnikrishnan wrote:
>>
>>         Hi Xavi,
>>
>>         That explains a lot.
>>         I see a couple of other scenarios which can lead to a similar
>>         inconsistency:
>>         1) simultaneous node/brick crash of 3 bricks.
>>
>>     Although this is a real problem, the 3 bricks would need to crash at
>>     exactly the same moment, just after having successfully locked the
>>     inode being modified and queried some information, but before
>>     sending the write fop or any down notification. The probability of
>>     suffering this problem is really small.
>>
>>         2) if the disk space of the underlying filesystem on which the
>>         brick is hosted is exhausted on 3 bricks.
>>
>>     Yes. This is the same cause that makes quota fail.
>>
>>         I don't think we can address all these scenarios unless we have
>>         a log/journal mechanism like raid-5.
>>
>>     I completely agree. I don't see any solution valid for all cases.
>>     BTW, RAID-5 *is not* a solution: it doesn't have any log/journal.
>>     Maybe something based on the fdl xlator would work.
>>
>>         Should we look at a quota-specific fix or let it get fixed
>>         whenever we introduce a log?
>>
>>     Not sure how to fix this in a way that doesn't seem too hacky.
>>
>>     One possibility is to request permission to write some data before
>>     actually writing it (specifying offset and size), and then be sure
>>     that the write will succeed if all bricks (or at least the minimum
>>     number of data bricks) have acknowledged the previous write
>>     permission request.
>>
>>     Another approach would be to queue writes in a server-side xlator
>>     until a commit message is received, but sending back an answer
>>     saying whether there's enough space to do the write (this is, in
>>     some way, a very primitive log/journal approach).
>>
>>     However, both approaches will have a big performance impact if they
>>     cannot be executed in the background.
>>
>>     Maybe it would be worth investing in fdl instead of trying to find
>>     a custom solution to this.
>>
>> There are some things we should do irrespective of this change:
>> 1) When the file is in a state where all 512 bytes of the fragment
>> represent data, we shouldn't increase the file size at all, which
>> discards the write without any problems, i.e. this case is recoverable.
>
> I don't understand this case. You mean a write at an offset smaller than
> the total size of the file that doesn't increase it? If that's the case,
> a sparse file could need to allocate new blocks even if the file size
> doesn't change.

I think you got it in the second point. Anyway, I was saying the same
thing: if the failed write doesn't touch the previous contents, we can
keep the size as is and not increment the version and size. Example: if
the file is 128KB to begin with and we append 4KB and the write fails,
then we keep the file size as 128KB.

>> 2) When we append data to a partially filled chunk and it fails on 3/6
>> bricks, the rest could be recovered by adjusting the file size to the
>> size represented by (previous block - 1) * k. We should probably
>> provide an option to do so?
>
> We could do that, but this only represents a single case that later will
> also be covered by the journal.
>
> In any case, the solution here would be to restore the previous file
> size instead of (previous block - 1) * k, since the latter can cause the
> file size to decrease. This works as long as we can assume that a failed
> write doesn't touch previous contents.
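To make the size arithmetic concrete: a minimal sketch (assuming a 4+2
layout with 512-byte fragments, so one stripe holds 2048 bytes; the file
sizes are hypothetical and this is not GlusterFS code):

/* sketch.c - illustration only, not GlusterFS code.
 * 4+2 disperse layout: each brick stores a 512-byte fragment per
 * stripe, so one stripe holds K * 512 = 2048 bytes of user data. */
#include <stdio.h>
#include <inttypes.h>
#include <stdint.h>

#define K           4
#define FRAG_SIZE   512
#define STRIPE_SIZE (K * FRAG_SIZE)

int main(void)
{
    uint64_t prev_size = 10000;                   /* size before the failed append */
    uint64_t partial   = prev_size / STRIPE_SIZE; /* stripe that was partially filled */
    uint64_t rounded   = partial * STRIPE_SIZE;   /* 8192: previous block boundary */

    /* Truncating to the block boundary drops 1808 valid bytes;
     * restoring prev_size drops none, provided the failed write did
     * not overwrite previous contents. */
    printf("previous size : %" PRIu64 "\n", prev_size);
    printf("rounded down  : %" PRIu64 " (loses %" PRIu64 " bytes)\n",
           rounded, prev_size - rounded);
    return 0;
}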
>> 3) Provide some utility/setfattr to perform recovery based on data
>> rather than versions, i.e. it needs to detect and tell which part of
>> the data is not recoverable and which is. Based on that, the user
>> should be able to recover.
>
> This is not possible with the current implementation, at least not in an
> efficient way. The only way to detect inconsistencies right now would be
> to create all possible combinations of k bricks and compute the decoded
> data for each of them, then check whether everything matches. If so, the
> block is healthy; otherwise there is at least one damaged fragment. Then
> it would be necessary to find a relation between the fragments used for
> each reconstruction and the obtained data to determine a probable
> candidate to be damaged.
>
> Anyway, if there are 3 bad fragments in a 4+2 configuration, it won't be
> possible to recover the data for that block. Of course, in this
> particular case, this would mean that we would be able to recover all of
> the file except the last written block.
>
> With syndrome decoding (not currently implemented) and using a new
> encoding matrix, this could be done in an efficient way.
>
> Xavi
>
>> What do you guys think?
>>
>>     Xavi
>>
>>         Thanks and Regards,
>>         Sanoj
>>
>>         ----- Original Message -----
>>         From: "Xavier Hernandez" <[email protected]>
>>         To: "Raghavendra Gowdappa" <[email protected]>,
>>         "Sanoj Unnikrishnan" <[email protected]>
>>         Cc: "Pranith Kumar Karampuri" <[email protected]>,
>>         "Ashish Pandey" <[email protected]>,
>>         "Gluster Devel" <[email protected]>
>>         Sent: Tuesday, September 13, 2016 11:50:27 AM
>>         Subject: Re: Need help with
>>         https://bugzilla.redhat.com/show_bug.cgi?id=1224180
>>
>>         Hi Sanoj,
>>
>>         I'm unable to see bug 1224180. Access is restricted.
>>
>>         Not sure what the problem is exactly, but I see that quota is
>>         involved. Currently disperse doesn't play well with quota when
>>         the limit is near.
>>
>>         The reason is that not all bricks fail at the same time with
>>         EDQUOT, due to small differences in computed space. This causes
>>         a valid write to succeed on some bricks and fail on others. If
>>         it fails simultaneously on more than 'redundancy' bricks but on
>>         fewer than the number of data bricks, there's no way to roll
>>         back the changes on the bricks that have succeeded, so the
>>         operation is inconsistent and an I/O error is returned.
>>
>>         For example, on a 6:2 configuration (4 data bricks and 2
>>         redundancy), if 3 bricks succeed and 3 fail, there are not
>>         enough bricks with the updated version, but there aren't enough
>>         bricks with the old version either.
>>
>>         If you force 2 bricks to be down, the problem can appear more
>>         frequently, as only a single failure causes this problem.
>>
>>         Xavi
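Reduced to its core, the failure in that example is a quorum check: a
block is decodable only if at least k bricks agree on one version. A
minimal sketch of the three possible outcomes (hypothetical helper, not
the actual ec-common.c logic):

/* Why a write that succeeds on 3 of 6 bricks in a 4+2 configuration
 * leaves the block undecodable. Illustration only. */
#include <stdio.h>

#define DATA_BRICKS 4   /* k: fragments needed to decode a block */

/* At least k bricks must agree on one version for the block to be
 * readable. */
static const char *classify(int with_new, int with_old)
{
    if (with_new >= DATA_BRICKS)
        return "ok: new data decodable, lagging bricks can be healed";
    if (with_old >= DATA_BRICKS)
        return "ok: write never took effect, heal back to old data";
    return "inconsistent: neither version decodable -> EIO";
}

int main(void)
{
    printf("4 new / 2 old: %s\n", classify(4, 2)); /* tolerable failure */
    printf("3 new / 3 old: %s\n", classify(3, 3)); /* the quota case    */
    return 0;
}

With 4+2, any split where neither side reaches 4 bricks, such as 3/3, is
unrecoverable, which is exactly the quota case described above.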
>>         On 13/09/16 06:09, Raghavendra Gowdappa wrote:
>>
>>             +gluster-devel
>>
>>             ----- Original Message -----
>>             From: "Sanoj Unnikrishnan" <[email protected]>
>>             To: "Pranith Kumar Karampuri" <[email protected]>,
>>             "Ashish Pandey" <[email protected]>,
>>             [email protected],
>>             "Raghavendra Gowdappa" <[email protected]>
>>             Sent: Monday, September 12, 2016 7:06:59 PM
>>             Subject: Need help with
>>             https://bugzilla.redhat.com/show_bug.cgi?id=1224180
>>
>>             Hello Xavi/Pranith,
>>
>>             I have been able to reproduce the BZ with the following
>>             steps:
>>
>>             gluster volume create v_disp disperse 6 redundancy 2
>>               $tm1:/export/sdb/br1 $tm2:/export/sdb/b2
>>               $tm3:/export/sdb/br3 $tm1:/export/sdb/b4
>>               $tm2:/export/sdb/b5 $tm3:/export/sdb/b6 force
>>             # (used only 3 nodes, should not matter here)
>>             gluster volume start v_disp
>>             mount -t glusterfs $tm1:v_disp /gluster_vols/v_disp
>>             mkdir /gluster_vols/v_disp/dir1
>>             dd if=/dev/zero of=/gluster_vols/v_disp/dir1/x bs=10k count=90000 &
>>             gluster v quota v_disp enable
>>             gluster v quota v_disp limit-usage /dir1 200MB
>>             gluster v quota v_disp soft-timeout 0
>>             gluster v quota v_disp hard-timeout 0
>>             # optional: remove 2 bricks (reproduces more often with this)
>>             # pgrep glusterfsd | xargs kill -9
>>
>>             An I/O error appears on stdout when the quota is exceeded,
>>             followed by "Disk quota exceeded". Also note that the issue
>>             occurs when a flush happens simultaneously with the quota
>>             limit being hit; hence it is seen only on some runs.
>>
>>             The following are the errors in the logs:
>>
>>             [2016-09-12 10:40:02.431568] E [MSGID: 122034]
>>             [ec-common.c:488:ec_child_select] 0-v_disp-disperse-0:
>>             Insufficient available childs for this request (have 0,
>>             need 4)
>>             [2016-09-12 10:40:02.431627] E [MSGID: 122037]
>>             [ec-common.c:1830:ec_update_size_version_done] 0-Disperse:
>>             sku-debug: pre-version=0/0, size=0post-version=1865/1865,
>>             size=209571840
>>             [2016-09-12 10:40:02.431637] E [MSGID: 122037]
>>             [ec-common.c:1835:ec_update_size_version_done]
>>             0-v_disp-disperse-0: Failed to update version and size
>>             [Input/output error]
>>             [2016-09-12 10:40:02.431664] E [MSGID: 122034]
>>             [ec-common.c:417:ec_child_select] 0-v_disp-disperse-0:
>>             sku-debug: mask: 36, ec->xl_up 36, ec->node_mask 3f,
>>             parent->mask:36, fop->parent->healing:0, id:29
>>             [2016-09-12 10:40:02.431673] E [MSGID: 122034]
>>             [ec-common.c:480:ec_child_select] 0-v_disp-disperse-0:
>>             sku-debug: mask: 36, remaining: 36, healing: 0, ec->xl_up
>>             36, ec->node_mask 3f, parent->mask:36, num:4, minimum: 1,
>>             id:29
>>             ...
>>             [2016-09-12 10:40:02.487302] W
>>             [fuse-bridge.c:2311:fuse_writev_cbk] 0-glusterfs-fuse:
>>             41159: WRITE => -1
>>             gfid=ee0b4aa1-1f44-486a-883c-acddc13ee318
>>             fd=0x7f1d9c003edc (Input/output error)
>>             [2016-09-12 10:40:02.500151] W [MSGID: 122006]
>>             [ec-combine.c:206:ec_iatt_combine] 0-v_disp-disperse-0:
>>             Failed to combine iatt (inode:
>>             9816911356190712600-9816911356190712600, links: 1-1, uid:
>>             0-0, gid: 0-0, rdev: 0-0, size: 52423680-52413440, mode:
>>             100644-100644)
>>             [2016-09-12 10:40:02.500188] N [MSGID: 122029]
>>             [ec-combine.c:93:ec_combine_write] 0-v_disp-disperse-0:
>>             Mismatching iatt in answers of 'WRITE'
>>             [2016-09-12 10:40:02.504551] W [MSGID: 122006]
>>             [ec-combine.c:206:ec_iatt_combine] 0-v_disp-disperse-0:
>>             Failed to combine iatt (inode:
>>             9816911356190712600-9816911356190712600, links: 1-1, uid:
>>             0-0, gid: 0-0, rdev: 0-0, size: 52423680-52413440, mode:
>>             100644-100644)
>>             ...
>>             [2016-09-12 10:40:02.571272] N [MSGID: 122029]
>>             [ec-combine.c:93:ec_combine_write] 0-v_disp-disperse-0:
>>             Mismatching iatt in answers of 'WRITE'
>>             [2016-09-12 10:40:02.571510] W [MSGID: 122006]
>>             [ec-combine.c:206:ec_iatt_combine] 0-v_disp-disperse-0:
>>             Failed to combine iatt (inode:
>>             9816911356190712600-9816911356190712600, links: 1-1, uid:
>>             0-0, gid: 0-0, rdev: 0-0, size: 52423680-52413440, mode:
>>             100644-100644)
>>             [2016-09-12 10:40:02.571544] N [MSGID: 122029]
>>             [ec-combine.c:93:ec_combine_write] 0-v_disp-disperse-0:
>>             Mismatching iatt in answers of 'WRITE'
>>             [2016-09-12 10:40:02.571772] W
>>             [fuse-bridge.c:1290:fuse_err_cbk] 0-glusterfs-fuse: 41160:
>>             FLUSH() ERR => -1 (Input/output error)
>>
>>             Also, for some fops before the write I noticed the
>>             fop->mask field was 0; it's not clear why this happens:
>>
>>             [2016-09-12 10:40:02.431561] E [MSGID: 122034]
>>             [ec-common.c:480:ec_child_select] 0-v_disp-disperse-0:
>>             sku-debug: mask: 0, remaining: 0, healing: 0, ec->xl_up 36,
>>             ec->node_mask 3f, parent->mask:36, num:0, minimum: 4,
>>             fop->id:34
>>             [2016-09-12 10:40:02.431568] E [MSGID: 122034]
>>             [ec-common.c:488:ec_child_select] 0-v_disp-disperse-0:
>>             Insufficient available childs for this request (have 0,
>>             need 4)
>>             [2016-09-12 10:40:02.431637] E [MSGID: 122037]
>>             [ec-common.c:1835:ec_update_size_version_done]
>>             0-v_disp-disperse-0: Failed to update version and size
>>             [Input/output error]
>>
>>             Is the zero value of fop->mask related to the mismatch in
>>             iatt? Is there any scenario of a race between the write and
>>             flush fops? Please suggest how to proceed.
>>
>>             Thanks and Regards,
>>             Sanoj
>>
>> --
>> Pranith
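One more note on the masks in those log lines: they are bitmaps with one
bit per brick (ec->node_mask 3f covers all six bricks of the set), so
they can be decoded by hand. A small illustrative helper (hypothetical
code, only the mask values are taken from the logs above):

/* Decode the brick bitmasks quoted in the logs. Illustration only. */
#include <stdio.h>

static void print_mask(const char *name, unsigned mask, int bricks)
{
    printf("%-13s = 0x%02x -> bricks up:", name, mask);
    for (int i = 0; i < bricks; i++)
        if (mask & (1u << i))
            printf(" %d", i);
    printf("\n");
}

int main(void)
{
    print_mask("ec->node_mask", 0x3f, 6); /* all six bricks of the set */
    print_mask("ec->xl_up",     0x36, 6); /* bricks 1, 2, 4, 5 only    */
    return 0;
}

Decoded this way, xl_up = 36 matches the two killed bricks (only bricks
1, 2, 4 and 5 are up), while a mask of 0 means the fop had no bricks
left to run on.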
--
Pranith

_______________________________________________
Gluster-devel mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-devel
