On Thu, Apr 11, 2019 at 2:59 PM Xavi Hernandez <[email protected]> wrote:
> On Wed, Apr 10, 2019 at 7:25 PM Xavi Hernandez <[email protected]> wrote:
>
>> On Wed, Apr 10, 2019 at 4:01 PM Atin Mukherjee <[email protected]> wrote:
>>
>>> And now for last 15 days:
>>>
>>> https://fstat.gluster.org/summary?start_date=2019-03-25&end_date=2019-04-10
>>>
>>> ./tests/bitrot/bug-1373520.t 18 ==> Fixed through
>>> https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this
>>> failing in brick mux post 5th April
>>> ./tests/bugs/ec/bug-1236065.t 17 ==> happens only in brick mux,
>>> needs analysis.
>>
>> I've identified the problem here, but not the cause yet. There's a stale
>> inodelk acquired by a process that is already dead, which causes inodelk
>> requests from self-heal and other processes to block.
>>
>> The reason why it seemed to block in random places is that all commands
>> are executed with the working directory pointing to a gluster directory
>> which needs healing after the initial tests. Because of the stale inodelk,
>> when any application tries to open a file in the working directory, it's
>> blocked.
>>
>> I'll investigate what causes this.
>
> I think I've found the problem.
> This is a fragment of the brick log that includes script steps,
> connections and disconnections of brick 0, and lock requests to the
> problematic lock:
>
> [2019-04-11 08:22:20.381398]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++++++++++
> [2019-04-11 08:22:22.532646]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 67 kill_brick patchy jahernan /d/backends/patchy3 ++++++++++
> [2019-04-11 08:22:23.709655] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client from CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2 (version: 7dev) with subvol /d/backends/patchy1
> [2019-04-11 08:22:23.792204] I [common.c:234:pl_trace_in] 8-patchy-locks: [REQUEST] Locker = {Pid=29710, lk-owner=68580998b47f0000, Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2, Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain: patchy-disperse-0, start=0, len=0, pid=0}
> [2019-04-11 08:22:23.792299] I [common.c:285:pl_trace_out] 8-patchy-locks: [GRANTED] Locker = {Pid=29710, lk-owner=68580998b47f0000, Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2, Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain: patchy-disperse-0, start=0, len=0, pid=0}
> [2019-04-11 08:22:24.628478]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 68 5 online_brick_count ++++++++++
> [2019-04-11 08:22:26.097092]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 70 rm -f 0.o 10.o 11.o 12.o 13.o 14.o 15.o 16.o 17.o 18.o 19.o 1.o 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++++++++++
> [2019-04-11 08:22:26.333740]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 71 ec_test_make ++++++++++
> [2019-04-11 08:22:27.718963] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client from CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3 (version: 7dev) with subvol /d/backends/patchy1
> [2019-04-11 08:22:27.801416] I [common.c:234:pl_trace_in] 8-patchy-locks: [REQUEST] Locker = {Pid=29885, lk-owner=68580998b47f0000, Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3, Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain: patchy-disperse-0, start=0, len=0, pid=0}
> [2019-04-11 08:22:27.801434] E [inodelk.c:513:__inode_unlock_lock] 8-patchy-locks: Matching lock not found for unlock 0-9223372036854775807, by 68580998b47f0000 on 0x7f0ed0029190
> [2019-04-11 08:22:27.801446] I [common.c:285:pl_trace_out] 8-patchy-locks: [Invalid argument] Locker = {Pid=29885, lk-owner=68580998b47f0000, Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3, Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain: patchy-disperse-0, start=0, len=0, pid=0}
>
> This is a fragment of the client log:
>
> [2019-04-11 08:22:20.381398]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++++++++++
> [2019-04-11 08:22:20.675938] I [MSGID: 114018] [client.c:2333:client_rpc_notify] 0-patchy-client-1: disconnected from patchy-client-1. Client process will keep trying to connect to glusterd until brick's port is available
> [2019-04-11 08:22:21.674772] W [MSGID: 122035] [ec-common.c:654:ec_child_select] 0-patchy-disperse-0: Executing operation with some subvolumes unavailable. (6). FOP : 'INODELK' failed on '/test' with gfid 35743386-b7c2-41c9-aafd-6b13de216704
> [2019-04-11 08:22:22.532646]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 67 kill_brick patchy jahernan /d/backends/patchy3 ++++++++++
> [2019-04-11 08:22:23.691171] W [MSGID: 122035] [ec-common.c:654:ec_child_select] 0-patchy-disperse-0: Executing operation with some subvolumes unavailable. (8). FOP : 'INODELK' failed on '/test' with gfid 35743386-b7c2-41c9-aafd-6b13de216704
> [2019-04-11 08:22:23.710420] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-patchy-client-1: Connected to patchy-client-1, attached to remote volume '/d/backends/patchy1'.
> [2019-04-11 08:22:23.791635] W [MSGID: 122035] [ec-common.c:654:ec_child_select] 0-patchy-disperse-0: Executing operation with some subvolumes unavailable. (C). FOP : 'INODELK' failed on '/test' with gfid 35743386-b7c2-41c9-aafd-6b13de216704
> [2019-04-11 08:22:24.460529] I [MSGID: 114018] [client.c:2333:client_rpc_notify] 0-patchy-client-1: disconnected from patchy-client-1.
> Client process will keep trying to connect to glusterd until brick's port is available
> [2019-04-11 08:22:24.628478]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 68 5 online_brick_count ++++++++++
> [2019-04-11 08:22:26.097092]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 70 rm -f 0.o 10.o 11.o 12.o 13.o 14.o 15.o 16.o 17.o 18.o 19.o 1.o 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++++++++++
> [2019-04-11 08:22:26.333740]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 71 ec_test_make ++++++++++
> [2019-04-11 08:22:27.719299] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-patchy-client-1: Connected to patchy-client-1, attached to remote volume '/d/backends/patchy1'.
> [2019-04-11 08:22:27.840342] W [MSGID: 122035] [ec-common.c:654:ec_child_select] 0-patchy-disperse-0: Executing operation with some subvolumes unavailable. (C). FOP : 'INODELK' failed on '/test' with gfid 35743386-b7c2-41c9-aafd-6b13de216704
>
> The problem happens because of two things:
>
> 1. Brick 0 gets disconnected randomly (apparently), but the server side
> is not aware of these disconnections. This means that by 08:22:24.460529
> the client had already sent a successful INODELK request to brick 0. At
> this point the connection is broken on the client side, but the server
> side doesn't get any notification, so it doesn't clear the locks.
>
> 2. When the client reconnects at 08:22:27.719299, a new connection is
> created, and the server does see this new connection (it creates a new
> client_t structure). Then the client sends the unlock request, which
> fails on brick 0 because the locks xlator checks whether the client is
> the same by comparing pointers, but they are different because of the
> reconnection. So the lock is never released and remains there, blocking
> all future inodelk requests.
>
> The first problem is why the client gets disconnected and the server
> doesn't get any notification. The script is stopping bricks 2 and 3
> when this happens.
> Brick 0 shouldn't fail here. It seems related to the
>
> The second problem is that when we receive a new connection from a
> client we already consider connected, we don't clean up the old
> connection, which should take care of the stale locks.
>
> The third problem is that the locks xlator checks whether the client is
> the same by comparing pointers of client_t structs instead of comparing
> the client_uid field, which remains the same.
>
> Adding +Raghavendra Gowdappa <[email protected]>, +Pranith Kumar
> Karampuri <[email protected]>, +Krutika Dhananjay <[email protected]>,
> +Shyam Ranganathan <[email protected]> and +Amar Tumballi
> <[email protected]> to help me identify why this is happening and
> what's the best way to solve it.

If the server gets the disconnect notification, then everything will be
solved. I think we need to find the RC for that. Were you able to recreate
it locally? Even if it only happens once in a while, that is fine.

> Xavi
>
>> Xavi
>>
>>> ./tests/basic/uss.t 15 ==> happens in both brick mux and non
>>> brick mux runs, test just simply times out. Needs urgent analysis.
>>> ./tests/basic/ec/ec-fix-openfd.t 13 ==> Fixed through
>>> https://review.gluster.org/#/c/22508/ , patch merged today.
>>> ./tests/basic/volfile-sanity.t 8 ==> Some race, though this
>>> succeeds in second attempt every time.
>>>
>>> There're plenty more with 5 instances of failure from many tests. We
>>> need all maintainers/owners to look through these failures and fix
>>> them; we certainly don't want to get into a stage where master is
>>> unstable and we have to lock down the merges till all these failures
>>> are resolved. So please help.
>>>
>>> (Please note fstat stats show up the retries as failures too, which in
>>> a way is right.)
>>>
>>> On Tue, Feb 26, 2019 at 5:27 PM Atin Mukherjee <[email protected]> wrote:
>>>
>>>> [1] captures the test failures report since last 30 days and we'd need
>>>> volunteers/component owners to see why the number of failures are so
>>>> high against few tests.
>>>>
>>>> [1] https://fstat.gluster.org/summary?start_date=2019-01-26&end_date=2019-02-25&job=all
>>>
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> [email protected]
>>> https://lists.gluster.org/mailman/listinfo/gluster-devel

--
Pranith
_______________________________________________
Gluster-devel mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-devel
