On Thu, Apr 11, 2019 at 2:59 PM Xavi Hernandez <[email protected]> wrote:
> On Wed, Apr 10, 2019 at 7:25 PM Xavi Hernandez <[email protected]> wrote:
>
>> On Wed, Apr 10, 2019 at 4:01 PM Atin Mukherjee <[email protected]> wrote:
>>
>>> And now for last 15 days:
>>>
>>> https://fstat.gluster.org/summary?start_date=2019-03-25&end_date=2019-04-10
>>>
>>> ./tests/bitrot/bug-1373520.t 18 ==> Fixed through
>>> https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this
>>> failing in brick mux post 5th April
>>> ./tests/bugs/ec/bug-1236065.t 17 ==> happens only in brick mux,
>>> needs analysis.
>>
>> I've identified the problem here, but not the cause yet. There's a stale
>> inodelk acquired by a process that is already dead, which causes inodelk
>> requests from self-heal and other processes to block.
>>
>> The reason why it seemed to block in random places is that all commands
>> are executed with the working directory pointing to a gluster directory
>> which needs healing after the initial tests. Because of the stale inodelk,
>> when any application tries to open a file in the working directory, it's
>> blocked.
>>
>> I'll investigate what causes this.
>
> I think I've found the problem.
> This is a fragment of the brick log that includes script steps,
> connections and disconnections of brick 0, and lock requests to the
> problematic lock:
>
> [2019-04-11 08:22:20.381398]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++++++++++
> [2019-04-11 08:22:22.532646]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 67 kill_brick patchy jahernan /d/backends/patchy3 ++++++++++
> [2019-04-11 08:22:23.709655] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client from CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2 (version: 7dev) with subvol /d/backends/patchy1
> [2019-04-11 08:22:23.792204] I [common.c:234:pl_trace_in] 8-patchy-locks: [REQUEST] Locker = {Pid=29710, lk-owner=68580998b47f0000, Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2, Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain: patchy-disperse-0, start=0, len=0, pid=0}
> [2019-04-11 08:22:23.792299] I [common.c:285:pl_trace_out] 8-patchy-locks: [GRANTED] Locker = {Pid=29710, lk-owner=68580998b47f0000, Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2, Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain: patchy-disperse-0, start=0, len=0, pid=0}
> [2019-04-11 08:22:24.628478]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 68 5 online_brick_count ++++++++++
> [2019-04-11 08:22:26.097092]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 70 rm -f 0.o 10.o 11.o 12.o 13.o 14.o 15.o 16.o 17.o 18.o 19.o 1.o 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++++++++++
> [2019-04-11 08:22:26.333740]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 71 ec_test_make ++++++++++
> [2019-04-11 08:22:27.718963] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client from CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3 (version: 7dev) with subvol /d/backends/patchy1
> [2019-04-11 08:22:27.801416] I [common.c:234:pl_trace_in] 8-patchy-locks: [REQUEST] Locker = {Pid=29885, lk-owner=68580998b47f0000, Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3, Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain: patchy-disperse-0, start=0, len=0, pid=0}
> [2019-04-11 08:22:27.801434] E [inodelk.c:513:__inode_unlock_lock] 8-patchy-locks: Matching lock not found for unlock 0-9223372036854775807, by 68580998b47f0000 on 0x7f0ed0029190
> [2019-04-11 08:22:27.801446] I [common.c:285:pl_trace_out] 8-patchy-locks: [Invalid argument] Locker = {Pid=29885, lk-owner=68580998b47f0000, Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3, Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain: patchy-disperse-0, start=0, len=0, pid=0}
>
> This is a fragment of the client log:
>
> [2019-04-11 08:22:20.381398]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++++++++++
> [2019-04-11 08:22:20.675938] I [MSGID: 114018] [client.c:2333:client_rpc_notify] 0-patchy-client-1: disconnected from patchy-client-1. Client process will keep trying to connect to glusterd until brick's port is available
> [2019-04-11 08:22:21.674772] W [MSGID: 122035] [ec-common.c:654:ec_child_select] 0-patchy-disperse-0: Executing operation with some subvolumes unavailable. (6). FOP : 'INODELK' failed on '/test' with gfid 35743386-b7c2-41c9-aafd-6b13de216704
> [2019-04-11 08:22:22.532646]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 67 kill_brick patchy jahernan /d/backends/patchy3 ++++++++++
> [2019-04-11 08:22:23.691171] W [MSGID: 122035] [ec-common.c:654:ec_child_select] 0-patchy-disperse-0: Executing operation with some subvolumes unavailable. (8). FOP : 'INODELK' failed on '/test' with gfid 35743386-b7c2-41c9-aafd-6b13de216704
> [2019-04-11 08:22:23.710420] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-patchy-client-1: Connected to patchy-client-1, attached to remote volume '/d/backends/patchy1'.
> [2019-04-11 08:22:23.791635] W [MSGID: 122035] [ec-common.c:654:ec_child_select] 0-patchy-disperse-0: Executing operation with some subvolumes unavailable. (C). FOP : 'INODELK' failed on '/test' with gfid 35743386-b7c2-41c9-aafd-6b13de216704
> [2019-04-11 08:22:24.460529] I [MSGID: 114018] [client.c:2333:client_rpc_notify] 0-patchy-client-1: disconnected from patchy-client-1.
> Client process will keep trying to connect to glusterd until brick's port is available
> [2019-04-11 08:22:24.628478]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 68 5 online_brick_count ++++++++++
> [2019-04-11 08:22:26.097092]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 70 rm -f 0.o 10.o 11.o 12.o 13.o 14.o 15.o 16.o 17.o 18.o 19.o 1.o 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++++++++++
> [2019-04-11 08:22:26.333740]:++++++++++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 71 ec_test_make ++++++++++
> [2019-04-11 08:22:27.719299] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-patchy-client-1: Connected to patchy-client-1, attached to remote volume '/d/backends/patchy1'.
> [2019-04-11 08:22:27.840342] W [MSGID: 122035] [ec-common.c:654:ec_child_select] 0-patchy-disperse-0: Executing operation with some subvolumes unavailable. (C). FOP : 'INODELK' failed on '/test' with gfid 35743386-b7c2-41c9-aafd-6b13de216704
>
> The problem happens because of two things:
>
> 1. Brick 0 gets disconnected randomly (apparently), but the server side
> is not aware of these disconnections. This means that by 08:22:24.460529
> the client had already sent a successful INODELK request to brick 0. At
> this point the connection is broken on the client side, but the server
> side doesn't get any notification, so it doesn't clear the locks.
>
> 2. When the client reconnects at 08:22:27.719299, a new connection is
> created, and the server does see this new connection (it creates a new
> client_t structure). Then the client sends the unlock request, which
> fails on brick 0 because the locks xlator checks whether the client is
> the same by comparing pointers, but they are different because of the
> reconnection. So the lock is never released and remains there, blocking
> all future inodelk requests.
>
> The first problem is why the client gets disconnected and the server
> doesn't get any notification. The script is stopping bricks 2 and 3
> when this happens.
> Brick 0 shouldn't fail here. It seems related to the
>
> The second problem is that when we receive a new connection from a
> client we already consider connected, we don't clean up the old
> connection, which should take care of the stale locks.
>
> The third problem is that the locks xlator checks whether the client is
> the same by comparing pointers of client_t structs instead of comparing
> the client_uid field, which remains the same.
>
> Adding +Raghavendra Gowdappa <[email protected]>, +Pranith Kumar
> Karampuri <[email protected]>, +Krutika Dhananjay <[email protected]>,
> +Shyam Ranganathan <[email protected]> and +Amar Tumballi
> <[email protected]> to help me identify why this is happening and
> what's the best way to solve it.

If the server gets the disconnect notification, then everything will be
solved. I think we need to find the RC for that. Were you able to recreate
it locally? Even if it only happens once in a while, that is fine.

> Xavi
>
>> Xavi
>>
>>> ./tests/basic/uss.t 15 ==> happens in both brick mux and non
>>> brick mux runs, test just simply times out. Needs urgent analysis.
>>> ./tests/basic/ec/ec-fix-openfd.t 13 ==> Fixed through
>>> https://review.gluster.org/#/c/22508/ , patch merged today.
>>> ./tests/basic/volfile-sanity.t 8 ==> Some race, though this
>>> succeeds in second attempt every time.
>>>
>>> There're plenty more with 5 instances of failure from many tests. We
>>> need all maintainers/owners to look through these failures and fix
>>> them; we certainly don't want to get into a stage where master is
>>> unstable and we have to lock down the merges till all these failures
>>> are resolved. So please help.
>>>
>>> (Please note fstat stats show up the retries as failures too, which in
>>> a way is right.)
>>>
>>> On Tue, Feb 26, 2019 at 5:27 PM Atin Mukherjee <[email protected]> wrote:
>>>
>>>> [1] captures the test failures report since last 30 days and we'd need
>>>> volunteers/component owners to see why the number of failures are so
>>>> high against few tests.
>>>>
>>>> [1] https://fstat.gluster.org/summary?start_date=2019-01-26&end_date=2019-02-25&job=all
>>>
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> [email protected]
>>> https://lists.gluster.org/mailman/listinfo/gluster-devel

--
Pranith
_______________________________________________
Gluster-devel mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-devel
