On 31 July 2018 at 22:11, Atin Mukherjee <[email protected]> wrote:
> I just went through the nightly regression report of brick mux runs and
> here's what I can summarize.
>
> ===========================================================================
> Fails only with brick-mux
> ===========================================================================
>
> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
> 400 secs. Refer
> https://fstat.gluster.org/failure/209?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all,
> specifically the latest report
> https://build.gluster.org/job/regression-test-burn-in/4051/consoleText .
> Wasn't timing out as frequently as it was till 12 July, but since 27 July
> it has timed out twice. Beginning to believe commit
> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and 400
> secs is no longer sufficient (Mohit?)

One of the failed regression-test-burn-in runs was an actual failure, not a timeout:
https://build.gluster.org/job/regression-test-burn-in/4049

The brick disconnects from glusterd:

[2018-07-27 16:28:42.882668] I [MSGID: 106005] [glusterd-handler.c:6129:__glusterd_brick_rpc_notify] 0-management: Brick builder103.cloud.gluster.org:/d/backends/vol01/brick0 has disconnected from glusterd.
[2018-07-27 16:28:42.891031] I [MSGID: 106143] [glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: *removing brick /d/backends/vol01/brick0 on port 49152*
[2018-07-27 16:28:42.892379] I [MSGID: 106143] [glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: removing brick (null) on port 49152
[2018-07-27 16:29:02.636027]:++++++++++ G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 56 _GFS --attribute-timeout=0 --entry-timeout=0 -s builder103.cloud.gluster.org --volfile-id=patchy-vol20 /mnt/glusterfs/vol20 ++++++++++

So the client cannot connect to the bricks after this, as it never gets the port info from glusterd. From mnt-glusterfs-vol20.log:

[2018-07-27 16:29:02.769947] I [MSGID: 114020] [client.c:2329:notify] 0-patchy-vol20-client-1: parent translators are ready, attempting connect on transport
[2018-07-27 16:29:02.770677] E [MSGID: 114058] [client-handshake.c:1518:client_query_portmap_cbk] 0-patchy-vol20-client-0: *failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running*.
[2018-07-27 16:29:02.770767] I [MSGID: 114018] [client.c:2255:client_rpc_notify] 0-patchy-vol20-client-0: disconnected from patchy-vol20-client-0. Client process will keep trying to connect to glusterd until brick's port is available

From the brick logs:

[2018-07-27 16:28:34.729241] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 2b65c380-392e-459f-b722-c130aac29377
[2018-07-27 16:28:34.945474] I [MSGID: 115029] [server-handshake.c:786:server_setvolume] 0-patchy-vol01-server: accepted client from CTX_ID:72dcd65e-2125-4a79-8331-48c0fe9abce7-GRAPH_ID:0-PID:8483-HOST:builder103.cloud.gluster.org-PC_NAME:patchy-vol06-client-2-RECON_NO:-0 (version: 4.2dev)
[2018-07-27 16:28:35.946588] I [MSGID: 101016] [glusterfs3.h:739:dict_to_xdr] 0-dict: key 'glusterfs.xattrop_index_gfid' is would not be sent on wire in future [Invalid argument] *<--- Last brick log. It looks like the brick went down at this point.*

[2018-07-27 16:29:02.636027]:++++++++++ G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 56 _GFS --attribute-timeout=0 --entry-timeout=0 -s builder103.cloud.gluster.org --volfile-id=patchy-vol20 /mnt/glusterfs/vol20 ++++++++++
[2018-07-27 16:29:12.021827]:++++++++++ G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 83 dd if=/dev/zero of=/mnt/glusterfs/vol20/a_file bs=4k count=1 ++++++++++
[2018-07-27 16:29:12.039248]:++++++++++ G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 87 killall -9 glusterd ++++++++++
[2018-07-27 16:29:17.073995]:++++++++++ G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 89 killall -9 glusterfsd ++++++++++
[2018-07-27 16:29:22.096385]:++++++++++ G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 95 glusterd ++++++++++
[2018-07-27 16:29:24.481555] I [MSGID: 100030] [glusterfsd.c:2728:main] 0-/build/install/sbin/glusterfsd: Started running /build/install/sbin/glusterfsd version 4.2dev (args: /build/install/sbin/glusterfsd -s builder103.cloud.gluster.org --volfile-id patchy-vol01.builder103.cloud.gluster.org.d-backends-vol01-brick0 -p /var/run/gluster/vols/patchy-vol01/builder103.cloud.gluster.org-d-backends-vol01-brick0.pid -S /var/run/gluster/f4d6c8f7c3f85b18.socket --brick-name /d/backends/vol01/brick0 -l /var/log/glusterfs/bricks/d-backends-vol01-brick0.log --xlator-option *-posix.glusterd-uuid=0db25f79-8880-4f2d-b1e8-584e751ff0b9 --process-name brick --brick-port 49153 --xlator-option patchy-vol01-server.listen-port=49153)

From /var/log/messages:

*Jul 27 16:28:42 builder103 kernel: [ 2902] 0 2902 3777638 200036 2322 0 0 glusterfsd*
...
*Jul 27 16:28:42 builder103 kernel: Out of memory: Kill process 2902 (glusterfsd) score 418 or sacrifice child*
*Jul 27 16:28:42 builder103 kernel: Killed process 2902 (glusterfsd) total-vm:15110552kB, anon-rss:800144kB, file-rss:0kB, shmem-rss:0kB*
*Jul 27 16:30:01 builder103 systemd: Created slice User Slice of root.*

Possible OOM kill?

Regards,
Nithya

> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
> (Ref - https://build.gluster.org/job/regression-test-with-multiplex/814/console)
> - Test fails only in brick-mux mode; AI on Atin to look at and get back.
>
> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t
> (https://build.gluster.org/job/regression-test-with-multiplex/813/console)
> - Seems to have failed just twice in the last 30 days as per
> https://fstat.gluster.org/failure/251?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all.
> Need help from the AFR team.
>
> tests/bugs/quota/bug-1293601.t
> (https://build.gluster.org/job/regression-test-with-multiplex/812/console)
> - Hasn't failed after 26 July, and earlier it was failing regularly. Did
> we fix this test through any patch (Mohit?)
>
> tests/bitrot/bug-1373520.t
> (https://build.gluster.org/job/regression-test-with-multiplex/811/console)
> - Hasn't failed after 27 July, and earlier it was failing regularly. Did
> we fix this test through any patch (Mohit?)
>
> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core;
> not sure if brick mux is the culprit here or not. Ref -
> https://build.gluster.org/job/regression-test-with-multiplex/806/console .
> Seems to be a glustershd crash. Need help from AFR folks.
>
> ===========================================================================
> Fails for non-brick mux case too
> ===========================================================================
>
> tests/bugs/distribute/bug-1122443.t - Seems to be failing on my setup
> very often, without brick mux as well. Refer
> https://build.gluster.org/job/regression-test-burn-in/4050/consoleText .
> There's an email on gluster-devel and a BZ 1610240 for the same.
>
> tests/bugs/bug-1368312.t - Seems to be a new failure
> (https://build.gluster.org/job/regression-test-with-multiplex/815/console);
> however, it has been seen for a non-brick-mux case too -
> https://build.gluster.org/job/regression-test-burn-in/4039/consoleText .
> Need some eyes from AFR folks.
>
> tests/00-geo-rep/georep-basic-dr-tarssh.t - This isn't specific to brick
> mux; have seen this failing at multiple default regression runs. Refer
> https://fstat.gluster.org/failure/392?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all .
> We need help from the geo-rep devs to root cause this sooner rather than
> later.
>
> tests/00-geo-rep/georep-basic-dr-rsync.t - This isn't specific to brick
> mux; have seen this failing at multiple default regression runs. Refer
> https://fstat.gluster.org/failure/393?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all .
> We need help from the geo-rep devs to root cause this sooner rather than
> later.
>
> tests/bugs/glusterd/validating-server-quorum.t
> (https://build.gluster.org/job/regression-test-with-multiplex/810/console)
> - Fails for non-brick-mux cases too,
> https://fstat.gluster.org/failure/580?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all .
> Atin has a patch, https://review.gluster.org/20584, which resolves it,
> but the patch is failing regression for a different, unrelated test.
>
> tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t
> (Ref - https://build.gluster.org/job/regression-test-with-multiplex/809/console)
> - Fails for the non-brick-mux case too -
> https://build.gluster.org/job/regression-test-burn-in/4049/consoleText .
> Need some eyes from AFR folks.
>
> _______________________________________________
> maintainers mailing list
> [email protected]
> https://lists.gluster.org/mailman/listinfo/maintainers
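P.S. For anyone re-triaging regression-test-burn-in/4049: a quick, self-contained way to double-check the OOM theory from the archived logs. This is a sketch only — the file paths below are stand-ins, and the sample lines are copied verbatim from the builder103 logs quoted above (emphasis markers stripped):

```shell
# Sketch: the /tmp paths are stand-ins for the real archived logs; the
# sample contents are the exact lines quoted earlier in this thread.
cat > /tmp/messages.sample <<'EOF'
Jul 27 16:28:42 builder103 kernel: Out of memory: Kill process 2902 (glusterfsd) score 418 or sacrifice child
Jul 27 16:28:42 builder103 kernel: Killed process 2902 (glusterfsd) total-vm:15110552kB, anon-rss:800144kB, file-rss:0kB, shmem-rss:0kB
EOF
cat > /tmp/mnt-glusterfs-vol20.log.sample <<'EOF'
[2018-07-27 16:29:02.770677] E [MSGID: 114058] [client-handshake.c:1518:client_query_portmap_cbk] 0-patchy-vol20-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
EOF

# 1. Kernel-side evidence: the OOM killer took down glusterfsd (PID 2902)
#    at 16:28:42, which matches the timestamp of the glusterd disconnect.
grep -E 'Out of memory|Killed process' /tmp/messages.sample

# 2. Client-side symptom: MSGID 114058 right after a mount attempt means
#    glusterd had no port registered for the brick at that moment.
grep -c 'MSGID: 114058' /tmp/mnt-glusterfs-vol20.log.sample
```

On a live builder in this state, `gluster volume status patchy-vol20` should show the dead brick with no port and Online "N", which would line up with what client_query_portmap_cbk reports on the mount side.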
_______________________________________________
maintainers mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/maintainers
