Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Pranith Kumar Karampuri
On Thu, Aug 2, 2018 at 10:03 PM Pranith Kumar Karampuri 
wrote:

>
>
> On Thu, Aug 2, 2018 at 7:19 PM Atin Mukherjee  wrote:
>
>> New addition - tests/basic/volume.t - failed at least twice with an shd core.
>>
>> One such ref -
>> https://build.gluster.org/job/centos7-regression/2058/console
>>
>
> I will take a look.
>

The crash is happening inside libc and there are no line numbers to debug
further. Is there any way to get symbols and line numbers even for that? We
can find hints as to what could be going wrong. Let me try to re-create it on
the machines I have in the meanwhile.

(gdb) bt
#0  0x7feae916bb4f in _IO_cleanup () from ./lib64/libc.so.6
#1  0x7feae9127b8b in __run_exit_handlers () from ./lib64/libc.so.6
#2  0x7feae9127c27 in exit () from ./lib64/libc.so.6
#3  0x00408ba5 in cleanup_and_exit (signum=15) at
/home/jenkins/root/workspace/centos7-regression/glusterfsd/src/glusterfsd.c:1570
#4  0x0040a75f in glusterfs_sigwaiter (arg=0x7ffe6faa7540) at
/home/jenkins/root/workspace/centos7-regression/glusterfsd/src/glusterfsd.c:2332
#5  0x7feae9b27e25 in start_thread () from ./lib64/libpthread.so.0
#6  0x7feae91ecbad in clone () from ./lib64/libc.so.6
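
For whoever picks this up: a minimal sketch of how symbols for the libc frames
could be pulled in on a CentOS 7 regression machine (assuming the debuginfo
repos are reachable from the builder; the binary and core paths below are
placeholders):

  # install glibc debug symbols so the _IO_cleanup()/exit() frames resolve
  yum -y install yum-utils
  debuginfo-install -y glibc
  # re-open the core against the binary that produced it
  gdb /path/to/glusterfsd /path/to/core
  (gdb) bt full       # backtrace with locals once symbols are loaded
  (gdb) info threads  # the exit() is from the sigwaiter thread; check the others too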


>
>>
>>
>> On Thu, Aug 2, 2018 at 6:28 PM Sankarshan Mukhopadhyay <
>> sankarshan.mukhopadh...@gmail.com> wrote:
>>
>>> On Thu, Aug 2, 2018 at 5:48 PM, Kotresh Hiremath Ravishankar
>>>  wrote:
>>> > I am facing a different issue on the softserve machines. The fuse mount
>>> > itself is failing.
>>> > I tried the day before yesterday to debug geo-rep failures. I discussed
>>> > with Raghu, but could not root cause it. So none of the tests were
>>> > passing. It happened on both machine instances I tried.
>>> >
>>>
>>> Ugh! -infra team should have an issue to work with and resolve this.
>>>
>>>
>>> --
>>> sankarshan mukhopadhyay
>>> 
>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
>
>
> --
> Pranith
>


-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Release 5: Nightly test failures tracking

2018-08-02 Thread Shyam Ranganathan
On 07/24/2018 03:12 PM, Shyam Ranganathan wrote:
> 1) master branch health checks (weekly, till branching)
>   - Expect every Monday a status update on various tests runs

As we have quite a few jobs failing and quite a few tests failing, I have
created the sheet in [1] to enable better tracking of this.

Atin and I will keep this updated. If anyone is working on a test case
failure, add your name as a comment to the "Owner" cell, and if there is a
bug filed, do the same in the BZ# cell.

Newer failures or additions will be added to the sheet and, in addition,
posted to this thread for contributors to pick up and analyze.

The current list of tests is as follows (some of you are already looking at
these):
./tests/bugs/core/bug-1432542-mpx-restart-crash.t
./tests/00-geo-rep/georep-basic-dr-tarssh.t
./tests/bugs/bug-1368312.t
./tests/bugs/distribute/bug-1122443.t
./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
./tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t
./tests/bitrot/bug-1373520.t
./tests/bugs/ec/bug-1236065.t
./tests/00-geo-rep/georep-basic-dr-rsync.t
./tests/basic/ec/ec-1468261.t
./tests/bugs/glusterd/quorum-validation.t
./tests/bugs/quota/bug-1293601.t
./tests/basic/afr/add-brick-self-heal.t
./tests/basic/afr/granular-esh/replace-brick.t
./tests/bugs/core/multiplex-limit-issue-151.t
./tests/bugs/distribute/bug-1042725.t
./tests/bugs/distribute/bug-1117851.t
./tests/bugs/glusterd/rebalance-operations-in-single-node.t
./tests/bugs/index/bug-1559004-EMLINK-handling.t
./tests/bugs/replicate/bug-1386188-sbrain-fav-child.t
./tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t

Thanks.

[1] Test failures tracking:
https://docs.google.com/spreadsheets/d/1IF9GhpKah4bto19RQLr0y_Kkw26E_-crKALHSaSjZMQ/edit?usp=sharing
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Karthik Subrahmanya
On Tue 31 Jul, 2018, 10:17 PM Atin Mukherjee,  wrote:

> I just went through the nightly regression report of brick mux runs and
> here's what I can summarize.
>
>
> =========================
> Fails only with brick-mux
> =========================
> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after 400
> secs. Refer
> https://fstat.gluster.org/failure/209?state=2_date=2018-06-30_date=2018-07-31=all,
> specifically the latest report
> https://build.gluster.org/job/regression-test-burn-in/4051/consoleText .
> Wasn't timing out as frequently as it was till 12 July, but since 27 July
> it has timed out twice. Beginning to believe commit
> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400
> secs isn't sufficient (Mohit?)
>
> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
> (Ref -
> https://build.gluster.org/job/regression-test-with-multiplex/814/console)
> -  Test fails only in brick-mux mode, AI on Atin to look at and get back.
>
> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
> https://build.gluster.org/job/regression-test-with-multiplex/813/console)
> - Seems like failed just twice in last 30 days as per
> https://fstat.gluster.org/failure/251?state=2_date=2018-06-30_date=2018-07-31=all.
> Need help from AFR team.
>
> tests/bugs/quota/bug-1293601.t (
> https://build.gluster.org/job/regression-test-with-multiplex/812/console)
> - Hasn't failed after 26 July and earlier it was failing regularly. Did we
> fix this test through any patch (Mohit?)
>
> tests/bitrot/bug-1373520.t - (
> https://build.gluster.org/job/regression-test-with-multiplex/811/console)
> - Hasn't failed after 27 July and earlier it was failing regularly. Did we
> fix this test through any patch (Mohit?)
>
> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core;
> not sure whether it is related to brick mux, so not sure if brick mux is
> the culprit here. Ref -
> https://build.gluster.org/job/regression-test-with-multiplex/806/console
> . Seems to be a glustershd crash. Need help from AFR folks.
>
>
> ================================
> Fails for non-brick mux case too
> ================================
> tests/bugs/distribute/bug-1122443.t - Seems to be failing on my setup very
> often, without brick mux as well. Refer
> https://build.gluster.org/job/regression-test-burn-in/4050/consoleText .
> There's an email in gluster-devel and a BZ 1610240 for the same.
>
> tests/bugs/bug-1368312.t - Seems to be recent failures (
> https://build.gluster.org/job/regression-test-with-multiplex/815/console)
> - seems to be a new failure, however seen this for a non-brick-mux case too
> - https://build.gluster.org/job/regression-test-burn-in/4039/consoleText
> . Need some eyes from AFR folks.
>
> tests/00-geo-rep/georep-basic-dr-tarssh.t - this isn't specific to brick
> mux, have seen this failing at multiple default regression runs. Refer
> https://fstat.gluster.org/failure/392?state=2_date=2018-06-30_date=2018-07-31=all
> . We need help from the geo-rep devs to root cause this sooner rather than later.
>
> tests/00-geo-rep/georep-basic-dr-rsync.t - this isn't specific to brick
> mux, have seen this failing at multiple default regression runs. Refer
> https://fstat.gluster.org/failure/393?state=2_date=2018-06-30_date=2018-07-31=all
> . We need help from the geo-rep devs to root cause this sooner rather than later.
>
> tests/bugs/glusterd/validating-server-quorum.t (
> https://build.gluster.org/job/regression-test-with-multiplex/810/console)
> - Fails for non-brick-mux cases too,
> https://fstat.gluster.org/failure/580?state=2_date=2018-06-30_date=2018-07-31=all
> .  Atin has a patch https://review.gluster.org/20584 which resolves it
> but patch is failing regression for a different test which is unrelated.
>
> tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t
> (Ref -
> https://build.gluster.org/job/regression-test-with-multiplex/809/console)
> - fails for non brick mux case too -
> https://build.gluster.org/job/regression-test-burn-in/4049/consoleText -
> Need some eyes from AFR folks.
>
I am looking at this. It is not reproducible locally. Trying to do this on
softserve.

Regards,
Karthik

> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Kotresh Hiremath Ravishankar
I have attached them in the bug https://bugzilla.redhat.com/show_bug.cgi?id=1611635


On Thu, 2 Aug 2018, 22:21 Raghavendra Gowdappa,  wrote:

>
>
> On Thu, Aug 2, 2018 at 5:48 PM, Kotresh Hiremath Ravishankar <
> khire...@redhat.com> wrote:
>
>> I am facing a different issue on the softserve machines. The fuse mount
>> itself is failing.
>> I tried the day before yesterday to debug geo-rep failures. I discussed
>> with Raghu, but could not root cause it.
>>
>
> Where can I find the complete client logs for this?
>
> So none of the tests were passing. It happened on
>> both machine instances I tried.
>>
>> 
>> [2018-07-31 10:41:49.288117] D [fuse-bridge.c:5407:notify] 0-fuse: got
>> event 6 on graph 0
>> [2018-07-31 10:41:49.289427] D [fuse-bridge.c:4990:fuse_get_mount_status]
>> 0-fuse: mount status is 0
>> [2018-07-31 10:41:49.289555] D [fuse-bridge.c:4256:fuse_init]
>> 0-glusterfs-fuse: Detected support for FUSE_AUTO_INVAL_DATA. Enabling
>> fopen_keep_cache automatically.
>> [2018-07-31 10:41:49.289591] T [fuse-bridge.c:278:send_fuse_iov]
>> 0-glusterfs-fuse: writev() result 40/40
>> [2018-07-31 10:41:49.289610] I [fuse-bridge.c:4314:fuse_init]
>> 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel
>> 7.22
>> [2018-07-31 10:41:49.289627] I [fuse-bridge.c:4948:fuse_graph_sync]
>> 0-fuse: switched to graph 0
>> [2018-07-31 10:41:49.289696] T [MSGID: 0] [syncop.c:1261:syncop_lookup]
>> 0-stack-trace: stack-address: 0x7f36e4001058, winding from fuse to
>> meta-autoload
>> [2018-07-31 10:41:49.289743] T [MSGID: 0]
>> [defaults.c:2716:default_lookup] 0-stack-trace: stack-address:
>> 0x7f36e4001058, winding from meta-autoload to master
>> [2018-07-31 10:41:49.289787] T [MSGID: 0]
>> [io-stats.c:2788:io_stats_lookup] 0-stack-trace: stack-address:
>> 0x7f36e4001058, winding from master to master-md-cache
>> [2018-07-31 10:41:49.289833] T [MSGID: 0]
>> [md-cache.c:513:mdc_inode_iatt_get] 0-md-cache: mdc_inode_ctx_get failed
>> (----0001)
>> [2018-07-31 10:41:49.289923] T [MSGID: 0] [md-cache.c:1200:mdc_lookup]
>> 0-stack-trace: stack-address: 0x7f36e4001058, winding from master-md-cache
>> to master-open-behind
>> [2018-07-31 10:41:49.289946] T [MSGID: 0]
>> [defaults.c:2716:default_lookup] 0-stack-trace: stack-address:
>> 0x7f36e4001058, winding from master-open-behind to master-quick-read
>> [2018-07-31 10:41:49.289973] T [MSGID: 0] [quick-read.c:556:qr_lookup]
>> 0-stack-trace: stack-address: 0x7f36e4001058, winding from
>> master-quick-read to master-io-cache
>> [2018-07-31 10:41:49.290002] T [MSGID: 0] [io-cache.c:298:ioc_lookup]
>> 0-stack-trace: stack-address: 0x7f36e4001058, winding from master-io-cache
>> to master-readdir-ahead
>> [2018-07-31 10:41:49.290034] T [MSGID: 0]
>> [defaults.c:2716:default_lookup] 0-stack-trace: stack-address:
>> 0x7f36e4001058, winding from master-readdir-ahead to master-read-ahead
>> [2018-07-31 10:41:49.290052] T [MSGID: 0]
>> [defaults.c:2716:default_lookup] 0-stack-trace: stack-address:
>> 0x7f36e4001058, winding from master-read-ahead to master-write-behind
>> [2018-07-31 10:41:49.290077] T [MSGID: 0] [write-behind.c:2439:wb_lookup]
>> 0-stack-trace: stack-address: 0x7f36e4001058, winding from
>> master-write-behind to master-dht
>> [2018-07-31 10:41:49.290156] D [MSGID: 0]
>> [dht-common.c:3674:dht_do_fresh_lookup] 0-master-dht: /: no subvolume in
>> layout for path, checking on all the subvols to see if it is a directory
>> [2018-07-31 10:41:49.290180] D [MSGID: 0]
>> [dht-common.c:3688:dht_do_fresh_lookup] 0-master-dht: /: Found null hashed
>> subvol. Calling lookup on all nodes.
>> [2018-07-31 10:41:49.290199] T [MSGID: 0]
>> [dht-common.c:3695:dht_do_fresh_lookup] 0-stack-trace: stack-address:
>> 0x7f36e4001058, winding from master-dht to master-replicate-0
>> [2018-07-31 10:41:49.290245] I [MSGID: 108006]
>> [afr-common.c:5582:afr_local_init] 0-master-replicate-0: no subvolumes up
>> [2018-07-31 10:41:49.290291] D [MSGID: 0]
>> [afr-common.c:3212:afr_discover] 0-stack-trace: stack-address:
>> 0x7f36e4001058, master-replicate-0 returned -1 error: Transport endpoint is
>> not conne
>> cted [Transport endpoint is not connected]
>> [2018-07-31 10:41:49.290323] D [MSGID: 0]
>> [dht-common.c:1391:dht_lookup_dir_cbk] 0-master-dht: lookup of / on
>> master-replicate-0 returned error [Transport endpoint is not connected]
>> [2018-07-31 10:41:49.290350] T [MSGID: 0]
>> [dht-common.c:3695:dht_do_fresh_lookup] 0-stack-trace: stack-address:
>> 0x7f36e4001058, winding from master-dht to master-replicate-1
>> [2018-07-31 10:41:49.290381] I [MSGID: 108006]
>> [afr-common.c:5582:afr_local_init] 0-master-replicate-1: no subvolumes up
>> [2018-07-31 10:41:49.290403] D [MSGID: 0]
>> [afr-common.c:3212:afr_discover] 0-stack-trace: stack-address:
>> 0x7f36e4001058, master-replicate-1 returned -1 error: Transport endpoint is
>> not connected [Transport endpoint is not connected]
>> [2018-07-31 10:41:49.290427] D [MSGID: 0]
>> 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Raghavendra Gowdappa
On Thu, Aug 2, 2018 at 5:48 PM, Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

> I am facing a different issue on the softserve machines. The fuse mount
> itself is failing.
> I tried the day before yesterday to debug geo-rep failures. I discussed
> with Raghu, but could not root cause it.
>

Where can I find the complete client logs for this?

So none of the tests were passing. It happened on
> both machine instances I tried.
>
> 
> [2018-07-31 10:41:49.288117] D [fuse-bridge.c:5407:notify] 0-fuse: got
> event 6 on graph 0
> [2018-07-31 10:41:49.289427] D [fuse-bridge.c:4990:fuse_get_mount_status]
> 0-fuse: mount status is 0
> [2018-07-31 10:41:49.289555] D [fuse-bridge.c:4256:fuse_init]
> 0-glusterfs-fuse: Detected support for FUSE_AUTO_INVAL_DATA. Enabling
> fopen_keep_cache automatically.
> [2018-07-31 10:41:49.289591] T [fuse-bridge.c:278:send_fuse_iov]
> 0-glusterfs-fuse: writev() result 40/40
> [2018-07-31 10:41:49.289610] I [fuse-bridge.c:4314:fuse_init]
> 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel
> 7.22
> [2018-07-31 10:41:49.289627] I [fuse-bridge.c:4948:fuse_graph_sync]
> 0-fuse: switched to graph 0
> [2018-07-31 10:41:49.289696] T [MSGID: 0] [syncop.c:1261:syncop_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from fuse to
> meta-autoload
> [2018-07-31 10:41:49.289743] T [MSGID: 0] [defaults.c:2716:default_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from meta-autoload to
> master
> [2018-07-31 10:41:49.289787] T [MSGID: 0] [io-stats.c:2788:io_stats_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from master to
> master-md-cache
> [2018-07-31 10:41:49.289833] T [MSGID: 0] [md-cache.c:513:mdc_inode_iatt_get]
> 0-md-cache: mdc_inode_ctx_get failed (----
> 0001)
> [2018-07-31 10:41:49.289923] T [MSGID: 0] [md-cache.c:1200:mdc_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from master-md-cache
> to master-open-behind
> [2018-07-31 10:41:49.289946] T [MSGID: 0] [defaults.c:2716:default_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from
> master-open-behind to master-quick-read
> [2018-07-31 10:41:49.289973] T [MSGID: 0] [quick-read.c:556:qr_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from
> master-quick-read to master-io-cache
> [2018-07-31 10:41:49.290002] T [MSGID: 0] [io-cache.c:298:ioc_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from master-io-cache
> to master-readdir-ahead
> [2018-07-31 10:41:49.290034] T [MSGID: 0] [defaults.c:2716:default_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from
> master-readdir-ahead to master-read-ahead
> [2018-07-31 10:41:49.290052] T [MSGID: 0] [defaults.c:2716:default_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from
> master-read-ahead to master-write-behind
> [2018-07-31 10:41:49.290077] T [MSGID: 0] [write-behind.c:2439:wb_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from
> master-write-behind to master-dht
> [2018-07-31 10:41:49.290156] D [MSGID: 0] 
> [dht-common.c:3674:dht_do_fresh_lookup]
> 0-master-dht: /: no subvolume in layout for path, checking on all the
> subvols to see if it is a directory
> [2018-07-31 10:41:49.290180] D [MSGID: 0] 
> [dht-common.c:3688:dht_do_fresh_lookup]
> 0-master-dht: /: Found null hashed subvol. Calling lookup on all nodes.
> [2018-07-31 10:41:49.290199] T [MSGID: 0] 
> [dht-common.c:3695:dht_do_fresh_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from master-dht to
> master-replicate-0
> [2018-07-31 10:41:49.290245] I [MSGID: 108006]
> [afr-common.c:5582:afr_local_init] 0-master-replicate-0: no subvolumes up
> [2018-07-31 10:41:49.290291] D [MSGID: 0] [afr-common.c:3212:afr_discover]
> 0-stack-trace: stack-address: 0x7f36e4001058, master-replicate-0 returned
> -1 error: Transport endpoint is not conne
> cted [Transport endpoint is not connected]
> [2018-07-31 10:41:49.290323] D [MSGID: 0] 
> [dht-common.c:1391:dht_lookup_dir_cbk]
> 0-master-dht: lookup of / on master-replicate-0 returned error [Transport
> endpoint is not connected]
> [2018-07-31 10:41:49.290350] T [MSGID: 0] 
> [dht-common.c:3695:dht_do_fresh_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from master-dht to
> master-replicate-1
> [2018-07-31 10:41:49.290381] I [MSGID: 108006]
> [afr-common.c:5582:afr_local_init] 0-master-replicate-1: no subvolumes up
> [2018-07-31 10:41:49.290403] D [MSGID: 0] [afr-common.c:3212:afr_discover]
> 0-stack-trace: stack-address: 0x7f36e4001058, master-replicate-1 returned
> -1 error: Transport endpoint is not connected [Transport endpoint is not
> connected]
> [2018-07-31 10:41:49.290427] D [MSGID: 0] 
> [dht-common.c:1391:dht_lookup_dir_cbk]
> 0-master-dht: lookup of / on master-replicate-1 returned error [Transport
> endpoint is not connected]
> [2018-07-31 10:41:49.290452] D [MSGID: 0] 
> 

Re: [Gluster-devel] bug-1432542-mpx-restart-crash.t failures

2018-08-02 Thread Shyam Ranganathan
On 08/01/2018 11:10 PM, Nigel Babu wrote:
> Hi Shyam,
> 
> Amar and I sat down to debug this failure[1] this morning. There was a
> bit of fun looking at the logs. It looked like the test restarted
> itself. The first log entry is at 16:20:03. This test has a timeout of
> 400 seconds which is around 16:26:43.
> 
> However, if you account for the fact that we log from the second step or
> so, it looks like the test timed out and we restarted it. The first log
> entry is from a few steps in, which makes sense. I think your patch[2] to
> increase the timeout to 800 seconds is the right way forward.
> 
> The last step before the timeout is this
> [2018-07-30 16:26:29.160943]  : volume stop patchy-vol17 : SUCCESS
> [2018-07-30 16:26:40.222688]  : volume delete patchy-vol17 : SUCCESS
> 
> There are 20 volumes and I'm estimating 30 seconds per volume to clean up;
> with three volumes apparently still left to clean up when the 400-second
> timeout hit, it really needs at least a 90-second bump. You probably want
> to add some extra time so it passes on lcov as well. So right now the
> 800-second timeout looks good.

Unfortunately the timeout bump still does not clear lcov; see:
https://build.gluster.org/job/line-coverage/401/console
https://build.gluster.org/job/line-coverage/400/console
https://build.gluster.org/job/line-coverage/406/console

The first run of the test passes; then, as part of the full run, it fails again.

The patch also pushes the EXPECT_WITHIN up to 120 seconds... :(
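
A reference sketch of where those two knobs live in a .t file, in case anyone
else wants to experiment (values are placeholders, the SCRIPT_TIMEOUT
convention is from memory as the per-test override that run-tests.sh picks up,
and volume_count_is is a hypothetical helper, not an existing one):

  # near the top of tests/bugs/core/bug-1432542-mpx-restart-crash.t
  SCRIPT_TIMEOUT=800

  # the kind of per-step wait that the patch bumps to 120 seconds
  EXPECT_WITHIN 120 "20" volume_count_is   # hypothetical helper counting started volumes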

> 
> [1]: https://build.gluster.org/job/regression-test-burn-in/4051/
> [2]: https://review.gluster.org/#/c/20568/2
> -- 
> nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Pranith Kumar Karampuri
On Thu, Aug 2, 2018 at 7:19 PM Atin Mukherjee  wrote:

> New addition - tests/basic/volume.t - failed at least twice with an shd core.
>
> One such ref -
> https://build.gluster.org/job/centos7-regression/2058/console
>

I will take a look.


>
>
> On Thu, Aug 2, 2018 at 6:28 PM Sankarshan Mukhopadhyay <
> sankarshan.mukhopadh...@gmail.com> wrote:
>
>> On Thu, Aug 2, 2018 at 5:48 PM, Kotresh Hiremath Ravishankar
>>  wrote:
>> > I am facing a different issue on the softserve machines. The fuse mount
>> > itself is failing.
>> > I tried the day before yesterday to debug geo-rep failures. I discussed
>> > with Raghu, but could not root cause it. So none of the tests were
>> > passing. It happened on both machine instances I tried.
>> >
>>
>> Ugh! -infra team should have an issue to work with and resolve this.
>>
>>
>> --
>> sankarshan mukhopadhyay
>> 
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel



-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Atin Mukherjee
New addition - tests/basic/volume.t - failed at least twice with an shd core.

One such ref - https://build.gluster.org/job/centos7-regression/2058/console


On Thu, Aug 2, 2018 at 6:28 PM Sankarshan Mukhopadhyay <
sankarshan.mukhopadh...@gmail.com> wrote:

> On Thu, Aug 2, 2018 at 5:48 PM, Kotresh Hiremath Ravishankar
>  wrote:
> > I am facing a different issue on the softserve machines. The fuse mount
> > itself is failing.
> > I tried the day before yesterday to debug geo-rep failures. I discussed
> > with Raghu, but could not root cause it. So none of the tests were
> > passing. It happened on both machine instances I tried.
> >
>
> Ugh! -infra team should have an issue to work with and resolve this.
>
>
> --
> sankarshan mukhopadhyay
> 
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Coverity covscan for 2018-08-02-47cbe34d (master branch)

2018-08-02 Thread staticanalysis


GlusterFS Coverity covscan results for the master branch are available from
http://download.gluster.org/pub/gluster/glusterfs/static-analysis/master/glusterfs-coverity/2018-08-02-47cbe34d/

Coverity covscan results for other active branches are also available at
http://download.gluster.org/pub/gluster/glusterfs/static-analysis/

___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Sankarshan Mukhopadhyay
On Thu, Aug 2, 2018 at 5:48 PM, Kotresh Hiremath Ravishankar
 wrote:
> I am facing a different issue on the softserve machines. The fuse mount
> itself is failing.
> I tried the day before yesterday to debug geo-rep failures. I discussed
> with Raghu, but could not root cause it. So none of the tests were passing.
> It happened on both machine instances I tried.
>

Ugh! -infra team should have an issue to work with and resolve this.


-- 
sankarshan mukhopadhyay

___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Xavi Hernandez
On Thu, Aug 2, 2018 at 1:42 PM Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

>
>
> On Thu, Aug 2, 2018 at 5:05 PM, Atin Mukherjee  > wrote:
>
>>
>>
>> On Thu, Aug 2, 2018 at 4:37 PM Kotresh Hiremath Ravishankar <
>> khire...@redhat.com> wrote:
>>
>>>
>>>
>>> On Thu, Aug 2, 2018 at 3:49 PM, Xavi Hernandez 
>>> wrote:
>>>
 On Thu, Aug 2, 2018 at 6:14 AM Atin Mukherjee 
 wrote:

>
>
> On Tue, Jul 31, 2018 at 10:11 PM Atin Mukherjee 
> wrote:
>
>> I just went through the nightly regression report of brick mux runs
>> and here's what I can summarize.
>>
>>
>> =
>> Fails only with brick-mux
>>
>> =
>> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even
>> after 400 secs. Refer
>> https://fstat.gluster.org/failure/209?state=2_date=2018-06-30_date=2018-07-31=all,
>> specifically the latest report
>> https://build.gluster.org/job/regression-test-burn-in/4051/consoleText
>> . Wasn't timing out as frequently as it was till 12 July. But since 27
>> July, it has timed out twice. Beginning to believe commit
>> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400
>> secs isn't sufficient enough (Mohit?)
>>
>> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
>> (Ref -
>> https://build.gluster.org/job/regression-test-with-multiplex/814/console)
>> -  Test fails only in brick-mux mode, AI on Atin to look at and get back.
>>
>> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
>> https://build.gluster.org/job/regression-test-with-multiplex/813/console)
>> - Seems like failed just twice in last 30 days as per
>> https://fstat.gluster.org/failure/251?state=2_date=2018-06-30_date=2018-07-31=all.
>> Need help from AFR team.
>>
>> tests/bugs/quota/bug-1293601.t (
>> https://build.gluster.org/job/regression-test-with-multiplex/812/console)
>> - Hasn't failed after 26 July and earlier it was failing regularly. Did 
>> we
>> fix this test through any patch (Mohit?)
>>
>> tests/bitrot/bug-1373520.t - (
>> https://build.gluster.org/job/regression-test-with-multiplex/811/console)
>> - Hasn't failed after 27 July and earlier it was failing regularly. Did 
>> we
>> fix this test through any patch (Mohit?)
>>
>
> I see this has failed in day before yesterday's regression run as well
> (and I could reproduce it locally with brick mux enabled). The test fails
> in healing a file within a particular time period.
>
> 15:55:19 not ok 25 Got "0" instead of "512", LINENUM:55
> 15:55:19 FAILED COMMAND: 512 path_size /d/backends/patchy5/FILE1
>
> Need EC dev's help here.
>

 I'm not sure where the problem is exactly. I've seen that when the test
 fails, self-heal is attempting to heal the file, but when the file is
 accessed, an Input/Output error is returned, aborting heal. I've checked
 that a heal is attempted every time the file is accessed, but it fails
 always. This error seems to come from bit-rot stub xlator.

 When in this situation, if I stop and start the volume, self-heal
 immediately heals the files. It seems like a stale state that is kept by
 the stub xlator, preventing the file from being healed.

 Adding bit-rot maintainers for help on this one.

>>>
>>> Bitrot-stub marks the file as corrupted in inode_ctx. But when the file
>>> and its hardlink are deleted from that brick and a lookup is done
>>> on the file, it cleans up the marker on getting ENOENT. This is part of
>>> the recovery steps, and only md-cache is disabled during the process.
>>> Are there any other perf xlators that need to be disabled for this
>>> scenario in order to get a lookup/revalidate on the brick where
>>> the back-end file is deleted?
>>>
>>
>> But the same test doesn't fail with brick multiplexing not enabled. Do we
>> know why?
>>
> Don't know, something to do with perf xlators I suppose. It's not
> reproduced on my local system with brick-mux enabled either. But it's
> happening on Xavi's system.
>
> Xavi,
> Could you try with the patch [1] and let me know whether it fixes the
> issue.
>

With the additional performance xlators disabled, it still happens.

The only thing that I've observed is that if I add a sleep just before
stopping the volume, the test always seems to pass. Maybe there are some
background updates going on? (ec does background updates, but I'm not sure
how this can be related to the Input/Output error when accessing the brick
file).
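
For reference, "additional performance xlators disabled" means switching off
the client-side perf xlators seen in the graph from the earlier log. A sketch
of how that is typically done (volume option names as I know them; this is
not a claim about the exact commands used in this run):

  gluster volume set patchy performance.quick-read off
  gluster volume set patchy performance.io-cache off
  gluster volume set patchy performance.read-ahead off
  gluster volume set patchy performance.readdir-ahead off
  gluster volume set patchy performance.write-behind off
  gluster volume set patchy performance.open-behind off
  gluster volume set patchy performance.stat-prefetch off   # md-cache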

Xavi


> 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Kotresh Hiremath Ravishankar
I am facing a different issue on the softserve machines. The fuse mount
itself is failing.
I tried the day before yesterday to debug geo-rep failures. I discussed
with Raghu, but could not root cause it. So none of the tests were passing.
It happened on both machine instances I tried.
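
For anyone looking for the complete client log: the trace below comes from a
fuse client run with a raised log level, and the log file lands under
/var/log/glusterfs/, named after the mount point. A rough sketch, with the
volume name and mount point assumed:

  glusterfs -s $H0 --volfile-id $V0 --log-level=TRACE /mnt/glusterfs/0
  less /var/log/glusterfs/mnt-glusterfs-0.log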


[2018-07-31 10:41:49.288117] D [fuse-bridge.c:5407:notify] 0-fuse: got
event 6 on graph 0
[2018-07-31 10:41:49.289427] D [fuse-bridge.c:4990:fuse_get_mount_status]
0-fuse: mount status is 0
[2018-07-31 10:41:49.289555] D [fuse-bridge.c:4256:fuse_init]
0-glusterfs-fuse: Detected support for FUSE_AUTO_INVAL_DATA. Enabling
fopen_keep_cache automatically.
[2018-07-31 10:41:49.289591] T [fuse-bridge.c:278:send_fuse_iov]
0-glusterfs-fuse: writev() result 40/40
[2018-07-31 10:41:49.289610] I [fuse-bridge.c:4314:fuse_init]
0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel
7.22
[2018-07-31 10:41:49.289627] I [fuse-bridge.c:4948:fuse_graph_sync] 0-fuse:
switched to graph 0
[2018-07-31 10:41:49.289696] T [MSGID: 0] [syncop.c:1261:syncop_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from fuse to
meta-autoload
[2018-07-31 10:41:49.289743] T [MSGID: 0] [defaults.c:2716:default_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from meta-autoload to
master
[2018-07-31 10:41:49.289787] T [MSGID: 0] [io-stats.c:2788:io_stats_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from master to
master-md-cache
[2018-07-31 10:41:49.289833] T [MSGID: 0]
[md-cache.c:513:mdc_inode_iatt_get] 0-md-cache: mdc_inode_ctx_get failed
(----0001)
[2018-07-31 10:41:49.289923] T [MSGID: 0] [md-cache.c:1200:mdc_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from master-md-cache
to master-open-behind
[2018-07-31 10:41:49.289946] T [MSGID: 0] [defaults.c:2716:default_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from
master-open-behind to master-quick-read
[2018-07-31 10:41:49.289973] T [MSGID: 0] [quick-read.c:556:qr_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from
master-quick-read to master-io-cache
[2018-07-31 10:41:49.290002] T [MSGID: 0] [io-cache.c:298:ioc_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from master-io-cache
to master-readdir-ahead
[2018-07-31 10:41:49.290034] T [MSGID: 0] [defaults.c:2716:default_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from
master-readdir-ahead to master-read-ahead
[2018-07-31 10:41:49.290052] T [MSGID: 0] [defaults.c:2716:default_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from
master-read-ahead to master-write-behind
[2018-07-31 10:41:49.290077] T [MSGID: 0] [write-behind.c:2439:wb_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from
master-write-behind to master-dht
[2018-07-31 10:41:49.290156] D [MSGID: 0]
[dht-common.c:3674:dht_do_fresh_lookup] 0-master-dht: /: no subvolume in
layout for path, checking on all the subvols to see if it is a directory
[2018-07-31 10:41:49.290180] D [MSGID: 0]
[dht-common.c:3688:dht_do_fresh_lookup] 0-master-dht: /: Found null hashed
subvol. Calling lookup on all nodes.
[2018-07-31 10:41:49.290199] T [MSGID: 0]
[dht-common.c:3695:dht_do_fresh_lookup] 0-stack-trace: stack-address:
0x7f36e4001058, winding from master-dht to master-replicate-0
[2018-07-31 10:41:49.290245] I [MSGID: 108006]
[afr-common.c:5582:afr_local_init] 0-master-replicate-0: no subvolumes up
[2018-07-31 10:41:49.290291] D [MSGID: 0] [afr-common.c:3212:afr_discover]
0-stack-trace: stack-address: 0x7f36e4001058, master-replicate-0 returned
-1 error: Transport endpoint is not conne
cted [Transport endpoint is not connected]
[2018-07-31 10:41:49.290323] D [MSGID: 0]
[dht-common.c:1391:dht_lookup_dir_cbk] 0-master-dht: lookup of / on
master-replicate-0 returned error [Transport endpoint is not connected]
[2018-07-31 10:41:49.290350] T [MSGID: 0]
[dht-common.c:3695:dht_do_fresh_lookup] 0-stack-trace: stack-address:
0x7f36e4001058, winding from master-dht to master-replicate-1
[2018-07-31 10:41:49.290381] I [MSGID: 108006]
[afr-common.c:5582:afr_local_init] 0-master-replicate-1: no subvolumes up
[2018-07-31 10:41:49.290403] D [MSGID: 0] [afr-common.c:3212:afr_discover]
0-stack-trace: stack-address: 0x7f36e4001058, master-replicate-1 returned
-1 error: Transport endpoint is not connected [Transport endpoint is not
connected]
[2018-07-31 10:41:49.290427] D [MSGID: 0]
[dht-common.c:1391:dht_lookup_dir_cbk] 0-master-dht: lookup of / on
master-replicate-1 returned error [Transport endpoint is not connected]
[2018-07-31 10:41:49.290452] D [MSGID: 0]
[dht-common.c:1574:dht_lookup_dir_cbk] 0-stack-trace: stack-address:
0x7f36e4001058, master-dht returned -1 error: Transport endpoint is not
connected [Transport endpoint is not connected]
[2018-07-31 10:41:49.290477] D [MSGID: 0]
[write-behind.c:2393:wb_lookup_cbk] 0-stack-trace: stack-address:
0x7f36e4001058, master-write-behind returned -1 error: Transport endpoint

Re: [Gluster-devel] ./tests/bugs/snapshot/bug-1167580-set-proper-uid-and-gid-during-nfs-access.t fails if non-anonymous fds are used in read path

2018-08-02 Thread Raghavendra Gowdappa
On Thu, Aug 2, 2018 at 3:54 PM, Rafi Kavungal Chundattu Parambil <
rkavu...@redhat.com> wrote:

> Yes, I think we can mark the test as bad for now. We found two issues that
> cause the failures.
>
> One issue is with the usage of anonymous fds from a fuse mount. posix-acl,
> which sits on the brick graph, does the authentication check during open.
> But with anonymous fds we may not have an explicit open received before,
> say, a read fop. As a result, posix-acl is not getting honoured with
> anonymous fds.
>
> The second issue is with snapd and libgfapi, where snapd uses libgfapi to
> get the information from snapshot bricks. But the uid and gid received
> from a client are not passed through libgfapi.
>
> I will file two separate bugs to track these issues.
>
> Since both of these issues are not relevant to the fix which Raghavendra
> sent, I agree to mark the tests as bad.
>

Thanks Rafi.
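
For the stop-gap itself: if I remember the test framework convention right,
marking a .t as bad is just a tagged comment in the test file that the
regression job greps for, along these lines (the exact tag names are from
memory, and the BZ is the one filed above):

  # added near the top of
  # tests/bugs/snapshot/bug-1167580-set-proper-uid-and-gid-during-nfs-access.t
  #G_TESTDEF_TEST_STATUS_CENTOS6=BAD_TEST,BUG=1611532
  #G_TESTDEF_TEST_STATUS_NETBSD7=BAD_TEST,BUG=1611532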


>
> Regards
> Rafi KC
>
>
> - Original Message -
> From: "Raghavendra Gowdappa" 
> To: "Sunny Kumar" , "Rafi" 
> Cc: "Gluster Devel" 
> Sent: Thursday, August 2, 2018 3:23:00 PM
> Subject: Re: 
> ./tests/bugs/snapshot/bug-1167580-set-proper-uid-and-gid-during-nfs-access.t
> fails if non-anonymous fds are used in read path
>
> I've filed  a bug to track this failure:
> https://bugzilla.redhat.com/show_bug.cgi?id=1611532
>
> As a stop gap measure I propose to mark the test as Bad to unblock patches
> [1][2]. Are maintainers of snapshot in agreement with this?
>
> regards,
> Raghavendra
>
> On Wed, Aug 1, 2018 at 10:28 AM, Raghavendra Gowdappa  >
> wrote:
>
> > Sunny/Rafi,
> >
> > I was trying to debug regression failures on [1]. Note that patch [1]
> only
> > disables usage of anonymous fds on readv. So, I tried the same test
> > disabling performance.open-behind
> >
> > [root@rhs-client27 glusterfs]# git diff
> > diff --git a/tests/bugs/snapshot/bug-1167580-set-proper-uid-and-
> gid-during-nfs-access.t
> > b/tests/bugs/snapshot/bug-1167580-set-proper-uid-and-
> > gid-during-nfs-access.t
> > index 3776451..cedf96b 100644
> > --- a/tests/bugs/snapshot/bug-1167580-set-proper-uid-and-
> > gid-during-nfs-access.t
> > +++ b/tests/bugs/snapshot/bug-1167580-set-proper-uid-and-
> > gid-during-nfs-access.t
> > @@ -79,6 +79,7 @@ TEST $CLI volume start $V0
> >  EXPECT_WITHIN $NFS_EXPORT_TIMEOUT "1" is_nfs_export_available
> >  TEST glusterfs -s $H0 --volfile-id $V0 $M0
> >  TEST mount_nfs $H0:/$V0 $N0 nolock
> > +TEST $CLI volume set $V0 performance.open-behind off
> >
> >  # Create 2 user
> >  user1=$(get_new_user)
> >
> >
> > With the above change, I can see consistent failures of the test just
> like
> > observed in [1].
> >
> > TEST 23 (line 154): Y check_if_permitted eeefadc
> > /mnt/glusterfs/0/.snaps/snap2/file3 cat
> > su: warning: cannot change directory to /tmp/tmp.eaKBKS0lfM/eeefadc: No
> > such file or directory
> > cat: /mnt/glusterfs/0/.snaps/snap2/file3: Permission denied
> > su: warning: cannot change directory to /tmp/tmp.eaKBKS0lfM/eeefadc: No
> > such file or directory
> > cat: /mnt/glusterfs/0/.snaps/snap2/file3: Permission denied
> > su: warning: cannot change directory to /tmp/tmp.eaKBKS0lfM/eeefadc: No
> > such file or directory
> > cat: /mnt/glusterfs/0/.snaps/snap2/file3: Permission denied
> > su: warning: cannot change directory to /tmp/tmp.eaKBKS0lfM/eeefadc: No
> > such file or directory
> > cat: /mnt/glusterfs/0/.snaps/snap2/file3: Permission denied
> >
> >
> > Test Summary Report
> > ---
> > ./tests/bugs/snapshot/bug-1167580-set-proper-uid-and-
> gid-during-nfs-access.t
> > (Wstat: 0 Tests: 46 Failed: 1)
> >   Failed test:  23
> >
> >
> > I had a feeling this test fails spuriously and the spurious nature is
> tied
> > with whether open-behind uses an anonymous fd or a regular fd for read.
> >
> > @Sunny,
> >
> > This test is blocking two of my patches - [1] and [2]. Can I mark this
> > test as bad and proceed with my work on [1] and [2]?
> >
> > [1] https://review.gluster.org/20511
> > [2] https://review.gluster.org/20428
> >
> > regards,
> > Raghavendra
> >
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Nigel Babu
On Thu, Aug 2, 2018 at 5:12 PM Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

> Don't know, something to do with perf xlators I suppose. It's not
> reproduced on my local system with brick-mux enabled either. But it's
> happening on Xavi's system.
>
> Xavi,
> Could you try with the patch [1] and let me know whether it fixes the
> issue.
>
> [1] https://review.gluster.org/#/c/20619/1
>

If you cannot reproduce it on your laptop, why don't you request a machine
from softserve[1] and try it out?

[1]:
https://github.com/gluster/softserve/wiki/Running-Regressions-on-clean-Centos-7-machine

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Kotresh Hiremath Ravishankar
On Thu, Aug 2, 2018 at 5:05 PM, Atin Mukherjee 
wrote:

>
>
> On Thu, Aug 2, 2018 at 4:37 PM Kotresh Hiremath Ravishankar <
> khire...@redhat.com> wrote:
>
>>
>>
>> On Thu, Aug 2, 2018 at 3:49 PM, Xavi Hernandez 
>> wrote:
>>
>>> On Thu, Aug 2, 2018 at 6:14 AM Atin Mukherjee 
>>> wrote:
>>>


 On Tue, Jul 31, 2018 at 10:11 PM Atin Mukherjee 
 wrote:

> I just went through the nightly regression report of brick mux runs
> and here's what I can summarize.
>
> 
> 
> =
> Fails only with brick-mux
> 
> 
> =
> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even
> after 400 secs. Refer https://fstat.gluster.org/
> failure/209?state=2_date=2018-06-30_date=2018-
> 07-31=all, specifically the latest report
> https://build.gluster.org/job/regression-test-burn-in/4051/consoleText
> . Wasn't timing out as frequently as it was till 12 July. But since 27
> July, it has timed out twice. Beginning to believe commit
> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now
> 400 secs isn't sufficient enough (Mohit?)
>
> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
> (Ref - https://build.gluster.org/job/regression-test-with-
> multiplex/814/console) -  Test fails only in brick-mux mode, AI on
> Atin to look at and get back.
>
> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
> https://build.gluster.org/job/regression-test-with-
> multiplex/813/console) - Seems like failed just twice in last 30 days
> as per https://fstat.gluster.org/failure/251?state=2_
> date=2018-06-30_date=2018-07-31=all. Need help from AFR
> team.
>
> tests/bugs/quota/bug-1293601.t (https://build.gluster.org/
> job/regression-test-with-multiplex/812/console) - Hasn't failed after
> 26 July and earlier it was failing regularly. Did we fix this test through
> any patch (Mohit?)
>
> tests/bitrot/bug-1373520.t - (https://build.gluster.org/
> job/regression-test-with-multiplex/811/console)  - Hasn't failed
> after 27 July and earlier it was failing regularly. Did we fix this test
> through any patch (Mohit?)
>

 I see this has failed in day before yesterday's regression run as well
 (and I could reproduce it locally with brick mux enabled). The test fails
 in healing a file within a particular time period.

 *15:55:19* not ok 25 Got "0" instead of "512", LINENUM:55*15:55:19* FAILED 
 COMMAND: 512 path_size /d/backends/patchy5/FILE1

 Need EC dev's help here.

>>>
>>> I'm not sure where the problem is exactly. I've seen that when the test
>>> fails, self-heal is attempting to heal the file, but when the file is
>>> accessed, an Input/Output error is returned, aborting heal. I've checked
>>> that a heal is attempted every time the file is accessed, but it fails
>>> always. This error seems to come from bit-rot stub xlator.
>>>
>>> When in this situation, if I stop and start the volume, self-heal
>>> immediately heals the files. It seems like an stale state that is kept by
>>> the stub xlator, preventing the file from being healed.
>>>
>>> Adding bit-rot maintainers for help on this one.
>>>
>>
>> Bitrot-stub marks the file as corrupted in inode_ctx. But when the file
>> and it's hardlink are deleted from that brick and a lookup is done
>> on the file, it cleans up the marker on getting ENOENT. This is part of
>> recovery steps, and only md-cache is disabled during the process.
>> Is there any other perf xlators that needs to be disabled for this
>> scenario to expect a lookup/revalidate on the brick where
>> the back end file is deleted?
>>
>
> But the same test doesn't fail with brick multiplexing not enabled. Do we
> know why?
>
Don't know, something to do with perf xlators I suppose. It's not
reproduced on my local system with brick-mux enabled either. But it's
happening on Xavi's system.

Xavi,
Could you try with the patch [1] and let me know whether it fixes the issue.

[1] https://review.gluster.org/#/c/20619/1

>
>
>>
>>> Xavi
>>>
>>>
>>>

> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a
> core, not sure if related to brick mux or not, so not sure if brick mux is
> culprit here or not. Ref - https://build.gluster.org/job/
> regression-test-with-multiplex/806/console . Seems to be a glustershd
> crash. Need help from AFR folks.
>
> 
> 
> 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Atin Mukherjee
On Thu, Aug 2, 2018 at 4:37 PM Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

>
>
> On Thu, Aug 2, 2018 at 3:49 PM, Xavi Hernandez 
> wrote:
>
>> On Thu, Aug 2, 2018 at 6:14 AM Atin Mukherjee 
>> wrote:
>>
>>>
>>>
>>> On Tue, Jul 31, 2018 at 10:11 PM Atin Mukherjee 
>>> wrote:
>>>
 I just went through the nightly regression report of brick mux runs and
 here's what I can summarize.


 =
 Fails only with brick-mux

 =
 tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
 400 secs. Refer
 https://fstat.gluster.org/failure/209?state=2_date=2018-06-30_date=2018-07-31=all,
 specifically the latest report
 https://build.gluster.org/job/regression-test-burn-in/4051/consoleText
 . Wasn't timing out as frequently as it was till 12 July. But since 27
 July, it has timed out twice. Beginning to believe commit
 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400
 secs isn't sufficient enough (Mohit?)

 tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
 (Ref -
 https://build.gluster.org/job/regression-test-with-multiplex/814/console)
 -  Test fails only in brick-mux mode, AI on Atin to look at and get back.

 tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
 https://build.gluster.org/job/regression-test-with-multiplex/813/console)
 - Seems like failed just twice in last 30 days as per
 https://fstat.gluster.org/failure/251?state=2_date=2018-06-30_date=2018-07-31=all.
 Need help from AFR team.

 tests/bugs/quota/bug-1293601.t (
 https://build.gluster.org/job/regression-test-with-multiplex/812/console)
 - Hasn't failed after 26 July and earlier it was failing regularly. Did we
 fix this test through any patch (Mohit?)

 tests/bitrot/bug-1373520.t - (
 https://build.gluster.org/job/regression-test-with-multiplex/811/console)
 - Hasn't failed after 27 July and earlier it was failing regularly. Did we
 fix this test through any patch (Mohit?)

>>>
>>> I see this has failed in day before yesterday's regression run as well
>>> (and I could reproduce it locally with brick mux enabled). The test fails
>>> in healing a file within a particular time period.
>>>
>>> *15:55:19* not ok 25 Got "0" instead of "512", LINENUM:55*15:55:19* FAILED 
>>> COMMAND: 512 path_size /d/backends/patchy5/FILE1
>>>
>>> Need EC dev's help here.
>>>
>>
>> I'm not sure where the problem is exactly. I've seen that when the test
>> fails, self-heal is attempting to heal the file, but when the file is
>> accessed, an Input/Output error is returned, aborting heal. I've checked
>> that a heal is attempted every time the file is accessed, but it fails
>> always. This error seems to come from bit-rot stub xlator.
>>
>> When in this situation, if I stop and start the volume, self-heal
>> immediately heals the files. It seems like an stale state that is kept by
>> the stub xlator, preventing the file from being healed.
>>
>> Adding bit-rot maintainers for help on this one.
>>
>
> Bitrot-stub marks the file as corrupted in inode_ctx. But when the file
> and it's hardlink are deleted from that brick and a lookup is done
> on the file, it cleans up the marker on getting ENOENT. This is part of
> recovery steps, and only md-cache is disabled during the process.
> Is there any other perf xlators that needs to be disabled for this
> scenario to expect a lookup/revalidate on the brick where
> the back end file is deleted?
>

But the same test doesn't fail with brick multiplexing not enabled. Do we
know why?


>
>> Xavi
>>
>>
>>
>>>
 tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core,
 not sure if related to brick mux or not, so not sure if brick mux is
 culprit here or not. Ref -
 https://build.gluster.org/job/regression-test-with-multiplex/806/console
 . Seems to be a glustershd crash. Need help from AFR folks.


 =
 Fails for non-brick mux case too

 =
 tests/bugs/distribute/bug-1122443.t 0 Seems to be failing at my setup
 very often, with out brick mux as well. Refer
 https://build.gluster.org/job/regression-test-burn-in/4050/consoleText
 . There's an email in gluster-devel and a BZ 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Kotresh Hiremath Ravishankar
On Thu, Aug 2, 2018 at 4:50 PM, Amar Tumballi  wrote:

>
>
> On Thu, Aug 2, 2018 at 4:37 PM, Kotresh Hiremath Ravishankar <
> khire...@redhat.com> wrote:
>
>>
>>
>> On Thu, Aug 2, 2018 at 3:49 PM, Xavi Hernandez 
>> wrote:
>>
>>> On Thu, Aug 2, 2018 at 6:14 AM Atin Mukherjee 
>>> wrote:
>>>


 On Tue, Jul 31, 2018 at 10:11 PM Atin Mukherjee 
 wrote:

> I just went through the nightly regression report of brick mux runs
> and here's what I can summarize.
>
> 
> 
> =
> Fails only with brick-mux
> 
> 
> =
> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even
> after 400 secs. Refer https://fstat.gluster.org/fail
> ure/209?state=2_date=2018-06-30_date=2018-07-31=all,
> specifically the latest report https://build.gluster.org/job/
> regression-test-burn-in/4051/consoleText . Wasn't timing out as
> frequently as it was till 12 July. But since 27 July, it has timed out
> twice. Beginning to believe commit 
> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2
> has added the delay and now 400 secs isn't sufficient enough (Mohit?)
>
> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
> (Ref - https://build.gluster.org/job/regression-test-with-multiplex
> /814/console) -  Test fails only in brick-mux mode, AI on Atin to
> look at and get back.
>
> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
> https://build.gluster.org/job/regression-test-with-multiple
> x/813/console) - Seems like failed just twice in last 30 days as per
> https://fstat.gluster.org/failure/251?state=2_date=201
> 8-06-30_date=2018-07-31=all. Need help from AFR team.
>
> tests/bugs/quota/bug-1293601.t (https://build.gluster.org/job
> /regression-test-with-multiplex/812/console) - Hasn't failed after 26
> July and earlier it was failing regularly. Did we fix this test through 
> any
> patch (Mohit?)
>
> tests/bitrot/bug-1373520.t - (https://build.gluster.org/job
> /regression-test-with-multiplex/811/console)  - Hasn't failed after
> 27 July and earlier it was failing regularly. Did we fix this test through
> any patch (Mohit?)
>

 I see this has failed in day before yesterday's regression run as well
 (and I could reproduce it locally with brick mux enabled). The test fails
 in healing a file within a particular time period.

 *15:55:19* not ok 25 Got "0" instead of "512", LINENUM:55*15:55:19* FAILED 
 COMMAND: 512 path_size /d/backends/patchy5/FILE1

 Need EC dev's help here.

>>>
>>> I'm not sure where the problem is exactly. I've seen that when the test
>>> fails, self-heal is attempting to heal the file, but when the file is
>>> accessed, an Input/Output error is returned, aborting heal. I've checked
>>> that a heal is attempted every time the file is accessed, but it fails
>>> always. This error seems to come from bit-rot stub xlator.
>>>
>>> When in this situation, if I stop and start the volume, self-heal
>>> immediately heals the files. It seems like an stale state that is kept by
>>> the stub xlator, preventing the file from being healed.
>>>
>>> Adding bit-rot maintainers for help on this one.
>>>
>>
>> Bitrot-stub marks the file as corrupted in inode_ctx. But when the file
>> and it's hardlink are deleted from that brick and a lookup is done
>> on the file, it cleans up the marker on getting ENOENT. This is part of
>> recovery steps, and only md-cache is disabled during the process.
>> Is there any other perf xlators that needs to be disabled for this
>> scenario to expect a lookup/revalidate on the brick where
>> the back end file is deleted?
>>
>
> Can you make sure there are no perf xlators in the bitrot stack while doing
> it? It may not be a good idea to keep them there for internal 'validations'.
>

Ok, I will send the patch in some time.

>
>
>>
>>> Xavi
>>>
>>>
>>>

> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a
> core, not sure if related to brick mux or not, so not sure if brick mux is
> culprit here or not. Ref - https://build.gluster.org/job/
> regression-test-with-multiplex/806/console . Seems to be a glustershd
> crash. Need help from AFR folks.
>
> 
> 
> =
> Fails for non-brick mux case too
> 
> 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Amar Tumballi
On Thu, Aug 2, 2018 at 4:37 PM, Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

>
>
> On Thu, Aug 2, 2018 at 3:49 PM, Xavi Hernandez 
> wrote:
>
>> On Thu, Aug 2, 2018 at 6:14 AM Atin Mukherjee 
>> wrote:
>>
>>>
>>>
>>> On Tue, Jul 31, 2018 at 10:11 PM Atin Mukherjee 
>>> wrote:
>>>
 I just went through the nightly regression report of brick mux runs and
 here's what I can summarize.

 
 
 =
 Fails only with brick-mux
 
 
 =
 tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
 400 secs. Refer https://fstat.gluster.org/fail
 ure/209?state=2_date=2018-06-30_date=2018-07-31=all,
 specifically the latest report https://build.gluster.org/job/
 regression-test-burn-in/4051/consoleText . Wasn't timing out as
 frequently as it was till 12 July. But since 27 July, it has timed out
 twice. Beginning to believe commit 9400b6f2c8aa219a493961e0ab9770b7f12e80d2
 has added the delay and now 400 secs isn't sufficient enough (Mohit?)

 tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
 (Ref - https://build.gluster.org/job/regression-test-with-multiplex
 /814/console) -  Test fails only in brick-mux mode, AI on Atin to look
 at and get back.

 tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
 https://build.gluster.org/job/regression-test-with-multiple
 x/813/console) - Seems like failed just twice in last 30 days as per
 https://fstat.gluster.org/failure/251?state=2_date=
 2018-06-30_date=2018-07-31=all. Need help from AFR team.

 tests/bugs/quota/bug-1293601.t (https://build.gluster.org/job
 /regression-test-with-multiplex/812/console) - Hasn't failed after 26
 July and earlier it was failing regularly. Did we fix this test through any
 patch (Mohit?)

 tests/bitrot/bug-1373520.t - (https://build.gluster.org/job
 /regression-test-with-multiplex/811/console)  - Hasn't failed after 27
 July and earlier it was failing regularly. Did we fix this test through any
 patch (Mohit?)

>>>
>>> I see this has failed in day before yesterday's regression run as well
>>> (and I could reproduce it locally with brick mux enabled). The test fails
>>> in healing a file within a particular time period.
>>>
>>> *15:55:19* not ok 25 Got "0" instead of "512", LINENUM:55*15:55:19* FAILED 
>>> COMMAND: 512 path_size /d/backends/patchy5/FILE1
>>>
>>> Need EC dev's help here.
>>>
>>
>> I'm not sure where the problem is exactly. I've seen that when the test
>> fails, self-heal is attempting to heal the file, but when the file is
>> accessed, an Input/Output error is returned, aborting heal. I've checked
>> that a heal is attempted every time the file is accessed, but it fails
>> always. This error seems to come from bit-rot stub xlator.
>>
>> When in this situation, if I stop and start the volume, self-heal
>> immediately heals the files. It seems like an stale state that is kept by
>> the stub xlator, preventing the file from being healed.
>>
>> Adding bit-rot maintainers for help on this one.
>>
>
> Bitrot-stub marks the file as corrupted in inode_ctx. But when the file
> and it's hardlink are deleted from that brick and a lookup is done
> on the file, it cleans up the marker on getting ENOENT. This is part of
> recovery steps, and only md-cache is disabled during the process.
> Is there any other perf xlators that needs to be disabled for this
> scenario to expect a lookup/revalidate on the brick where
> the back end file is deleted?
>

Can you make sure there are no perf xlators in the bitrot stack while doing it?
It may not be a good idea to keep them there for internal 'validations'.


>
>> Xavi
>>
>>
>>
>>>
 tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a
 core, not sure if related to brick mux or not, so not sure if brick mux is
 culprit here or not. Ref - https://build.gluster.org/job/
 regression-test-with-multiplex/806/console . Seems to be a glustershd
 crash. Need help from AFR folks.

 
 
 =
 Fails for non-brick mux case too
 
 
 =
 tests/bugs/distribute/bug-1122443.t 0 Seems to be failing at my setup
 very often, with out brick mux as well. Refer
 

Re: [Gluster-devel] ./tests/bugs/snapshot/bug-1167580-set-proper-uid-and-gid-during-nfs-access.t fails if non-anonymous fds are used in read path

2018-08-02 Thread Rafi Kavungal Chundattu Parambil
Yes, I think we can mark the test as bad for now. We found two issues that
cause the failures.

One issue is with the usage of anonymous fds from a fuse mount. posix-acl, which
sits in the brick graph, does its permission check during open. But with
anonymous fds we may not receive an explicit open before, say, a read fop. As a
result, posix-acl is not honoured when anonymous fds are used.

The second issue is with snapd, which uses libgfapi to get the information from
the snapshot bricks. The uid and gid received from a client are not passed
through libgfapi.

I will file two separate bugs to track these issues.

Since both of these issues are unrelated to the fix which Raghavendra sent, I
agree to mark the test as bad.
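
For reference, retiring the test would be roughly a one-line change per
platform. A hedged sketch, assuming the G_TESTDEF markers read by run-tests.sh
are still the mechanism for this:

  # near the top of bug-1167580-set-proper-uid-and-gid-during-nfs-access.t
  #G_TESTDEF_TEST_STATUS_CENTOS6=BAD_TEST,BUG=1611532
  #G_TESTDEF_TEST_STATUS_NETBSD7=BAD_TEST,BUG=1611532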


Regards
Rafi KC


- Original Message -
From: "Raghavendra Gowdappa" 
To: "Sunny Kumar" , "Rafi" 
Cc: "Gluster Devel" 
Sent: Thursday, August 2, 2018 3:23:00 PM
Subject: Re: 
./tests/bugs/snapshot/bug-1167580-set-proper-uid-and-gid-during-nfs-access.t 
fails if non-anonymous fds are used in read path

I've filed  a bug to track this failure:
https://bugzilla.redhat.com/show_bug.cgi?id=1611532

As a stop gap measure I propose to mark the test as Bad to unblock patches
[1][2]. Are maintainers of snapshot in agreement with this?

regards,
Raghavendra

On Wed, Aug 1, 2018 at 10:28 AM, Raghavendra Gowdappa 
wrote:

> Sunny/Rafi,
>
> I was trying to debug regression failures on [1]. Note that patch [1] only
> disables usage of anonymous fds on readv. So, I tried the same test
> disabling performance.open-behind
>
> [root@rhs-client27 glusterfs]# git diff
> diff --git a/tests/bugs/snapshot/bug-1167580-set-proper-uid-and-gid-during-nfs-access.t b/tests/bugs/snapshot/bug-1167580-set-proper-uid-and-gid-during-nfs-access.t
> index 3776451..cedf96b 100644
> --- a/tests/bugs/snapshot/bug-1167580-set-proper-uid-and-gid-during-nfs-access.t
> +++ b/tests/bugs/snapshot/bug-1167580-set-proper-uid-and-gid-during-nfs-access.t
> @@ -79,6 +79,7 @@ TEST $CLI volume start $V0
>  EXPECT_WITHIN $NFS_EXPORT_TIMEOUT "1" is_nfs_export_available
>  TEST glusterfs -s $H0 --volfile-id $V0 $M0
>  TEST mount_nfs $H0:/$V0 $N0 nolock
> +TEST $CLI volume set $V0 performance.open-behind off
>
>  # Create 2 user
>  user1=$(get_new_user)
>
>
> With the above change, I can see consistent failures of the test just like
> observed in [1].
>
> TEST 23 (line 154): Y check_if_permitted eeefadc
> /mnt/glusterfs/0/.snaps/snap2/file3 cat
> su: warning: cannot change directory to /tmp/tmp.eaKBKS0lfM/eeefadc: No
> such file or directory
> cat: /mnt/glusterfs/0/.snaps/snap2/file3: Permission denied
> su: warning: cannot change directory to /tmp/tmp.eaKBKS0lfM/eeefadc: No
> such file or directory
> cat: /mnt/glusterfs/0/.snaps/snap2/file3: Permission denied
> su: warning: cannot change directory to /tmp/tmp.eaKBKS0lfM/eeefadc: No
> such file or directory
> cat: /mnt/glusterfs/0/.snaps/snap2/file3: Permission denied
> su: warning: cannot change directory to /tmp/tmp.eaKBKS0lfM/eeefadc: No
> such file or directory
> cat: /mnt/glusterfs/0/.snaps/snap2/file3: Permission denied
>
>
> Test Summary Report
> ---
> ./tests/bugs/snapshot/bug-1167580-set-proper-uid-and-gid-during-nfs-access.t
> (Wstat: 0 Tests: 46 Failed: 1)
>   Failed test:  23
>
>
> I had a feeling this test fails spuriously and the spurious nature is tied
> with whether open-behind uses an anonymous fd or a regular fd for read.
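
If the spurious behaviour really does track whether open-behind hands out an
anonymous fd for the read, two knobs could help pin that down. A hedged sketch
in the style of the diff above, assuming both options are still available:

  # bypass open-behind entirely, so every read is preceded by a real open
  TEST $CLI volume set $V0 performance.open-behind off
  # or keep open-behind but make it perform the open before serving reads
  TEST $CLI volume set $V0 performance.read-after-open yes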
>
> @Sunny,
>
> This test is blocking two of my patches - [1] and [2]. Can I mark this
> test as bad and proceed with my work on [1] and [2]?
>
> [1] https://review.gluster.org/20511
> [2] https://review.gluster.org/20428
>
> regards,
> Raghavendra
>


Re: [Gluster-devel] ./tests/bugs/snapshot/bug-1167580-set-proper-uid-and-gid-during-nfs-access.t fails if non-anonymous fds are used in read path

2018-08-02 Thread Raghavendra Gowdappa
I've filed  a bug to track this failure:
https://bugzilla.redhat.com/show_bug.cgi?id=1611532

As a stop gap measure I propose to mark the test as Bad to unblock patches
[1][2]. Are maintainers of snapshot in agreement with this?

regards,
Raghavendra

On Wed, Aug 1, 2018 at 10:28 AM, Raghavendra Gowdappa 
wrote:

> Sunny/Rafi,
>
> I was trying to debug regression failures on [1]. Note that patch [1] only
> disables usage of anonymous fds on readv. So, I tried the same test
> disabling performance.open-behind
>
> [root@rhs-client27 glusterfs]# git diff
> diff --git a/tests/bugs/snapshot/bug-1167580-set-proper-uid-and-gid-during-nfs-access.t b/tests/bugs/snapshot/bug-1167580-set-proper-uid-and-gid-during-nfs-access.t
> index 3776451..cedf96b 100644
> --- a/tests/bugs/snapshot/bug-1167580-set-proper-uid-and-gid-during-nfs-access.t
> +++ b/tests/bugs/snapshot/bug-1167580-set-proper-uid-and-gid-during-nfs-access.t
> @@ -79,6 +79,7 @@ TEST $CLI volume start $V0
>  EXPECT_WITHIN $NFS_EXPORT_TIMEOUT "1" is_nfs_export_available
>  TEST glusterfs -s $H0 --volfile-id $V0 $M0
>  TEST mount_nfs $H0:/$V0 $N0 nolock
> +TEST $CLI volume set $V0 performance.open-behind off
>
>  # Create 2 user
>  user1=$(get_new_user)
>
>
> With the above change, I can see consistent failures of the test just like
> observed in [1].
>
> TEST 23 (line 154): Y check_if_permitted eeefadc
> /mnt/glusterfs/0/.snaps/snap2/file3 cat
> su: warning: cannot change directory to /tmp/tmp.eaKBKS0lfM/eeefadc: No
> such file or directory
> cat: /mnt/glusterfs/0/.snaps/snap2/file3: Permission denied
> su: warning: cannot change directory to /tmp/tmp.eaKBKS0lfM/eeefadc: No
> such file or directory
> cat: /mnt/glusterfs/0/.snaps/snap2/file3: Permission denied
> su: warning: cannot change directory to /tmp/tmp.eaKBKS0lfM/eeefadc: No
> such file or directory
> cat: /mnt/glusterfs/0/.snaps/snap2/file3: Permission denied
> su: warning: cannot change directory to /tmp/tmp.eaKBKS0lfM/eeefadc: No
> such file or directory
> cat: /mnt/glusterfs/0/.snaps/snap2/file3: Permission denied
>
>
> Test Summary Report
> ---
> ./tests/bugs/snapshot/bug-1167580-set-proper-uid-and-gid-during-nfs-access.t
> (Wstat: 0 Tests: 46 Failed: 1)
>   Failed test:  23
>
>
> I had a feeling this test fails spuriously and the spurious nature is tied
> with whether open-behind uses an anonymous fd or a regular fd for read.
>
> @Sunny,
>
> This test is blocking two of my patches - [1] and [2]. Can I mark this
> test as bad and proceed with my work on [1] and [2]?
>
> [1] https://review.gluster.org/20511
> [2] https://review.gluster.org/20428
>
> regards,
> Raghavendra
>

Re: [Gluster-devel] FreeBSD smoke test may fail for older changes, rebase needed

2018-08-02 Thread Nigel Babu
> That is fine with me. It is prepared for GlusterFS 5, so nothing needs
> to be done for that. Only for 4.1 and 3.12 FreeBSD needs to be disabled
> from the smoke job(s).
>
> I could not find the repo that contains the smoke job, otherwise I would
> have tried to send a PR.
>
> Niels
>

For future reference, any "production" job that's on build.gluster.org will
have a corresponding job on build-jobs[1] on review.gluster.org. This has
been announced in the past and non-CI team members have sent us patches and
new jobs. There may be some jobs that do not have a corresponding yml file;
this is most likely because they're WIP or not production-ready.

[1] http://git.gluster.org/cgit/build-jobs.git/

-- 
nigelb

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Kotresh Hiremath Ravishankar
On Thu, Aug 2, 2018 at 11:43 AM, Xavi Hernandez 
wrote:

> On Thu, Aug 2, 2018 at 6:14 AM Atin Mukherjee  wrote:
>
>>
>>
>> On Tue, Jul 31, 2018 at 10:11 PM Atin Mukherjee 
>> wrote:
>>
>>> I just went through the nightly regression report of brick mux runs and
>>> here's what I can summarize.
>>>
>>> 
>>> 
>>> =
>>> Fails only with brick-mux
>>> 
>>> 
>>> =
>>> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
>>> 400 secs. Refer https://fstat.gluster.org/failure/209?state=2_
>>> date=2018-06-30_date=2018-07-31=all, specifically the latest
>>> report https://build.gluster.org/job/regression-test-burn-in/4051/
>>> consoleText . Wasn't timing out as frequently as it was till 12 July.
>>> But since 27 July, it has timed out twice. Beginning to believe commit
>>> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now
>>> 400 secs isn't sufficient enough (Mohit?)
>>>
>>> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
>>> (Ref - https://build.gluster.org/job/regression-test-with-
>>> multiplex/814/console) -  Test fails only in brick-mux mode, AI on Atin
>>> to look at and get back.
>>>
>>> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
>>> https://build.gluster.org/job/regression-test-with-multiplex/813/console)
>>> - Seems like failed just twice in last 30 days as per
>>> https://fstat.gluster.org/failure/251?state=2_
>>> date=2018-06-30_date=2018-07-31=all. Need help from AFR team.
>>>
>>> tests/bugs/quota/bug-1293601.t (https://build.gluster.org/
>>> job/regression-test-with-multiplex/812/console) - Hasn't failed after
>>> 26 July and earlier it was failing regularly. Did we fix this test through
>>> any patch (Mohit?)
>>>
>>> tests/bitrot/bug-1373520.t - (https://build.gluster.org/
>>> job/regression-test-with-multiplex/811/console)  - Hasn't failed after
>>> 27 July and earlier it was failing regularly. Did we fix this test through
>>> any patch (Mohit?)
>>>
>>
>> I see this has failed in day before yesterday's regression run as well
>> (and I could reproduce it locally with brick mux enabled). The test fails
>> in healing a file within a particular time period.
>>
>> 15:55:19 not ok 25 Got "0" instead of "512", LINENUM:55
>> 15:55:19 FAILED COMMAND: 512 path_size /d/backends/patchy5/FILE1
>>
>> Need EC dev's help here.
>>
>
> I'll investigate this.
>
>
>>
>>
>>> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core,
>>> not sure if related to brick mux or not, so not sure if brick mux is
>>> culprit here or not. Ref - https://build.gluster.org/job/
>>> regression-test-with-multiplex/806/console . Seems to be a glustershd
>>> crash. Need help from AFR folks.
>>>
>>> 
>>> 
>>> =
>>> Fails for non-brick mux case too
>>> 
>>> 
>>> =
>>> tests/bugs/distribute/bug-1122443.t 0 Seems to be failing at my setup
>>> very often, with out brick mux as well. Refer
>>> https://build.gluster.org/job/regression-test-burn-in/4050/consoleText
>>> . There's an email in gluster-devel and a BZ 1610240 for the same.
>>>
>>> tests/bugs/bug-1368312.t - Seems to be recent failures (
>>> https://build.gluster.org/job/regression-test-with-multiplex/815/console)
>>> - seems to be a new failure, however seen this for a non-brick-mux case too
>>> - https://build.gluster.org/job/regression-test-burn-in/4039/consoleText
>>> . Need some eyes from AFR folks.
>>>
>>> tests/00-geo-rep/georep-basic-dr-tarssh.t - this isn't specific to
>>> brick mux, have seen this failing at multiple default regression runs.
>>> Refer https://fstat.gluster.org/failure/392?state=2_
>>> date=2018-06-30_date=2018-07-31=all . We need help from
>>> geo-rep dev to root cause this earlier than later
>>>
>>> tests/00-geo-rep/georep-basic-dr-rsync.t - this isn't specific to brick
>>> mux, have seen this failing at multiple default regression runs. Refer
>>> https://fstat.gluster.org/failure/393?state=2_
>>> date=2018-06-30_date=2018-07-31=all . We need help from
>>> geo-rep dev to root cause this earlier than later
>>>
>>
I have posted patch [1] for the above two. This should handle the connection
time outs without any logs. But I still see a strange behaviour now and then
where one of the workers doesn't get started at all. I am debugging that
with an instrumentation patch [2]. I am not hitting that on this

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Xavi Hernandez
On Thu, Aug 2, 2018 at 6:14 AM Atin Mukherjee  wrote:

>
>
> On Tue, Jul 31, 2018 at 10:11 PM Atin Mukherjee 
> wrote:
>
>> I just went through the nightly regression report of brick mux runs and
>> here's what I can summarize.
>>
>>
>> =
>> Fails only with brick-mux
>>
>> =
>> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
>> 400 secs. Refer
>> https://fstat.gluster.org/failure/209?state=2_date=2018-06-30_date=2018-07-31=all,
>> specifically the latest report
>> https://build.gluster.org/job/regression-test-burn-in/4051/consoleText .
>> Wasn't timing out as frequently as it was till 12 July. But since 27 July,
>> it has timed out twice. Beginning to believe commit
>> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400
>> secs isn't sufficient enough (Mohit?)
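
If the test is now genuinely slower rather than hung, one stop-gap could be to
raise its per-script timeout. A hedged sketch, assuming run-tests.sh still
honours a SCRIPT_TIMEOUT override placed near the top of the .t file:

  # tests/bugs/core/bug-1432542-mpx-restart-crash.t
  SCRIPT_TIMEOUT=800    # up from the 400-second default mentioned above

  . $(dirname $0)/../../include.rc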
>>
>> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
>> (Ref -
>> https://build.gluster.org/job/regression-test-with-multiplex/814/console)
>> -  Test fails only in brick-mux mode, AI on Atin to look at and get back.
>>
>> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
>> https://build.gluster.org/job/regression-test-with-multiplex/813/console)
>> - Seems like failed just twice in last 30 days as per
>> https://fstat.gluster.org/failure/251?state=2_date=2018-06-30_date=2018-07-31=all.
>> Need help from AFR team.
>>
>> tests/bugs/quota/bug-1293601.t (
>> https://build.gluster.org/job/regression-test-with-multiplex/812/console)
>> - Hasn't failed after 26 July and earlier it was failing regularly. Did we
>> fix this test through any patch (Mohit?)
>>
>> tests/bitrot/bug-1373520.t - (
>> https://build.gluster.org/job/regression-test-with-multiplex/811/console)
>> - Hasn't failed after 27 July and earlier it was failing regularly. Did we
>> fix this test through any patch (Mohit?)
>>
>
> I see this has failed in day before yesterday's regression run as well
> (and I could reproduce it locally with brick mux enabled). The test fails
> in healing a file within a particular time period.
>
> 15:55:19 not ok 25 Got "0" instead of "512", LINENUM:55
> 15:55:19 FAILED COMMAND: 512 path_size /d/backends/patchy5/FILE1
>
> Need EC dev's help here.
>

I'll investigate this.


>
>
>> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core,
>> not sure if related to brick mux or not, so not sure if brick mux is
>> culprit here or not. Ref -
>> https://build.gluster.org/job/regression-test-with-multiplex/806/console
>> . Seems to be a glustershd crash. Need help from AFR folks.
>>
>>
>> =
>> Fails for non-brick mux case too
>>
>> =
>> tests/bugs/distribute/bug-1122443.t 0 Seems to be failing at my setup
>> very often, with out brick mux as well. Refer
>> https://build.gluster.org/job/regression-test-burn-in/4050/consoleText .
>> There's an email in gluster-devel and a BZ 1610240 for the same.
>>
>> tests/bugs/bug-1368312.t - Seems to be recent failures (
>> https://build.gluster.org/job/regression-test-with-multiplex/815/console)
>> - seems to be a new failure, however seen this for a non-brick-mux case too
>> - https://build.gluster.org/job/regression-test-burn-in/4039/consoleText
>> . Need some eyes from AFR folks.
>>
>> tests/00-geo-rep/georep-basic-dr-tarssh.t - this isn't specific to brick
>> mux, have seen this failing at multiple default regression runs. Refer
>> https://fstat.gluster.org/failure/392?state=2_date=2018-06-30_date=2018-07-31=all
>> . We need help from geo-rep dev to root cause this earlier than later
>>
>> tests/00-geo-rep/georep-basic-dr-rsync.t - this isn't specific to brick
>> mux, have seen this failing at multiple default regression runs. Refer
>> https://fstat.gluster.org/failure/393?state=2_date=2018-06-30_date=2018-07-31=all
>> . We need help from geo-rep dev to root cause this earlier than later
>>
>> tests/bugs/glusterd/validating-server-quorum.t (
>> https://build.gluster.org/job/regression-test-with-multiplex/810/console)
>> - Fails for non-brick-mux cases too,
>> https://fstat.gluster.org/failure/580?state=2_date=2018-06-30_date=2018-07-31=all
>> .  Atin has a patch https://review.gluster.org/20584 which resolves it
>> but patch is failing regression for a different test which is unrelated.
>>
>>