Re: [Gluster-devel] Status update : Brick Mux threads reduction

2018-10-03 Thread Atin Mukherjee
I have rebased [1] and triggered the brick-mux regression, since we fixed one
genuine snapshot test failure in brick mux through
https://review.gluster.org/#/c/glusterfs/+/21314/, which was merged today.

On Thu, Oct 4, 2018 at 10:39 AM Poornima Gurusiddaiah wrote:

> Hi,
>
> For each brick, we create at least 20+ threads; hence in a brick mux use
> case, where we load multiple bricks into the same process, there will be
> hundreds of threads, resulting in performance issues and increased memory usage.
>
> IO-threads: Make it global to the process, and ref count the resource.
> Patch [1] has failures in brick mux regression, likely not related to the
> patch; we need to get it to pass.
>
> Posix threads (Janitor, Helper, Fsyncer): instead of using one thread per
> task, use the synctask framework; in the future, use the thread pool from
> patch [2]. Patches are posted [1], fixing some regression failures.
>
> Posix and bitrot aio-thread: this thread cannot simply be replaced with
> synctask/thread pool, as there must be no delay in receiving notifications
> and acting on them. Hence, create a global AIO event receiver thread for the
> process. This is WIP and is not yet posted upstream.
>
> Threads in changelog/bitrot xlators: Mohit posted a patch so that, by
> default, the xlator does not need to start a thread if the xlator is not
> enabled: https://review.gluster.org/#/c/glusterfs/+/21304/ (it can save 6
> threads per brick with the default options).
>
> Pending: create a build with these patches, run perf tests with them, and
> analyze the results.
>
>
> [1] https://review.gluster.org/#/c/glusterfs/+/20761/
> [2] https://review.gluster.org/#/c/glusterfs/+/20636/
>
> Regards,
> Poornima
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Status update : Brick Mux threads reduction

2018-10-03 Thread Poornima Gurusiddaiah
Hi,

For each brick, we create at least 20+ threads; hence in a brick mux use
case, where we load multiple bricks into the same process, there will be
hundreds of threads, resulting in performance issues and increased memory usage.

IO-threads: Make it global to the process, and ref count the resource.
Patch [1] has failures in brick mux regression, likely not related to the
patch; we need to get it to pass.
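
For illustration, here is a minimal sketch of the ref-counting idea in plain
pthreads; it is not the io-threads code, and every name in it (shared_pool_t,
pool_ref, pool_unref) is an assumption made up for the example. The first
brick attached to the process creates the shared pool, later bricks only take
a reference, and the pool is torn down when the last brick detaches:

    /* Illustrative only: a process-global, ref-counted worker pool. */
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdlib.h>

    typedef struct {
        pthread_t  *workers;
        int         nworkers;
        int         refcount;   /* number of bricks using the pool */
        atomic_int  stop;
    } shared_pool_t;

    static shared_pool_t  *global_pool;
    static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker_fn(void *arg)
    {
        shared_pool_t *pool = arg;
        while (!atomic_load(&pool->stop)) {
            /* ...pick a request off a shared queue and service it... */
            sched_yield();
        }
        return NULL;
    }

    /* Called when a brick is attached to the process. */
    shared_pool_t *pool_ref(int nworkers)
    {
        pthread_mutex_lock(&global_lock);
        if (!global_pool) {
            global_pool = calloc(1, sizeof(*global_pool));
            global_pool->workers = calloc(nworkers, sizeof(pthread_t));
            global_pool->nworkers = nworkers;
            atomic_init(&global_pool->stop, 0);
            for (int i = 0; i < nworkers; i++)
                pthread_create(&global_pool->workers[i], NULL,
                               worker_fn, global_pool);
        }
        global_pool->refcount++;        /* one reference per attached brick */
        shared_pool_t *pool = global_pool;
        pthread_mutex_unlock(&global_lock);
        return pool;
    }

    /* Called when a brick is detached; the last reference tears it down. */
    void pool_unref(shared_pool_t *pool)
    {
        pthread_mutex_lock(&global_lock);
        if (--pool->refcount == 0) {
            atomic_store(&pool->stop, 1);
            for (int i = 0; i < pool->nworkers; i++)
                pthread_join(pool->workers[i], NULL);
            free(pool->workers);
            free(pool);
            global_pool = NULL;
        }
        pthread_mutex_unlock(&global_lock);
    }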

Posix threads (Janitor, Helper, Fsyncer): instead of using one thread per
task, use the synctask framework; in the future, use the thread pool from
patch [2]. Patches are posted [1], fixing some regression failures.
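
To show the general direction (one shared runner for the whole process
instead of a janitor/helper/fsyncer thread per brick), here is a small
self-contained pthreads sketch; the real change would go through GlusterFS's
synctask framework, and the names below (submit_task, task_runner) are
invented for the example:

    #include <pthread.h>
    #include <stdlib.h>

    typedef void (*task_fn)(void *data);

    typedef struct task {
        task_fn      fn;
        void        *data;
        struct task *next;
    } task_t;

    static task_t         *queue_head;
    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;

    /* Any brick enqueues its janitor/helper/fsync work here instead of
     * owning a dedicated thread for each of those tasks. */
    void submit_task(task_fn fn, void *data)
    {
        task_t *t = calloc(1, sizeof(*t));
        t->fn = fn;
        t->data = data;
        pthread_mutex_lock(&queue_lock);
        t->next = queue_head;
        queue_head = t;
        pthread_cond_signal(&queue_cond);
        pthread_mutex_unlock(&queue_lock);
    }

    /* A single runner thread for the whole process drains the queue. */
    void *task_runner(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&queue_lock);
            while (!queue_head)
                pthread_cond_wait(&queue_cond, &queue_lock);
            task_t *t = queue_head;
            queue_head = t->next;
            pthread_mutex_unlock(&queue_lock);
            t->fn(t->data);       /* run the task outside the lock */
            free(t);
        }
        return NULL;
    }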

Posix and bitrot aio-thread: this thread cannot simply be replaced with
synctask/thread pool, as there must be no delay in receiving notifications
and acting on them. Hence, create a global AIO event receiver thread for the
process. This is WIP and is not yet posted upstream.
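
A minimal sketch of that idea, assuming Linux AIO via libaio (link with
-laio); the structure and names are illustrative, not the WIP code. Each
brick would submit its iocbs with iocb->data pointing at its own context, so
a single receiver thread can dispatch completions back to the right brick:

    #include <libaio.h>
    #include <pthread.h>
    #include <stdio.h>

    #define MAX_EVENTS 64

    typedef struct {
        int   brick_id;
        void (*on_complete)(int brick_id, struct io_event *ev);
    } brick_ctx_t;

    static io_context_t shared_ctx;  /* one kernel AIO context per process */

    /* One receiver for the whole process: waits for completions and hands
     * each one back to the brick that issued it (via iocb->data). */
    static void *aio_receiver(void *arg)
    {
        struct io_event events[MAX_EVENTS];

        (void)arg;
        for (;;) {
            int n = io_getevents(shared_ctx, 1, MAX_EVENTS, events, NULL);
            if (n < 0) {
                fprintf(stderr, "io_getevents failed: %d\n", n);
                continue;
            }
            for (int i = 0; i < n; i++) {
                brick_ctx_t *brick = events[i].data;
                brick->on_complete(brick->brick_id, &events[i]);
            }
        }
        return NULL;
    }

    /* Start one receiver thread for the whole brick process. */
    int start_shared_aio_receiver(void)
    {
        pthread_t tid;

        if (io_setup(MAX_EVENTS, &shared_ctx) != 0)
            return -1;
        return pthread_create(&tid, NULL, aio_receiver, NULL);
    }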

Threads in changelog/bitrot xlators: Mohit posted a patch so that, by
default, the xlator does not need to start a thread if the xlator is not
enabled: https://review.gluster.org/#/c/glusterfs/+/21304/ (it can save 6
threads per brick with the default options).
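
The underlying pattern is simply to gate thread creation on the option; a
hedged, self-contained sketch (the structure and names here are made up for
the example, not taken from the patch):

    #include <pthread.h>
    #include <stdio.h>

    typedef struct {
        int       enabled;  /* e.g. the changelog/bitrot on/off volume option */
        pthread_t worker;
    } feature_conf_t;

    static void *feature_worker(void *arg)
    {
        (void)arg;
        /* ...the feature's periodic work... */
        return NULL;
    }

    /* Only spawn the feature's threads when the option is actually enabled;
     * with the default (off), the brick never pays for those threads. */
    int feature_init(feature_conf_t *conf)
    {
        if (!conf->enabled) {
            printf("feature disabled: not starting worker threads\n");
            return 0;
        }
        return pthread_create(&conf->worker, NULL, feature_worker, conf);
    }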

Pending: create a build with these patches, run perf tests with them, and
analyze the results.


[1] https://review.gluster.org/#/c/glusterfs/+/20761/
[2] https://review.gluster.org/#/c/glusterfs/+/20636/

Regards,
Poornima
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] POC- Distributed regression testing framework

2018-10-03 Thread Sanju Rakonde
On Wed, Oct 3, 2018 at 3:26 PM Deepshikha Khandelwal wrote:

> Hello folks,
>
> Distributed-regression job[1] is now a part of Gluster's
> nightly-master build pipeline. The following are the issues we have
> resolved since we started working on this:
>
> 1) Collecting gluster logs from servers.
> 2) Tests that failed due to infra-related issues have been fixed.
> 3) Time taken to run regression testing reduced to ~50-60 minutes.
>
> To get the time down to 40 minutes, we need your help!
>
> Currently, there is a test that is failing:
>
> tests/bugs/glusterd/optimized-basic-testcases-in-cluster.t
>
> This needs fixing first.
>

Where can I get the logs for this test case? In
https://build.gluster.org/job/distributed-regression/264/console I see that
this test case failed and was re-attempted, but I couldn't find the logs.

>
> There's a test that takes 14 minutes to complete -
> `tests/bugs/index/bug-1559004-EMLINK-handling.t`. A single test taking
> 14 minutes is not something we can distribute. Can we look at how we
> can speed this up[2]? When this test fails, it is re-attempted,
> further increasing the time. This happens in the regular
> centos7-regression job as well.
>
> If you see any other issues, please file a bug[3].
>
> [1]: https://build.gluster.org/job/distributed-regression
> [2]: https://build.gluster.org/job/distributed-regression/264/console
> [3]:
> https://bugzilla.redhat.com/enter_bug.cgi?product=glusterfs&component=project-infrastructure
>
> Thanks,
> Deepshikha Khandelwal
> On Tue, Jun 26, 2018 at 9:02 AM Nigel Babu  wrote:
> >
> >
> >
> > On Mon, Jun 25, 2018 at 7:28 PM Amar Tumballi wrote:
> >>
> >>
> >>
> >>> There are currently a few known issues:
> >>> * Not collecting the entire logs (/var/log/glusterfs) from servers.
> >>
> >>
> >> If I look at the activities involved with regression failures, this can
> wait.
> >
> >
> > Well, we can't debug the current failures without having the logs. So
> this has to be fixed first.
> >
> >>
> >>
> >>>
> >>> * A few tests fail due to infra-related issues like geo-rep tests.
> >>
> >>
> >> Please open bugs for this, so we can track them and take them to closure.
> >
> >
> > These are failing due to infra reasons, most likely subtle differences
> in the setup of these nodes vs. our normal nodes. We'll only be able to
> debug them once we get the logs. I know the geo-rep ones are easy to fix:
> the playbook for setting up geo-rep correctly just didn't make it over to
> the playbook used for these images.
> >
> >>
> >>
> >>>
> >>> * Takes ~80 minutes with 7 distributed servers (targeting 60 minutes)
> >>
> >>
> >> The time can change as more tests are added; also, please plan to support
> any number of servers, from 1 to n.
> >
> >
> > While n is configurable, it will be fixed to a single-digit number for
> now. We will need to place *some* limitation somewhere or else we'll end up
> not being able to control our cloud bills.
> >
> >>
> >>
> >>>
> >>> * We've only tested plain regressions. ASAN and Valgrind are currently
> untested.
> >>
> >>
> >> It would be great to have it running not 'per patch', but nightly, or
> weekly to start with.
> >
> >
> > This is currently not targeted until we phase out current regressions.
> >
> >>>
> >>>
> >>> Before bringing it into production, we'll run this job nightly and
> >>> watch it for a month to debug the other failures.
> >>>
> >>
> >> I would say, bring it to production sooner, say in 2 weeks, and also plan
> to keep the current regression as is, with a special command like 'run
> regression in-one-machine' in gerrit (or something similar) with voting
> rights, so we can fall back to this method if something is broken in
> parallel testing.
> >>
> >> I have seen that regardless of the amount of time we put into testing
> scripts, the day we move to production something will be broken. So let that
> happen earlier rather than later, so it helps the next release branch out.
> We don't want to be stuck at branching due to infra failures.
> >
> >
> > Having two regression jobs that can vote is going to cause more confusion
> than it's worth. There are a couple of intermittent memory issues with the
> test script that we need to debug and fix before I'm comfortable making this
> job a voting job. We've worked around these problems for now, but they still
> pop up now and again. The fact that things break often is not an excuse to
> skip preventing avoidable failures. The one-month timeline was chosen with
> all these factors taken into consideration. The 2-week timeline is a no-go
> at this point.
> >
> > When we are ready to make the switch, we won't be switching 100% of the
> job. We'll start with a sliding scale so that we can monitor failures and
> machine creation adequately.
> >
> > --
> > nigelb
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>


-- 
Thanks,
Sanju
___
Gluster-devel mailin

Re: [Gluster-devel] [Gluster-users] KVM lockups on Gluster 4.1.1

2018-10-03 Thread Dmitry Melekhov

On 02.10.2018 12:59, Amar Tumballi wrote:
Recently, in one situation, we found that locks were not freed up because
the TCP connection never timed out.


Can you try the option below and let us know?

`gluster volume set $volname tcp-user-timeout 42`

(ref: https://review.gluster.org/21170/ )

Regards,
Amar



Thank you, we'll try this.



On Tue, Oct 2, 2018 at 10:40 AM Dmitry Melekhov wrote:


On 01.10.2018 23:09, Danny Lee wrote:

Ran into this issue too with 4.1.5 with an arbiter setup.  Also
could not run a statedump due to "Segmentation fault".

Tried with 3.12.13 and had issues with locked files as well.  We
were able to do a statedump and found that some of our files were
"BLOCKED" (xlator.features.locks.vol-locks.inode).  Attached part
of statedump.

Also tried clearing the locks using clear-locks, which did remove
the lock, but as soon as I tried to cat the file, it got locked
again and the cat process hung.


I created an issue in Bugzilla, though I can't find it now :-(
It looks like there has been no activity since I sent all the logs...




On Wed, Aug 29, 2018, 3:13 AM Dmitry Melekhov <d...@belkam.com> wrote:

On 28.08.2018 10:43, Amar Tumballi wrote:



On Tue, Aug 28, 2018 at 11:24 AM, Dmitry Melekhov
<d...@belkam.com> wrote:

Hello!


Yesterday we hit something like this on 4.1.2

Centos 7.5.


Volume is replicated - two bricks and one arbiter.


We rebooted the arbiter, waited for the heal to finish, and
tried to live-migrate a VM to another node (we run VMs on
the gluster nodes):


[2018-08-27 09:56:22.085411] I [MSGID: 115029]
[server-handshake.c:763:server_setvolume] 0-pool-server:
accepted client from

CTX_ID:b55f4a90-e241-48ce-bd4d-268c8a956f4a-GRAPH_ID:0-PID:8887-HOST:son-PC_NAME:pool-
client-6-RECON_NO:-0 (version: 4.1.2)
[2018-08-27 09:56:22.107609] I [MSGID: 115036]
[server.c:483:server_rpc_notify] 0-pool-server:
disconnecting connection from

CTX_ID:b55f4a90-e241-48ce-bd4d-268c8a956f4a-GRAPH_ID:0-PID:8887-HOST:son-PC_NAME:pool-
client-6-RECON_NO:-0
[2018-08-27 09:56:22.107747] I [MSGID: 101055]
[client_t.c:444:gf_client_unref] 0-pool-server: Shutting
down connection

CTX_ID:b55f4a90-e241-48ce-bd4d-268c8a956f4a-GRAPH_ID:0-PID:8887-HOST:son-PC_NAME:pool-clien
t-6-RECON_NO:-0
[2018-08-27 09:58:37.905829] I [MSGID: 115036]
[server.c:483:server_rpc_notify] 0-pool-server:
disconnecting connection from

CTX_ID:c3eb6cfc-2ef9-470a-89d1-a87170d00da5-GRAPH_ID:0-PID:30292-HOST:father-PC_NAME:p
ool-client-6-RECON_NO:-0
[2018-08-27 09:58:37.905926] W
[inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server:
releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318
held by {client=0x7ffb58035bc0, pid=30292
lk-owner=28c831d8bc55}
[2018-08-27 09:58:37.905959] W
[inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server:
releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318
held by {client=0x7ffb58035bc0, pid=30292
lk-owner=2870a7d6bc55}
[2018-08-27 09:58:37.905979] W
[inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server:
releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318
held by {client=0x7ffb58035bc0, pid=30292
lk-owner=2880a7d6bc55}
[2018-08-27 09:58:37.905997] W
[inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server:
releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318
held by {client=0x7ffb58035bc0, pid=30292
lk-owner=28f031d8bc55}
[2018-08-27 09:58:37.906016] W
[inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server:
releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318
held by {client=0x7ffb58035bc0, pid=30292
lk-owner=28b07dd5bc55}
[2018-08-27 09:58:37.906034] W
[inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server:
releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318
held by {client=0x7ffb58035bc0, pid=30292
lk-owner=28e0a7d6bc55}
[2018-08-27 09:58:37.906056] W
[inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server:
releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318
held by {client=0x7ffb58035bc0, pid=30292
lk-owner=28b845d8bc55}
[2018-08-27 09:58:37.906079] W
[inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server:
releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318
held by {client=0x7ffb58035bc0, pid=30292
l

Re: [Gluster-devel] [Gluster-users] KVM lockups on Gluster 4.1.1

2018-10-03 Thread Dmitry Melekhov


It doesn't work for some reason:

 gluster volume set pool tcp-user-timeout 42
volume set: failed: option : tcp-user-timeout does not exist
Did you mean tcp-user-timeout?


4.1.5.



On 03.10.2018 08:30, Dmitry Melekhov wrote:

On 02.10.2018 12:59, Amar Tumballi wrote:
Recently, in one situation, we found that locks were not freed up because
the TCP connection never timed out.


Can you try the option below and let us know?

`gluster volume set $volname tcp-user-timeout 42`

(ref: https://review.gluster.org/21170/ )

Regards,
Amar



Thank you, we'll try this.



On Tue, Oct 2, 2018 at 10:40 AM Dmitry Melekhov wrote:


On 01.10.2018 23:09, Danny Lee wrote:

Ran into this issue too with 4.1.5 with an arbiter setup.  Also
could not run a statedump due to "Segmentation fault".

Tried with 3.12.13 and had issues with locked files as well.  We
were able to do a statedump and found that some of our files
were "BLOCKED" (xlator.features.locks.vol-locks.inode). Attached
part of statedump.

Also tried clearing the locks using clear-locks, which did
remove the lock, but as soon as I tried to cat the file, it got
locked again and the cat process hung.


I created an issue in Bugzilla, though I can't find it now :-(
It looks like there has been no activity since I sent all the logs...




On Wed, Aug 29, 2018, 3:13 AM Dmitry Melekhov <d...@belkam.com> wrote:

On 28.08.2018 10:43, Amar Tumballi wrote:



On Tue, Aug 28, 2018 at 11:24 AM, Dmitry Melekhov
<d...@belkam.com> wrote:

Hello!


Yesterday we hit something like this on 4.1.2

Centos 7.5.


Volume is replicated - two bricks and one arbiter.


We rebooted the arbiter, waited for the heal to finish, and
tried to live-migrate a VM to another node (we run VMs on
the gluster nodes):


[2018-08-27 09:56:22.085411] I [MSGID: 115029]
[server-handshake.c:763:server_setvolume]
0-pool-server: accepted client from

CTX_ID:b55f4a90-e241-48ce-bd4d-268c8a956f4a-GRAPH_ID:0-PID:8887-HOST:son-PC_NAME:pool-
client-6-RECON_NO:-0 (version: 4.1.2)
[2018-08-27 09:56:22.107609] I [MSGID: 115036]
[server.c:483:server_rpc_notify] 0-pool-server:
disconnecting connection from

CTX_ID:b55f4a90-e241-48ce-bd4d-268c8a956f4a-GRAPH_ID:0-PID:8887-HOST:son-PC_NAME:pool-
client-6-RECON_NO:-0
[2018-08-27 09:56:22.107747] I [MSGID: 101055]
[client_t.c:444:gf_client_unref] 0-pool-server:
Shutting down connection

CTX_ID:b55f4a90-e241-48ce-bd4d-268c8a956f4a-GRAPH_ID:0-PID:8887-HOST:son-PC_NAME:pool-clien
t-6-RECON_NO:-0
[2018-08-27 09:58:37.905829] I [MSGID: 115036]
[server.c:483:server_rpc_notify] 0-pool-server:
disconnecting connection from

CTX_ID:c3eb6cfc-2ef9-470a-89d1-a87170d00da5-GRAPH_ID:0-PID:30292-HOST:father-PC_NAME:p
ool-client-6-RECON_NO:-0
[2018-08-27 09:58:37.905926] W
[inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server:
releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318
held by {client=0x7ffb58035bc0, pid=30292
lk-owner=28c831d8bc55}
[2018-08-27 09:58:37.905959] W
[inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server:
releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318
held by {client=0x7ffb58035bc0, pid=30292
lk-owner=2870a7d6bc55}
[2018-08-27 09:58:37.905979] W
[inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server:
releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318
held by {client=0x7ffb58035bc0, pid=30292
lk-owner=2880a7d6bc55}
[2018-08-27 09:58:37.905997] W
[inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server:
releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318
held by {client=0x7ffb58035bc0, pid=30292
lk-owner=28f031d8bc55}
[2018-08-27 09:58:37.906016] W
[inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server:
releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318
held by {client=0x7ffb58035bc0, pid=30292
lk-owner=28b07dd5bc55}
[2018-08-27 09:58:37.906034] W
[inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server:
releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318
held by {client=0x7ffb58035bc0, pid=30292
lk-owner=28e0a7d6bc55}
[2018-08-27 09:58:37.906056] W
[inodelk.c:610:pl_inodelk_log_cleanup] 0-pool-server:
releasing lock on 12172afe-f0a4-4e10-bc0f-c5e4e0d9f318
held by {client=0x7ffb58035bc0, pid=30292
lk-owner=28b845d8bc55}
[2018

Re: [Gluster-devel] [Gluster-users] KVM lockups on Gluster 4.1.1

2018-10-03 Thread Dmitry Melekhov

On 03.10.2018 10:10, Amar Tumballi wrote:

Sorry! I should have been more specific. I overlooked the option:

---
[root@localhost ~]# gluster volume set demo1 tcp-user-timeout 42
volume set: failed: option : tcp-user-timeout does not exist
Did you mean tcp-user-timeout?
[root@localhost ~]# gluster volume set demo1 *client.tcp-user-timeout* 42
volume set: success
[root@localhost ~]# gluster volume set demo1 *server.tcp-user-timeout* 42
volume set: success

It looks like you need to set the option separately on the client and server sides.



Thank you very much!
We have set this option and hope it will solve our problem.

___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Memory overwrites due to processing vol files???

2018-10-03 Thread FNU Raghavendra Manjunath
On Fri, Sep 28, 2018 at 4:01 PM Shyam Ranganathan wrote:

> We tested with ASAN and without the fix at [1], and it consistently
> crashes at the mdcache xlator when brick mux is enabled.
> On 09/28/2018 03:50 PM, FNU Raghavendra Manjunath wrote:
> >
> > I was looking into the issue, and this is what I could find while
> > working with Shyam.
> >
> > There are 2 things here.
> >
> > 1) The multiplexed brick process for the snapshot(s) getting the client
> > volfile (I suspect it happened when the restore operation was performed).
> > 2) Memory corruption happening while the multiplexed brick process is
> > building the graph (for the client volfile it got above).
> >
> > I have been able to reproduce the issue on my local computer once, when
> > I ran the test case tests/bugs/snapshot/bug-1275616.t.
> >
> > Upon comparison, we found that the backtrace of the core I got and that
> > of the core generated in the regression runs were similar.
> > In fact, the victim information Shyam mentioned before is also similar
> > in the core that I was able to get.
> >
> > On top of that, when the brick process was run with valgrind, it
> > reported the following memory corruption:
> >
> > ==31257== Conditional jump or move depends on uninitialised value(s)
> > ==31257==at 0x1A7D0564: mdc_xattr_list_populate (md-cache.c:3127)
> > ==31257==by 0x1A7D1903: mdc_init (md-cache.c:3486)
> > ==31257==by 0x4E62D41: __xlator_init (xlator.c:684)
> > ==31257==by 0x4E62E67: xlator_init (xlator.c:709)
> > ==31257==by 0x4EB2BEB: glusterfs_graph_init (graph.c:359)
> > ==31257==by 0x4EB37F8: glusterfs_graph_activate (graph.c:722)
> > ==31257==by 0x40AEC3: glusterfs_process_volfp (glusterfsd.c:2528)
> > ==31257==by 0x410868: mgmt_getspec_cbk (glusterfsd-mgmt.c:2076)
> > ==31257==by 0x518408D: rpc_clnt_handle_reply (rpc-clnt.c:755)
> > ==31257==by 0x51845C1: rpc_clnt_notify (rpc-clnt.c:923)
> > ==31257==by 0x518084E: rpc_transport_notify (rpc-transport.c:525)
> > ==31257==by 0x123273DF: socket_event_poll_in (socket.c:2504)
> > ==31257==  Uninitialised value was created by a heap allocation
> > ==31257==at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)
> > ==31257==by 0x4E9F58E: __gf_malloc (mem-pool.c:136)
> > ==31257==by 0x1A7D052A: mdc_xattr_list_populate (md-cache.c:3123)
> > ==31257==by 0x1A7D1903: mdc_init (md-cache.c:3486)
> > ==31257==by 0x4E62D41: __xlator_init (xlator.c:684)
> > ==31257==by 0x4E62E67: xlator_init (xlator.c:709)
> > ==31257==by 0x4EB2BEB: glusterfs_graph_init (graph.c:359)
> > ==31257==by 0x4EB37F8: glusterfs_graph_activate (graph.c:722)
> > ==31257==by 0x40AEC3: glusterfs_process_volfp (glusterfsd.c:2528)
> > ==31257==by 0x410868: mgmt_getspec_cbk (glusterfsd-mgmt.c:2076)
> > ==31257==by 0x518408D: rpc_clnt_handle_reply (rpc-clnt.c:755)
> > ==31257==by 0x51845C1: rpc_clnt_notify (rpc-clnt.c:923)
> >
> > Based on the above observations, I think the patch below, by Shyam,
> > should fix the crash.
>
> [1]
>
> > https://review.gluster.org/#/c/glusterfs/+/21299/
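
For context, the valgrind report above is the classic case of reading bytes
of a malloc()ed buffer that were never written (for example, a missing
terminator on a string list). A minimal, self-contained illustration of that
class of bug and the usual fix follows; this is not the actual md-cache
patch, and the names are invented for the example:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Buggy shape: the trailing NUL is never written, so a later strlen()
     * or printf() on the buffer reads uninitialised heap bytes -- the kind
     * of "conditional jump depends on uninitialised value(s)" report that
     * valgrind prints above. */
    char *join_names_buggy(const char *a, const char *b)
    {
        size_t la = strlen(a), lb = strlen(b);
        char *buf = malloc(la + lb + 2);
        if (!buf)
            return NULL;
        memcpy(buf, a, la);
        buf[la] = ',';
        memcpy(buf + la + 1, b, lb);
        /* BUG: buf[la + 1 + lb] = '\0' is missing */
        return buf;
    }

    /* Usual fix: zero-fill the allocation (or write every byte explicitly),
     * so nothing uninitialised can ever be read. */
    char *join_names_fixed(const char *a, const char *b)
    {
        size_t len = strlen(a) + strlen(b) + 2;
        char *buf = calloc(1, len);          /* zero-filled, terminated */
        if (!buf)
            return NULL;
        snprintf(buf, len, "%s,%s", a, b);   /* bounded, NUL-terminated */
        return buf;
    }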
> >
> > But I am still trying to understand why a brick process should get a
> > client volfile (i.e. the 1st issue mentioned above).
> >
>

It was glusterd which was giving the client volfile instead of the brick
volfile.

The following patch has been submitted for review to address the cause of
this problem.

https://review.gluster.org/#/c/glusterfs/+/21314/

Regards,
Raghavendra


> > Regards,
> > Raghavendra
> >
> > On Wed, Sep 26, 2018 at 9:00 PM Shyam Ranganathan wrote:
> >
> > On 09/26/2018 10:21 AM, Shyam Ranganathan wrote:
> > > 2. Testing dashboard to maintain release health (new, thanks Nigel)
> > >   - Dashboard at [2]
> > >   - We already have 3 failures here as follows, needs attention
> from
> > > appropriate *maintainers*,
> > > (a)
> > >
> >
> https://build.gluster.org/job/regression-test-with-multiplex/871/consoleText
> > >   - Failed with core:
> > ./tests/basic/afr/gfid-mismatch-resolution-with-cli.t
> > > (b)
> > >
> >
> https://build.gluster.org/job/regression-test-with-multiplex/873/consoleText
> > >   - Failed with core: ./tests/bugs/snapshot/bug-1275616.t
> > >   - Also test ./tests/bugs/glusterd/validating-server-quorum.t
> > had to be
> > > retried
> >
> > I was looking at the cores from the above 2 instances; the one in job
> > 873 follows a typical pattern, where malloc fails because there is
> > internal header corruption in the free bins.
> >
> > When examining the victim that would have been allocated, it often
> > carries an incorrect size and other magic information. If the data in
> > the victim is investigated, it looks like a volfile.
> >
> > With the crash in 871, I thought there may be a point where this is
> > detected earlier, but I have not been able to make headway on that.
> >
> > So, what could 

[Gluster-devel] Infra Update for the last 2 weeks

2018-10-03 Thread Nigel Babu
Hello folks,

I meant to send this out on Monday, but it's been a busy few days.
* The infra pieces of distributed regression are now complete. A big shout
out to Deepshikha for driving this and to Ramky for his help in getting this
to completion.
* The GD2 container and CSI container builds work now. We still don't know
why they broke or why they started working again. We're tracking this in a
bug [1].
* Gluster-Infra now has a Sentry.io account, so we discover issues with
softserve or fstat very quickly and are able to debug them promptly.
* We're restarting our efforts to get a nightly Glusto job going and are
running into test failures. We are currently debugging them to separate
actual failures from infra issues.
* The infra team has been assisting gluster-ansible on and off to help them
build out a set of tests. This has been going steadily and is now waiting on
the infra team to set up CI with the CentOS CI team.
* From this sprint on, we're going to spend some time triaging the infra
bugs so they're assigned and in the correct state.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1626453

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] POC- Distributed regression testing framework

2018-10-03 Thread Deepshikha Khandelwal
Hello folks,

Distributed-regression job[1] is now a part of Gluster's
nightly-master build pipeline. The following are the issues we have
resolved since we started working on this:

1) Collecting gluster logs from servers.
2) Tests that failed due to infra-related issues have been fixed.
3) Time taken to run regression testing reduced to ~50-60 minutes.

To get the time down to 40 minutes, we need your help!

Currently, there is a test that is failing:

tests/bugs/glusterd/optimized-basic-testcases-in-cluster.t

This needs fixing first.

There's a test that takes 14 minutes to complete -
`tests/bugs/index/bug-1559004-EMLINK-handling.t`. A single test taking
14 minutes is not something we can distribute. Can we look at how we
can speed this up[2]? When this test fails, it is re-attempted,
further increasing the time. This happens in the regular
centos7-regression job as well.

If you see any other issues, please file a bug[3].

[1]: https://build.gluster.org/job/distributed-regression
[2]: https://build.gluster.org/job/distributed-regression/264/console
[3]: 
https://bugzilla.redhat.com/enter_bug.cgi?product=glusterfs&component=project-infrastructure

Thanks,
Deepshikha Khandelwal
On Tue, Jun 26, 2018 at 9:02 AM Nigel Babu  wrote:
>
>
>
> On Mon, Jun 25, 2018 at 7:28 PM Amar Tumballi  wrote:
>>
>>
>>
>>> There are currently a few known issues:
>>> * Not collecting the entire logs (/var/log/glusterfs) from servers.
>>
>>
>> If I look at the activities involved with regression failures, this can wait.
>
>
> Well, we can't debug the current failures without having the logs. So this 
> has to be fixed first.
>
>>
>>
>>>
>>> * A few tests fail due to infra-related issues like geo-rep tests.
>>
>>
>> Please open bugs for this, so we can track them and take them to closure.
>
>
> These are failing due to infra reasons, most likely subtle differences in the
> setup of these nodes vs. our normal nodes. We'll only be able to debug them
> once we get the logs. I know the geo-rep ones are easy to fix: the playbook
> for setting up geo-rep correctly just didn't make it over to the playbook
> used for these images.
>
>>
>>
>>>
>>> * Takes ~80 minutes with 7 distributed servers (targeting 60 minutes)
>>
>>
>> The time can change as more tests are added; also, please plan to support
>> any number of servers, from 1 to n.
>
>
> While n is configurable, it will be fixed to a single-digit number for now.
> We will need to place *some* limitation somewhere or else we'll end up not
> being able to control our cloud bills.
>
>>
>>
>>>
>>> * We've only tested plain regressions. ASAN and Valgrind are currently 
>>> untested.
>>
>>
>> It would be great to have it running not 'per patch', but nightly, or
>> weekly to start with.
>
>
> This is currently not targeted until we phase out current regressions.
>
>>>
>>>
>>> Before bringing it into production, we'll run this job nightly and
>>> watch it for a month to debug the other failures.
>>>
>>
>> I would say, bring it to production sooner, say in 2 weeks, and also plan to
>> keep the current regression as is, with a special command like 'run
>> regression in-one-machine' in gerrit (or something similar) with voting
>> rights, so we can fall back to this method if something is broken in
>> parallel testing.
>>
>> I have seen that regardless of the amount of time we put into testing
>> scripts, the day we move to production something will be broken. So let that
>> happen earlier rather than later, so it helps the next release branch out.
>> We don't want to be stuck at branching due to infra failures.
>
>
> Having two regression jobs that can vote is going to cause more confusion
> than it's worth. There are a couple of intermittent memory issues with the
> test script that we need to debug and fix before I'm comfortable making this
> job a voting job. We've worked around these problems for now, but they still
> pop up now and again. The fact that things break often is not an excuse to
> skip preventing avoidable failures. The one-month timeline was chosen with
> all these factors taken into consideration. The 2-week timeline is a no-go
> at this point.
>
> When we are ready to make the switch, we won't be switching 100% of the job. 
> We'll start with a sliding scale so that we can monitor failures and machine 
> creation adequately.
>
> --
> nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel