Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-17 Thread Erik Jacobson
Amar, Ravi -

> This thread has been one of the largest efforts to stabilize the systems in
> recent times.

Well, thanks to you guys too. It would have been easy to stop replying
when things got hard. I understand best-effort community support and
appreciate you sticking with us.

The test system I had is disappearing on Monday. However, a larger test
system will be less booked after a release finalizes. So I have a
test platform through early next week, and will again have something in a
couple weeks. I also may have a window at 1k nodes at a customer site
during a maintenance window... And we should have a couple big ones
going through the factory in the coming weeks at the 1k size. At 1k
nodes, we have 3 gluster servers.

THANKS AGAIN. Wow, what a relief.

Let me get these changes checked in so I can get them to some customers and
then look at getting a new thread going on the thread hangs.

Erik


> 
> Thanks for your patience and the number of retries you did, Erik!
> 
> Thanks indeed! Once https://review.gluster.org/#/c/glusterfs/+/24316/ gets
> merged on master, I will back port it to the release branches.
> 
> 
> We surely need to get to the glitch you found with the 7.4 version, as
> with every higher version, we expect more stability!
> 
> True, maybe we should start a separate thread...
> 
> Regards,
> Ravi
> 
> Regards,
> Amar
> 
> On Fri, Apr 17, 2020 at 2:46 AM Erik Jacobson 
> wrote:
> 
> I have some news.
> 
> After many, many, many trials (reboots of gluster servers, reboots of
> nodes...)
> under conditions that should have reproduced the issue several times, I
> think we're stable.
> 
> It appears this glusterfs nfs daemon hang only happens in glusterfs74
> and not 72.
> 
> So
> 1) Your split-brain patch
> 2) performance.parallel-readdir off
> 3) glusterfs72
> 
> I declare it stable. I can't make it fail: split-brain, hang, nor seg
> fault
> with one leader down.
> 
> I'm working on putting this into a SW update.
> 
> We are going to test if performance.parallel-readdir off impacts
> booting
> at scale but we don't have a system to try it on at this time.
> 
> THANK YOU!
> 
> I may have access to the 57 node test system if there is something
> you'd
> like me to try with regards to why glusterfs74 is unstable in this
> situation. Just let me know.
> 
> Erik
> 
> On Thu, Apr 16, 2020 at 12:03:33PM -0500, Erik Jacobson wrote:
> > So in my test runs since making that change, we have a different odd
> > behavior now. As you recall, this is with your patch -- still not
> > split-brain -- and now with performance.parallel-readdir off
> >
> > The NFS server grinds to a halt after a few test runs. It does not
> core
> > dump.
> >
> > All that shows up in the log is:
> >
> > "pending frames:" with nothing after it and no date stamp.
> >
> > I will start looking for interesting breakpoints, I guess.
> >
> >
> > The glusterfs for nfs is still alive:
> >
> > root 30541 1 42 09:57 ? 00:51:06 /usr/sbin/glusterfs
> -s localhost --volfile-id gluster/nfs -p /var/run/gluster/nfs/nfs.pid
> -l /var/log/glusterfs/nfs.log -S /var/run/gluster/
> 9ddb5561058ff543.socket
> >
> >
> >
> > [root@leader3 ~]# strace -f  -p 30541
> > strace: Process 30541 attached with 40 threads
> > [pid 30580] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30579] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30578] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30577] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30576] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30575] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30574] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30573] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30572] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30571] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30570] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30569] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30568] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30567] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30566] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30565] futex(0x7f8904035f60, 

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-16 Thread Ravishankar N


On 17/04/20 10:35 am, Amar Tumballi wrote:
This thread has been one of the largest efforts to stabilize the
systems in recent times.


Thanks for your patience and the number of retries you did, Erik!
Thanks indeed! Once https://review.gluster.org/#/c/glusterfs/+/24316/ 
gets merged on master, I will back port it to the release branches.


We surely need to get to the glitch you found with the 7.4 version, as 
with every higher version, we expect more stability!


True, maybe we should start a separate thread...

Regards,
Ravi

Regards,
Amar

On Fri, Apr 17, 2020 at 2:46 AM Erik Jacobson wrote:


I have some news.

After many, many, many trials (reboots of gluster servers, reboots
of nodes...)
under conditions that should have reproduced the issue several times, I
think we're stable.

It appears this glusterfs nfs daemon hang only happens in glusterfs74
and not 72.

So
1) Your split-brain patch
2) performance.parallel-readdir off
3) glusterfs72

I declare it stable. I can't make it fail: split-brain, hang, nor
seg fault
with one leader down.

I'm working on putting this into a SW update.

We are going to test if performance.parallel-readdir off impacts
booting
at scale but we don't have a system to try it on at this time.

THANK YOU!

I may have access to the 57 node test system if there is something
you'd
like me to try with regards to why glusterfs74 is unstable in this
situation. Just let me know.

Erik

On Thu, Apr 16, 2020 at 12:03:33PM -0500, Erik Jacobson wrote:
> So in my test runs since making that change, we have a different odd
> behavior now. As you recall, this is with your patch -- still not
> split-brain -- and now with performance.parallel-readdir off
>
> The NFS server grinds to a halt after a few test runs. It does
not core
> dump.
>
> All that shows up in the log is:
>
> "pending frames:" with nothing after it and no date stamp.
>
> I will start looking for interesting breakpoints, I guess.
>
>
> The glusterfs for nfs is still alive:
>
> root     30541     1 42 09:57 ?        00:51:06
/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p
/var/run/gluster/nfs/nfs.pid -l /var/log/glusterfs/nfs.log -S
/var/run/gluster/9ddb5561058ff543.socket
>
>
>
> [root@leader3 ~]# strace -f  -p 30541
> strace: Process 30541 attached with 40 threads
> [pid 30580] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30579] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30578] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30577] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30576] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30575] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30574] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30573] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30572] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30571] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30570] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30569] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30568] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30567] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30566] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30565] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30564] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30563] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30562] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30561] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30560] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30559] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30558] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30557] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30556] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30555] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30554] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30553] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30552] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30551] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30550] restart_syscall(<... resuming interrupted
restart_syscall ...> 
> [pid 30549] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30548] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL

> [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=0,

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-16 Thread Amar Tumballi
This thread has been one of the largest efforts to stabilize the systems in
recent times.

Thanks for your patience and the number of retries you did, Erik!

We surely need to get to the glitch you found with the 7.4 version, as with
every higher version, we expect more stability!

Regards,
Amar

On Fri, Apr 17, 2020 at 2:46 AM Erik Jacobson  wrote:

> I have some news.
>
> After many, many, many trials (reboots of gluster servers, reboots of
> nodes...)
> under conditions that should have reproduced the issue several times, I
> think we're stable.
>
> It appears this glusterfs nfs daemon hang only happens in glusterfs74
> and not 72.
>
> So
> 1) Your split-brain patch
> 2) performance.parallel-readdir off
> 3) glusterfs72
>
> I declare it stable. I can't make it fail: split-brain, hang, nor seg
> fault
> with one leader down.
>
> I'm working on putting this into a SW update.
>
> We are going to test if performance.parallel-readdir off impacts booting
> at scale but we don't have a system to try it on at this time.
>
> THANK YOU!
>
> I may have access to the 57 node test system if there is something you'd
> like me to try with regards to why glusterfs74 is unstable in this
> situation. Just let me know.
>
> Erik
>
> On Thu, Apr 16, 2020 at 12:03:33PM -0500, Erik Jacobson wrote:
> > So in my test runs since making that change, we have a different odd
> > behavior now. As you recall, this is with your patch -- still not
> > split-brain -- and now with performance.parallel-readdir off
> >
> > The NFS server grinds to a halt after a few test runs. It does not core
> > dump.
> >
> > All that shows up in the log is:
> >
> > "pending frames:" with nothing after it and no date stamp.
> >
> > I will start looking for interesting breakpoints, I guess.
> >
> >
> > The glusterfs for nfs is still alive:
> >
> > root 30541 1 42 09:57 ? 00:51:06 /usr/sbin/glusterfs -s
> localhost --volfile-id gluster/nfs -p /var/run/gluster/nfs/nfs.pid -l
> /var/log/glusterfs/nfs.log -S /var/run/gluster/9ddb5561058ff543.socket
> >
> >
> >
> > [root@leader3 ~]# strace -f  -p 30541
> > strace: Process 30541 attached with 40 threads
> > [pid 30580] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30579] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30578] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30577] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30576] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30575] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30574] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30573] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30572] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30571] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30570] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30569] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30568] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30567] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30566] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30565] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30564] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30563] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30562] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30561] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30560] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30559] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30558] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30557] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30556] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30555] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30554] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30553] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30552] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30551] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30550] restart_syscall(<... resuming interrupted restart_syscall
> ...> 
> > [pid 30549] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30548] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=243775}
> 
> > [pid 30546] restart_syscall(<... resuming interrupted restart_syscall
> ...> 
> > [pid 30545] restart_syscall(<... resuming interrupted restart_syscall
> ...> 
> > [pid 30544] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30543] rt_sigtimedwait([HUP INT USR1 USR2 TERM],  
> > [pid 30542] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30541] futex(0x7f890c3a39d0, FUTEX_WAIT, 30548, NULL  ...>
> > [pid 30547] <... select resumed> )  = 0 (Timeout)
> > [pid 30547] 

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-16 Thread Erik Jacobson
I have some news.

After many, many, many trials (reboots of gluster servers, reboots of nodes...)
under conditions that should have reproduced the issue several times, I think we're
stable.

It appears this glusterfs nfs daemon hang only happens in glusterfs74
and not 72.

So
1) Your split-brain patch
2) performance.parallel-readdir off
3) glusterfs72

I declare it stable. I can't make it fail: split-brain, hang, nor seg fault
with one leader down.
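
For anyone who wants to reproduce this configuration, here is a rough sketch
of the CLI side (only item 2 above is a volume option; the patch and the
glusterfs72 packages are separate steps, and cm_shared is the volume name we
use elsewhere in this thread):

# Sketch only -- adjust the volume name for your setup.
# Keep parallel-readdir off so readdir-ahead stays out of the gnfs graph:
gluster volume set cm_shared performance.parallel-readdir off

# Confirm what value is actually in effect:
gluster volume get cm_shared performance.parallel-readdir

# Confirm the installed server package level (we ship the 7.2 series):
rpm -q glusterfs-server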

I'm working on putting this into a SW update.

We are going to test if performance.parallel-readdir off impacts booting
at scale but we don't have a system to try it on at this time.

THANK YOU!

I may have access to the 57 node test system if there is something you'd
like me to try with regards to why glusterfs74 is unstable in this
situation. Just let me know.

Erik

On Thu, Apr 16, 2020 at 12:03:33PM -0500, Erik Jacobson wrote:
> So in my test runs since making that change, we have a different odd
> behavior now. As you recall, this is with your patch -- still not
> split-brain -- and now with performance.parallel-readdir off
> 
> The NFS server grinds to a halt after a few test runs. It does not core
> dump.
> 
> All that shows up in the log is:
> 
> "pending frames:" with nothing after it and no date stamp.
> 
> I will start looking for interesting breakpoints, I guess.
> 
> 
> The glusterfs for nfs is still alive:
> 
> root 30541 1 42 09:57 ? 00:51:06 /usr/sbin/glusterfs -s
> localhost --volfile-id gluster/nfs -p /var/run/gluster/nfs/nfs.pid -l 
> /var/log/glusterfs/nfs.log -S /var/run/gluster/9ddb5561058ff543.socket
> 
> 
> 
> [root@leader3 ~]# strace -f  -p 30541
> strace: Process 30541 attached with 40 threads
> [pid 30580] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30579] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30578] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30577] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30576] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30575] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30574] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30573] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30572] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30571] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30570] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30569] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30568] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30567] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30566] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30565] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30564] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30563] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30562] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30561] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30560] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30559] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30558] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30557] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30556] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30555] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30554] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30553] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30552] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30551] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30550] restart_syscall(<... resuming interrupted restart_syscall ...> 
> 
> [pid 30549] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30548] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=243775} 
> 
> [pid 30546] restart_syscall(<... resuming interrupted restart_syscall ...> 
> 
> [pid 30545] restart_syscall(<... resuming interrupted restart_syscall ...> 
> 
> [pid 30544] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30543] rt_sigtimedwait([HUP INT USR1 USR2 TERM],  
> [pid 30542] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30541] futex(0x7f890c3a39d0, FUTEX_WAIT, 30548, NULL 
> [pid 30547] <... select resumed> )  = 0 (Timeout)
> [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
> [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
> [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
> [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
> [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
> [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
> [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
> [pid 30547] select(0, NULL, NULL, NULL, 

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-16 Thread Erik Jacobson
So in my test runs since making that change, we have a different odd
behavior now. As you recall, this is with your patch -- still not
split-brain -- and now with performance.parallel-readdir off

The NFS server grinds to a halt after a few test runs. It does not core
dump.

All that shows up in the log is:

"pending frames:" with nothing after it and no date stamp.

I will start looking for interesting breakpoints, I guess.
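
Roughly what I plan to try first, in case anyone has a better idea (30541 is
the gnfs PID from the ps output below; the statedump location assumes the
default run directory):

# Grab backtraces of all threads without stopping the daemon for long:
gdb -p 30541 -batch -ex 'thread apply all bt' > /tmp/gnfs-threads.txt

# Ask the gluster process for a statedump (pending frames, call stacks,
# mem pools); it should land under /var/run/gluster/ by default:
kill -USR1 30541
ls -lt /var/run/gluster/ | head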


The glusterfs for nfs is still alive:

root 30541 1 42 09:57 ? 00:51:06 /usr/sbin/glusterfs -s
localhost --volfile-id gluster/nfs -p /var/run/gluster/nfs/nfs.pid -l 
/var/log/glusterfs/nfs.log -S /var/run/gluster/9ddb5561058ff543.socket



[root@leader3 ~]# strace -f  -p 30541
strace: Process 30541 attached with 40 threads
[pid 30580] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30579] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30578] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30577] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30576] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30575] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30574] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30573] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30572] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30571] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30570] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30569] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30568] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30567] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30566] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30565] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30564] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30563] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30562] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30561] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30560] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30559] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30558] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30557] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30556] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30555] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30554] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30553] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30552] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30551] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30550] restart_syscall(<... resuming interrupted restart_syscall ...> 

[pid 30549] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30548] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=243775} 
[pid 30546] restart_syscall(<... resuming interrupted restart_syscall ...> 

[pid 30545] restart_syscall(<... resuming interrupted restart_syscall ...> 

[pid 30544] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30543] rt_sigtimedwait([HUP INT USR1 USR2 TERM],  
[pid 30542] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL 
[pid 30541] futex(0x7f890c3a39d0, FUTEX_WAIT, 30548, NULL 
[pid 30547] <... select resumed> )  = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}^Cstrace: Process 
30541 detached
strace: Process 30542 detached
strace: Process 30543 detached
strace: Process 30544 detached
strace: Process 30545 detached
strace: Process 30546 detached
strace: Process 30547 detached
 
strace: Process 30548 detached
strace: Process 30549 detached
strace: Process 30550 detached
strace: Process 30551 detached
strace: Process 30552 detached
strace: Process 30553 detached
strace: Process 30554 detached
strace: Process 30555 detached
strace: Process 30556 detached
strace: Process 30557 detached
strace: Process 30558 

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-16 Thread Ravishankar N



On 16/04/20 8:04 pm, Erik Jacobson wrote:

Quick update just on how this got set.

gluster volume set cm_shared performance.parallel-readdir on

That is something we turned on, thinking it might make our NFS services
faster and not knowing it could have a negative effect.

Below is a diff of the nfs volume file ON vs OFF. So I will simply turn
this OFF and do a test run.
Yes, that should do it. I am not sure if performance.parallel-readdir was 
intentionally made to have an effect on gnfs volfiles. Usually, for 
other performance xlators, `gluster volume set` only changes the fuse 
volfile.
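
If you want to double-check on your setup, saving the generated gnfs volfile,
toggling the option, and diffing should show exactly what changes (paths
assume the default glusterd working directory and your cm_shared volume):

cp /var/lib/glusterd/nfs/nfs-server.vol /tmp/nfs-server.vol-ORIG

gluster volume set cm_shared performance.parallel-readdir off

# The regenerated gnfs volfile should no longer contain the readdir-ahead xlator:
diff -u /tmp/nfs-server.vol-ORIG /var/lib/glusterd/nfs/nfs-server.vol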






Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-16 Thread Erik Jacobson
Quick update just on how this got set.

gluster volume set cm_shared performance.parallel-readdir on

That is something we turned on, thinking it might make our NFS services
faster and not knowing it could have a negative effect.

Below is a diff of the nfs volume file ON vs OFF. So I will simply turn
this OFF and do a test run. Does this look correct? I will start testing
with this turned OFF. Thank you!

[root@leader1 nfs]# diff -u /tmp/nfs-server.vol-ORIG nfs-server.vol
--- /tmp/nfs-server.vol-ORIG    2020-04-16 09:28:56.855309870 -0500
+++ nfs-server.vol  2020-04-16 09:29:14.267289600 -0500
@@ -60,21 +60,13 @@
 subvolumes cm_shared-client-0 cm_shared-client-1 cm_shared-client-2
 end-volume

-volume cm_shared-readdir-ahead-0
-type performance/readdir-ahead
-option rda-cache-limit 10MB
-option rda-request-size 131072
-option parallel-readdir on
-subvolumes cm_shared-replicate-0
-end-volume
-
 volume cm_shared-dht
 type cluster/distribute
 option force-migration off
 option lock-migration off
 option lookup-optimize on
 option lookup-unhashed auto
-subvolumes cm_shared-readdir-ahead-0
+subvolumes cm_shared-replicate-0
 end-volume

 volume cm_shared-utime


On Thu, Apr 16, 2020 at 06:58:01PM +0530, Ravishankar N wrote:
> 
> On 16/04/20 6:54 pm, Erik Jacobson wrote:
> > > The patch by itself is only making changes specific to AFR, so it should 
> > > not
> > > affect other translators. But I wonder how readdir-ahead is enabled in 
> > > your
> > > gnfs stack. All performance xlators are turned off in gnfs except
> > > write-behind and AFAIK, there is no way to enable them via the CLI. Did 
> > > you
> > > custom edit your gnfs volfile to add readdir-ahead? If yes, does the crash
> > > go-away if you remove the xlator from the nfs volfile?
> > Thank you. A quick reply: I will go research how to do this;
> > I've never hand-edited a volume file before. I've never even really looked at
> > the gnfs volfile before.
> > 
> > There are no custom code changes or hand edits.
> > 
> > More soon.
> > 
> Okay, /var/lib/glusterd/nfs/nfs-server.vol is the file you want to look at
> if you are using gnfs.
> 
> -Ravi








Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-16 Thread Ravishankar N



On 16/04/20 6:54 pm, Erik Jacobson wrote:

The patch by itself is only making changes specific to AFR, so it should not
affect other translators. But I wonder how readdir-ahead is enabled in your
gnfs stack. All performance xlators are turned off in gnfs except
write-behind and AFAIK, there is no way to enable them via the CLI. Did you
custom edit your gnfs volfile to add readdir-ahead? If yes, does the crash
go-away if you remove the xlator from the nfs volfile?

Thank you. A quick reply: I will go research how to do this;
I've never hand-edited a volume file before. I've never even really looked at
the gnfs volfile before.

There are no custom code changes or hand edits.

More soon.

Okay, /var/lib/glusterd/nfs/nfs-server.vol is the file you want to look 
at if you are using gnfs.
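
A quick way to see which xlators ended up in that graph (the grep patterns
below are just a convenience, not a gluster tool):

# Show every translator in the generated gnfs graph along with its type:
grep -E '^volume |^[[:space:]]*type ' /var/lib/glusterd/nfs/nfs-server.vol

# Or check for readdir-ahead specifically:
grep -n readdir-ahead /var/lib/glusterd/nfs/nfs-server.vol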


-Ravi







Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-16 Thread Erik Jacobson
> The patch by itself is only making changes specific to AFR, so it should not
> affect other translators. But I wonder how readdir-ahead is enabled in your
> gnfs stack. All performance xlators are turned off in gnfs except
> write-behind and AFAIK, there is no way to enable them via the CLI. Did you
> custom edit your gnfs volfile to add readdir-ahead? If yes, does the crash
> go-away if you remove the xlator from the nfs volfile?

Thank you. A quick reply: I will go research how to do this;
I've never hand-edited a volume file before. I've never even really looked at
the gnfs volfile before.

There are no custom code changes or hand edits.

More soon.






Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-16 Thread Ravishankar N
The patch by itself is only making changes specific to AFR, so it should 
not affect other translators. But I wonder how readdir-ahead is enabled 
in your gnfs stack. All performance xlators are turned off in gnfs 
except write-behind and AFAIK, there is no way to enable them via the 
CLI. Did you custom edit your gnfs volfile to add readdir-ahead? If yes, 
does the crash go-away if you remove the xlator from the nfs volfile?


Regards,
Ravi
On 16/04/20 8:47 am, Erik Jacobson wrote:

It is important to note that our testing has shown zero split-brain
errors since the patch... And that it is significantly harder to
hit the seg fault than it was to hit split-brain before. It's still
sufficiently frequent that we can't let it out the door.  In my intensive
test case (found elsewhere in the thread), it would 100% hit the problem
with 57 nodes every time at least once. With the patch, zero split
brain, but maybe 1 in 4 runs would seg fault. We didn't have a seg
fault problem previously. This is all within the context of 1 of the 3
servers in the subvolume being down. I hit the seg fault once with just
57 nodes booting (using NFS for their root FS) and no other load.


Scott was able to take an analysis pass. Any suggestions? His words
follow:


The segfault appears to occur in read-ahead functionality.  We will keep
the core in case it needs to be looked at again, being sure to copy off
all necessary metadata to maintain adequate symbol lookup within gdb.
It may also be possible to breakpoint immediately prior to the segfault,
but setting the right conditions may prove to be difficult.

A bit of analysis:

Prior to the segfault, the op_errno field in a struct rda_fd_ctx packet
shows an ENOENT error.  The packet is from the call_frame_t parameter of
rda_fill_fd_cbk() (Backtrace #2).  The following shows the progression
from the call_frame_t parameter to the op_errno field of the rda_fd_ctx
structure.

(gdb) print {call_frame_t}0x7fe5acf18eb8
$26 = {root = 0x7fe5ac6d65f8, parent = 0x0, frames = {next =
0x7fe5ac6d6cf0, prev = 0x7fe5ac096298}, local = 0x7fe5ac1dbc78,
    this = 0x7fe63c0162b0, ret = 0x0, ref_count = 0, lock = {spinlock =
0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
      __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list =
{__prev = 0x0, __next = 0x0}}, __size = '\000' ,
    __align = 0}}, cookie = 0x0, complete = false, op = GF_FOP_NULL,
begin = {tv_sec = 4234, tv_nsec = 637078332}, end = {tv_sec = 4234,
      tv_nsec = 803882781}, wind_from = 0x0, wind_to = 0x0, unwind_from =
0x0, unwind_to = 0x0}

(gdb) print {struct rda_local}0x7fe5ac1dbc78
$27 = {ctx = 0x7fe5ace46590, fd = 0x7fe60433d8b8, xattrs = 0x0, inode =
0x0, offset = 0, generation = 0, skip_dir = 0}

(gdb) print {struct rda_fd_ctx}0x7fe5ace46590
$28 = {cur_offset = 0, cur_size = 638, next_offset = 1538, state = 36,
lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0,
      __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision =
0, __list = {__prev = 0x0, __next = 0x0}},
    __size = '\000' , __align = 0}}, entries =
{{list = {next = 0x7fe60cda5f90, prev = 0x7fe60ca08190}, {
      next = 0x7fe60cda5f90, prev = 0x7fe60ca08190}}, d_ino = 0,
d_off = 0, d_len = 0, d_type = 0, d_stat = {ia_flags = 0, ia_ino = 0,
    ia_dev = 0, ia_rdev = 0, ia_size = 0, ia_nlink = 0, ia_uid = 0,
ia_gid = 0, ia_blksize = 0, ia_blocks = 0, ia_atime = 0,
    ia_mtime = 0, ia_ctime = 0, ia_btime = 0, ia_atime_nsec = 0,
ia_mtime_nsec = 0, ia_ctime_nsec = 0, ia_btime_nsec = 0,
    ia_attributes = 0, ia_attributes_mask = 0, ia_gfid = '\000'
, ia_type = IA_INVAL, ia_prot = {suid = 0 '\000',
      sgid = 0 '\000', sticky = 0 '\000', owner = {read = 0 '\000',
write = 0 '\000', exec = 0 '\000'}, group = {read = 0 '\000',
    write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000',
write = 0 '\000', exec = 0 '\000'}}}, dict = 0x0, inode = 0x0,
      d_name = 0x7fe5ace466a8 ""}, fill_frame = 0x0, stub = 0x0, op_errno
= 2, xattrs = 0x0, writes_during_prefetch = 0x0, prefetching = {
      lk = 0x7fe5ace466d0 "", value = 0}}

The segfault occurs at the bottom of rda_fill_fd_cbk() where the rpc
call stack frames are being destroyed.  The following are what I believe
to be the three frames that are intended to be destroyed, but it is
unclear which packet is causing the problem.  If I were to dig more into
this, I would use ddd (a graphical debugger).  It's been a while since I've
done low-level debugging like this, so I'm a bit rusty.

(gdb) print {call_frame_t}0x7fe5acf18eb8
$34 = {root = 0x7fe5ac6d65f8, parent = 0x0, frames = {next =
0x7fe5ac6d6cf0, prev = 0x7fe5ac096298}, local = 0x7fe5ac1dbc78,
    this = 0x7fe63c0162b0, ret = 0x0, ref_count = 0, lock = {spinlock =
0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
      __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list =
{__prev = 0x0, __next = 0x0}}, __size = '\000' ,
    __align = 0}}, cookie = 0x0, 

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-15 Thread Erik Jacobson
It is important to note that our testing has shown zero split-brain
errors since the patch... And that it is significantly harder to
hit the seg fault than it was to hit split-brain before. It's still
sufficiently frequent that we can't let it out the door.  In my intensive
test case (found elsewhere in the thread), it would 100% hit the problem
with 57 nodes every time at least once. With the patch, zero split
brain, but maybe 1 in 4 runs would seg fault. We didn't have a seg
fault problem previously. This is all within the context of 1 of the 3
servers in the subvolume being down. I hit the seg fault once with just
57 nodes booting (using NFS for their root FS) and no other load.


Scott was able to take an analysis pass. Any suggestions? His words
follow:


The segfault appears to occur in read-ahead functionality.  We will keep 
the core in case it needs to be looked at again, being sure to copy off 
all necessary metadata to maintain adequate symbol lookup within gdb.  
It may also be possible to breakpoint immediately prior to the segfault, 
but setting the right conditions may prove to be difficult.
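
If we do try to catch it live, something along these lines is what I have in
mind; it is only a sketch, and the breakpoint condition is a guess based on
the ENOENT noted in the analysis below, so it would likely need tuning:

# Attach to the gnfs process and stop in rda_fill_fd_cbk when the callback
# reports ENOENT (op_errno == 2):
gdb -p "$(pgrep -f 'volfile-id gluster/nfs')" \
    -ex 'break rda_fill_fd_cbk if op_errno == 2' \
    -ex 'continue'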

A bit of analysis:

Prior to the segfault, the op_errno field in a struct rda_fd_ctx packet 
shows an ENOENT error.  The packet is from the call_frame_t parameter of 
rda_fill_fd_cbk() (Backtrace #2).  The following shows the progression
from the call_frame_t parameter to the op_errno field of the rda_fd_ctx 
structure.

(gdb) print {call_frame_t}0x7fe5acf18eb8
$26 = {root = 0x7fe5ac6d65f8, parent = 0x0, frames = {next = 
0x7fe5ac6d6cf0, prev = 0x7fe5ac096298}, local = 0x7fe5ac1dbc78,
   this = 0x7fe63c0162b0, ret = 0x0, ref_count = 0, lock = {spinlock = 
0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
     __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = 
{__prev = 0x0, __next = 0x0}}, __size = '\000' ,
   __align = 0}}, cookie = 0x0, complete = false, op = GF_FOP_NULL, 
begin = {tv_sec = 4234, tv_nsec = 637078332}, end = {tv_sec = 4234,
     tv_nsec = 803882781}, wind_from = 0x0, wind_to = 0x0, unwind_from = 
0x0, unwind_to = 0x0}

(gdb) print {struct rda_local}0x7fe5ac1dbc78
$27 = {ctx = 0x7fe5ace46590, fd = 0x7fe60433d8b8, xattrs = 0x0, inode = 
0x0, offset = 0, generation = 0, skip_dir = 0}

(gdb) print {struct rda_fd_ctx}0x7fe5ace46590
$28 = {cur_offset = 0, cur_size = 638, next_offset = 1538, state = 36, 
lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0,
     __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 
0, __list = {__prev = 0x0, __next = 0x0}},
   __size = '\000' , __align = 0}}, entries = 
{{list = {next = 0x7fe60cda5f90, prev = 0x7fe60ca08190}, {
     next = 0x7fe60cda5f90, prev = 0x7fe60ca08190}}, d_ino = 0, 
d_off = 0, d_len = 0, d_type = 0, d_stat = {ia_flags = 0, ia_ino = 0,
   ia_dev = 0, ia_rdev = 0, ia_size = 0, ia_nlink = 0, ia_uid = 0, 
ia_gid = 0, ia_blksize = 0, ia_blocks = 0, ia_atime = 0,
   ia_mtime = 0, ia_ctime = 0, ia_btime = 0, ia_atime_nsec = 0, 
ia_mtime_nsec = 0, ia_ctime_nsec = 0, ia_btime_nsec = 0,
   ia_attributes = 0, ia_attributes_mask = 0, ia_gfid = '\000' 
, ia_type = IA_INVAL, ia_prot = {suid = 0 '\000',
     sgid = 0 '\000', sticky = 0 '\000', owner = {read = 0 '\000', 
write = 0 '\000', exec = 0 '\000'}, group = {read = 0 '\000',
   write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', 
write = 0 '\000', exec = 0 '\000'}}}, dict = 0x0, inode = 0x0,
     d_name = 0x7fe5ace466a8 ""}, fill_frame = 0x0, stub = 0x0, op_errno 
= 2, xattrs = 0x0, writes_during_prefetch = 0x0, prefetching = {
     lk = 0x7fe5ace466d0 "", value = 0}}

The segfault occurs at the bottom of rda_fill_fd_cbk() where the rpc 
call stack frames are being destroyed.  The following are what I believe 
to be the three frames that are intended to be destroyed, but it is 
unclear which packet is causing the problem.  If I were to dig more into 
this, I would use ddd (a graphical debugger).  It's been a while since I've
done low-level debugging like this, so I'm a bit rusty.

(gdb) print {call_frame_t}0x7fe5acf18eb8
$34 = {root = 0x7fe5ac6d65f8, parent = 0x0, frames = {next = 
0x7fe5ac6d6cf0, prev = 0x7fe5ac096298}, local = 0x7fe5ac1dbc78,
   this = 0x7fe63c0162b0, ret = 0x0, ref_count = 0, lock = {spinlock = 
0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
     __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = 
{__prev = 0x0, __next = 0x0}}, __size = '\000' ,
   __align = 0}}, cookie = 0x0, complete = false, op = GF_FOP_NULL, 
begin = {tv_sec = 4234, tv_nsec = 637078332}, end = {tv_sec = 4234,
     tv_nsec = 803882781}, wind_from = 0x0, wind_to = 0x0, unwind_from = 
0x0, unwind_to = 0x0}
(gdb) print {call_frame_t}0x7fe5ac6d6ce0
$35 = {root = 0x0, parent = 0x563f5a955920, frames = {next = 
0x7fe5ac096298, prev = 0x7fe5acf18ec8}, local = 0x0, this = 0x108a,
   ret = 0x25f90b3c, ref_count = 0, lock = {spinlock = 0, mutex = 
{__data = {__lock = 0, __count = 0, 

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-15 Thread Erik Jacobson
> The new split-brain issue is much harder to reproduce, but after several

(correcting to say new seg fault issue, the split brain is gone!!)

> intense runs, it usually hits once.
> 
> We switched to pure gluster74 plus your patch so we're apples to apples
> now.
> 
> I'm going to see if Scott can help debug it.
> 
> Here is the back trace info from the core dump:
> 
> -rw-r-  1 root root 1.9G Apr 15 12:40 
> core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.158697232400
> -rw-r-  1 root root 221M Apr 15 12:40 
> core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.158697232400.lz4
> drwxrwxrwt  9 root root  20K Apr 15 12:40 .
> [root@leader3 tmp]#
> [root@leader3 tmp]#
> [root@leader3 tmp]# gdb 
> core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.158697232400
> GNU gdb (GDB) Red Hat Enterprise Linux 8.2-5.el8
> Copyright (C) 2018 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
> Type "show copying" and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> .
> Find the GDB manual and other documentation resources online at:
> .
> 
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> [New LWP 61102]
> [New LWP 61085]
> [New LWP 61087]
> [New LWP 61117]
> [New LWP 61086]
> [New LWP 61108]
> [New LWP 61089]
> [New LWP 61090]
> [New LWP 61121]
> [New LWP 61088]
> [New LWP 61091]
> [New LWP 61093]
> [New LWP 61095]
> [New LWP 61092]
> [New LWP 61094]
> [New LWP 61098]
> [New LWP 61096]
> [New LWP 61097]
> [New LWP 61084]
> [New LWP 61100]
> [New LWP 61103]
> [New LWP 61104]
> [New LWP 61099]
> [New LWP 61105]
> [New LWP 61101]
> [New LWP 61106]
> [New LWP 61109]
> [New LWP 61107]
> [New LWP 61112]
> [New LWP 61119]
> [New LWP 61110]
> [New LWP 6]
> [New LWP 61118]
> [New LWP 61123]
> [New LWP 61122]
> [New LWP 61113]
> [New LWP 61114]
> [New LWP 61120]
> [New LWP 61116]
> [New LWP 61115]
> 
> warning: core file may not match specified executable file.
> Reading symbols from /usr/sbin/glusterfsd...Reading symbols from 
> /usr/lib/debug/usr/sbin/glusterfsd-7.4-1.el8722.0800.200415T1052.a.rhel8hpeerikj.x86_64.debug...done.
> done.
> 
> warning: Ignoring non-absolute filename: 
> Missing separate debuginfo for linux-vdso.so.1
> Try: dnf --enablerepo='*debug*' install 
> /usr/lib/debug/.build-id/06/44254f9cbaa826db070a796046026adba58266
> 
> warning: Loadable section ".note.gnu.property" outside of ELF segments
> 
> warning: Loadable section ".note.gnu.property" outside of ELF segments
> 
> warning: Loadable section ".note.gnu.property" outside of ELF segments
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> 
> warning: Loadable section ".note.gnu.property" outside of ELF segments
> 
> warning: Loadable section ".note.gnu.property" outside of ELF segments
> 
> warning: Loadable section ".note.gnu.property" outside of ELF segments
> 
> warning: Loadable section ".note.gnu.property" outside of ELF segments
> 
> warning: Loadable section ".note.gnu.property" outside of ELF segments
> 
> warning: Loadable section ".note.gnu.property" outside of ELF segments
> 
> warning: Loadable section ".note.gnu.property" outside of ELF segments
> 
> warning: Loadable section ".note.gnu.property" outside of ELF segments
> 
> warning: Loadable section ".note.gnu.property" outside of ELF segments
> 
> warning: Loadable section ".note.gnu.property" outside of ELF segments
> 
> warning: Loadable section ".note.gnu.property" outside of ELF segments
> 
> warning: Loadable section ".note.gnu.property" outside of ELF segments
> Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id 
> gluster/nfs -p /var/run/gluster/n'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x7fe63bb5d7bb in FRAME_DESTROY (frame=0x7fe5ac096288)
> at ../../../../libglusterfs/src/glusterfs/stack.h:193
> 193   FRAME_DESTROY(frame);
> [Current thread is 1 (Thread 0x7fe617fff700 (LWP 61102))]
> Missing separate debuginfos, use: dnf debuginfo-install 
> glibc-2.28-42.el8.x86_64 keyutils-libs-1.5.10-6.el8.x86_64 
> krb5-libs-1.16.1-22.el8.x86_64 libacl-2.2.53-1.el8.x86_64 
> libattr-2.4.48-3.el8.x86_64 libcom_err-1.44.3-2.el8.x86_64 
> libgcc-8.2.1-3.5.el8.x86_64 libselinux-2.8-6.el8.x86_64 
> libtirpc-1.1.4-3.el8.x86_64 libuuid-2.32.1-8.el8.x86_64 
> openssl-libs-1.1.1-8.el8.x86_64 pcre2-10.32-1.el8.x86_64 
> zlib-1.2.11-10.el8.x86_64
> (gdb) bt
> #0  0x7fe63bb5d7bb in FRAME_DESTROY (frame=0x7fe5ac096288)
> at ../../../../libglusterfs/src/glusterfs/stack.h:193
> #1  

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-15 Thread Erik Jacobson
The new split-brain issue is much harder to reproduce, but after several
intense runs, it usually hits once.

We switched to pure gluster74 plus your patch so we're apples to apples
now.

I'm going to see if Scott can help debug it.

Here is the back trace info from the core dump:

-rw-r-  1 root root 1.9G Apr 15 12:40 
core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.158697232400
-rw-r-  1 root root 221M Apr 15 12:40 
core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.158697232400.lz4
drwxrwxrwt  9 root root  20K Apr 15 12:40 .
[root@leader3 tmp]#
[root@leader3 tmp]#
[root@leader3 tmp]# gdb 
core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.158697232400
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-5.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
[New LWP 61102]
[New LWP 61085]
[New LWP 61087]
[New LWP 61117]
[New LWP 61086]
[New LWP 61108]
[New LWP 61089]
[New LWP 61090]
[New LWP 61121]
[New LWP 61088]
[New LWP 61091]
[New LWP 61093]
[New LWP 61095]
[New LWP 61092]
[New LWP 61094]
[New LWP 61098]
[New LWP 61096]
[New LWP 61097]
[New LWP 61084]
[New LWP 61100]
[New LWP 61103]
[New LWP 61104]
[New LWP 61099]
[New LWP 61105]
[New LWP 61101]
[New LWP 61106]
[New LWP 61109]
[New LWP 61107]
[New LWP 61112]
[New LWP 61119]
[New LWP 61110]
[New LWP 6]
[New LWP 61118]
[New LWP 61123]
[New LWP 61122]
[New LWP 61113]
[New LWP 61114]
[New LWP 61120]
[New LWP 61116]
[New LWP 61115]

warning: core file may not match specified executable file.
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from 
/usr/lib/debug/usr/sbin/glusterfsd-7.4-1.el8722.0800.200415T1052.a.rhel8hpeerikj.x86_64.debug...done.
done.

warning: Ignoring non-absolute filename: 
Missing separate debuginfo for linux-vdso.so.1
Try: dnf --enablerepo='*debug*' install 
/usr/lib/debug/.build-id/06/44254f9cbaa826db070a796046026adba58266

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id 
gluster/nfs -p /var/run/gluster/n'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x7fe63bb5d7bb in FRAME_DESTROY (frame=0x7fe5ac096288)
at ../../../../libglusterfs/src/glusterfs/stack.h:193
193 FRAME_DESTROY(frame);
[Current thread is 1 (Thread 0x7fe617fff700 (LWP 61102))]
Missing separate debuginfos, use: dnf debuginfo-install 
glibc-2.28-42.el8.x86_64 keyutils-libs-1.5.10-6.el8.x86_64 
krb5-libs-1.16.1-22.el8.x86_64 libacl-2.2.53-1.el8.x86_64 
libattr-2.4.48-3.el8.x86_64 libcom_err-1.44.3-2.el8.x86_64 
libgcc-8.2.1-3.5.el8.x86_64 libselinux-2.8-6.el8.x86_64 
libtirpc-1.1.4-3.el8.x86_64 libuuid-2.32.1-8.el8.x86_64 
openssl-libs-1.1.1-8.el8.x86_64 pcre2-10.32-1.el8.x86_64 
zlib-1.2.11-10.el8.x86_64
(gdb) bt
#0  0x7fe63bb5d7bb in FRAME_DESTROY (frame=0x7fe5ac096288)
at ../../../../libglusterfs/src/glusterfs/stack.h:193
#1  STACK_DESTROY (stack=0x7fe5ac6d65f8)
at ../../../../libglusterfs/src/glusterfs/stack.h:193
#2  rda_fill_fd_cbk (frame=0x7fe5acf18eb8, cookie=,
this=0x7fe63c0162b0, op_ret=3, op_errno=0, entries=,
xdata=0x0) at readdir-ahead.c:623
#3  0x7fe63bd6c3aa in afr_readdir_cbk (frame=,
cookie=, this=, op_ret=,
op_errno=, 

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-15 Thread Erik Jacobson
After several successful runs of the test case, we thought the problem was
solved. Indeed, split-brain is gone.

But we're triggering a seg fault now, even in a less loaded case.

We're going to switch to gluster74, which was your intention, and report
back.

On Wed, Apr 15, 2020 at 10:33:01AM -0500, Erik Jacobson wrote:
> > Attached the wrong patch by mistake in my previous mail. Sending the correct
> > one now.
> 
> Early results look GREAT!!
> 
> We'll keep beating on it. We applied it to gluster72 as that is what we
> have to ship with. It applied fine with some line moves.
> 
> If you would like us to also run a test with gluster74 so that you can
> say that's tested, we can run that test. I can do a special build.
> 
> THANK YOU!!
> 
> > 
> > 
> > -Ravi
> > 
> > 
> > On 15/04/20 2:05 pm, Ravishankar N wrote:
> > 
> > 
> > On 10/04/20 2:06 am, Erik Jacobson wrote:
> > 
> > Once again thanks for sticking with us. Here is a reply from Scott
> > Titus. If you have something for us to try, we'd love it. The code 
> > had
> > your patch applied when gdb was run:
> > 
> > 
> > Here is the addr2line output for those addresses.  Very interesting
> > command, of
> > which I was not aware.
> > 
> > [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
> > cluster/
> > afr.so 0x6f735
> > afr_lookup_metadata_heal_check
> > afr-common.c:2803
> > [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
> > cluster/
> > afr.so 0x6f0b9
> > afr_lookup_done
> > afr-common.c:2455
> > [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
> > cluster/
> > afr.so 0x5c701
> > afr_inode_event_gen_reset
> > afr-common.c:755
> > 
> > 
> > Right, so afr_lookup_done() is resetting the event gen to zero. This 
> > looks
> > like a race between lookup and inode refresh code paths. We made some
> > changes to the event generation logic in AFR. Can you apply the attached
> > patch and see if it fixes the split-brain issue? It should apply 
> > cleanly on
> > glusterfs-7.4.
> > 
> > Thanks,
> > Ravi
> > 
> >
> > 
> > 
> > 
> > 
> > 
> 
> > >From 11601e709a97ce7c40078866bf5d24b486f39454 Mon Sep 17 00:00:00 2001
> > From: Ravishankar N 
> > Date: Wed, 15 Apr 2020 13:53:26 +0530
> > Subject: [PATCH] afr: event gen changes
> > 
> > The general idea of the changes is to prevent resetting event generation
> > to zero in the inode ctx, since event gen is something that should
> > follow 'causal order'.
> > 
> > Change #1:
> > For a read txn, in inode refresh cbk, if event_generation is
> > found zero, we are failing the read fop. This is not needed
> > because change in event gen is only a marker for the next inode refresh to
> > happen and should not be taken into account by the current read txn.
> > 
> > Change #2:
> > The event gen being zero above can happen if there is a racing lookup,
> > which resets event gen (in afr_lookup_done) if there are non-zero afr
> > xattrs. The resetting is done only to trigger an inode refresh and a
> > possible client-side heal on the next lookup. That can be achieved by
> > setting the need_refresh flag in the inode ctx. So replaced all
> > occurrences of resetting event gen to zero with a call to
> > afr_inode_need_refresh_set().
> > 
> > Change #3:
> > In both lookup and discover path, we are doing an inode refresh which is
> > not required since all 3 essentially do the same thing- update the inode
> > ctx with the good/bad copies from the brick replies. Inode refresh also
> > triggers background heals, but I think it is okay to do it when we call
> > refresh during the read and write txns and not in the lookup path.
> > 
> > Change-Id: Id0600dd34b144b4ae7a3bf3c397551adf7e402f1
> > Signed-off-by: Ravishankar N 
> > ---
> >  ...ismatch-resolution-with-fav-child-policy.t |  8 +-
> >  xlators/cluster/afr/src/afr-common.c  | 92 ---
> >  xlators/cluster/afr/src/afr-dir-write.c   |  6 +-
> >  xlators/cluster/afr/src/afr.h |  5 +-
> >  4 files changed, 29 insertions(+), 82 deletions(-)
> > 
> > diff --git 
> > a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t 
> > b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
> > index f4aa351e4..12af0c854 100644
> > --- a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
> > +++ b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
> > @@ -168,8 +168,8 @@ TEST [ "$gfid_1" != "$gfid_2" ]
> >  #We know that second brick has the 

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-15 Thread Erik Jacobson
> Attached the wrong patch by mistake in my previous mail. Sending the correct
> one now.

Early results look GREAT!!

We'll keep beating on it. We applied it to gluster72 as that is what we
have to ship with. It applied fine with some line moves.

If you would like us to also run a test with gluster74 so that you can
say that's tested, we can run that test. I can do a special build.

THANK YOU!!
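
For the record, the backport itself was nothing fancy, roughly the sketch
below (the patch file name is just a placeholder for the one attached to your
mail):

cd glusterfs-7.2
patch -p1 --dry-run < ../afr-event-gen-changes.patch   # check for rejects first
patch -p1 < ../afr-event-gen-changes.patch             # applied with some fuzz/line moves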

> 
> 
> -Ravi
> 
> 
> On 15/04/20 2:05 pm, Ravishankar N wrote:
> 
> 
> On 10/04/20 2:06 am, Erik Jacobson wrote:
> 
> Once again thanks for sticking with us. Here is a reply from Scott
> Titus. If you have something for us to try, we'd love it. The code had
> your patch applied when gdb was run:
> 
> 
> Here is the addr2line output for those addresses.  Very interesting
> command, of
> which I was not aware.
> 
> [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
> cluster/
> afr.so 0x6f735
> afr_lookup_metadata_heal_check
> afr-common.c:2803
> [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
> cluster/
> afr.so 0x6f0b9
> afr_lookup_done
> afr-common.c:2455
> [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
> cluster/
> afr.so 0x5c701
> afr_inode_event_gen_reset
> afr-common.c:755
> 
> 
> Right, so afr_lookup_done() is resetting the event gen to zero. This looks
> like a race between lookup and inode refresh code paths. We made some
> changes to the event generation logic in AFR. Can you apply the attached
> patch and see if it fixes the split-brain issue? It should apply cleanly 
> on
> glusterfs-7.4.
> 
> Thanks,
> Ravi
> 
>
> 
> 
> 
> 
> 

> >From 11601e709a97ce7c40078866bf5d24b486f39454 Mon Sep 17 00:00:00 2001
> From: Ravishankar N 
> Date: Wed, 15 Apr 2020 13:53:26 +0530
> Subject: [PATCH] afr: event gen changes
> 
> The general idea of the changes is to prevent resetting event generation
> to zero in the inode ctx, since event gen is something that should
> follow 'causal order'.
> 
> Change #1:
> For a read txn, in inode refresh cbk, if event_generation is
> found zero, we are failing the read fop. This is not needed
> because change in event gen is only a marker for the next inode refresh to
> happen and should not be taken into account by the current read txn.
> 
> Change #2:
> The event gen being zero above can happen if there is a racing lookup,
> which resets event gen (in afr_lookup_done) if there are non-zero afr
> xattrs. The resetting is done only to trigger an inode refresh and a
> possible client side heal on the next lookup. That can be achieved by
> setting the need_refresh flag in the inode ctx. So replaced all
> occurrences of resetting event gen to zero with a call to
> afr_inode_need_refresh_set().
> 
> Change #3:
> In both lookup and discover path, we are doing an inode refresh which is
> not required since all 3 essentially do the same thing- update the inode
> ctx with the good/bad copies from the brick replies. Inode refresh also
> triggers background heals, but I think it is okay to do it when we call
> refresh during the read and write txns and not in the lookup path.
> 
> Change-Id: Id0600dd34b144b4ae7a3bf3c397551adf7e402f1
> Signed-off-by: Ravishankar N 
> ---
>  ...ismatch-resolution-with-fav-child-policy.t |  8 +-
>  xlators/cluster/afr/src/afr-common.c  | 92 ---
>  xlators/cluster/afr/src/afr-dir-write.c   |  6 +-
>  xlators/cluster/afr/src/afr.h |  5 +-
>  4 files changed, 29 insertions(+), 82 deletions(-)
> 
> diff --git a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t 
> b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
> index f4aa351e4..12af0c854 100644
> --- a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
> +++ b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
> @@ -168,8 +168,8 @@ TEST [ "$gfid_1" != "$gfid_2" ]
>  #We know that second brick has the bigger size file
>  BIGGER_FILE_MD5=$(md5sum $B0/${V0}1/f3 | cut -d\  -f1)
>  
> -TEST ls $M0/f3
> -TEST cat $M0/f3
> +TEST ls $M0 #Trigger entry heal via readdir inode refresh
> +TEST cat $M0/f3 #Trigger data heal via readv inode refresh
>  EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0
>  
>  #gfid split-brain should be resolved
> @@ -215,8 +215,8 @@ TEST $CLI volume start $V0 force
>  EXPECT_WITHIN $PROCESS_UP_TIMEOUT "1" brick_up_status $V0 $H0 $B0/${V0}2
>  EXPECT_WITHIN $CHILD_UP_TIMEOUT "1" afr_child_up_status $V0 2
>  
> -TEST ls $M0/f4
> -TEST cat 

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-15 Thread Ravishankar N
Attached the wrong patch by mistake in my previous mail. Sending the 
correct one now.


-Ravi


On 15/04/20 2:05 pm, Ravishankar N wrote:


On 10/04/20 2:06 am, Erik Jacobson wrote:

Once again thanks for sticking with us. Here is a reply from Scott
Titus. If you have something for us to try, we'd love it. The code had
your patch applied when gdb was run:


Here is the addr2line output for those addresses.  Very interesting 
command, of

which I was not aware.

[root@leader3 ~]# addr2line -f 
-e/usr/lib64/glusterfs/7.2/xlator/cluster/

afr.so 0x6f735
afr_lookup_metadata_heal_check
afr-common.c:2803
[root@leader3 ~]# addr2line -f 
-e/usr/lib64/glusterfs/7.2/xlator/cluster/

afr.so 0x6f0b9
afr_lookup_done
afr-common.c:2455
[root@leader3 ~]# addr2line -f 
-e/usr/lib64/glusterfs/7.2/xlator/cluster/

afr.so 0x5c701
afr_inode_event_gen_reset
afr-common.c:755

Right, so afr_lookup_done() is resetting the event gen to zero. This 
looks like a race between lookup and inode refresh code paths. We made 
some changes to the event generation logic in AFR. Can you apply the 
attached patch and see if it fixes the split-brain issue? It should 
apply cleanly on glusterfs-7.4.
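
To make the idea concrete, here is a minimal, self-contained C sketch of
the race the patch avoids. The type and function names (inode_ctx_t,
lookup_done_old/new, read_txn) are invented for illustration; this is not
the actual AFR code, which also involves locking and per-subvolume state:

/* Toy model: zeroing event_gen vs. setting a need-refresh flag. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int event_gen;      /* 0 is treated as "no readable copy known" */
    bool need_refresh;  /* ask the next lookup to refresh instead */
} inode_ctx_t;

/* Old behaviour: a racing lookup wipes the event generation. */
static void lookup_done_old(inode_ctx_t *ctx) { ctx->event_gen = 0; }

/* Patched behaviour: keep the valid event generation, only flag a refresh. */
static void lookup_done_new(inode_ctx_t *ctx) { ctx->need_refresh = true; }

/* A read transaction that sees event_gen == 0 gives up with EIO, which is
 * what surfaced as the spurious split-brain error. */
static int read_txn(const inode_ctx_t *ctx)
{
    return (ctx->event_gen == 0) ? -EIO : 0;
}

int main(void)
{
    inode_ctx_t a = {.event_gen = 42};
    inode_ctx_t b = {.event_gen = 42};

    lookup_done_old(&a);   /* racing lookup, old code path */
    lookup_done_new(&b);   /* racing lookup, patched code path */

    printf("old: read_txn -> %d (-EIO)\n", read_txn(&a));
    printf("new: read_txn -> %d, need_refresh=%d\n",
           read_txn(&b), (int)b.need_refresh);
    return 0;
}

In the old scheme an in-flight read that races with the lookup fails with
EIO even though the surviving bricks are healthy; with the flag, the read
keeps using the known-good copies and only the next lookup pays for the
refresh.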


Thanks,
Ravi





>From 11601e709a97ce7c40078866bf5d24b486f39454 Mon Sep 17 00:00:00 2001
From: Ravishankar N 
Date: Wed, 15 Apr 2020 13:53:26 +0530
Subject: [PATCH] afr: event gen changes

The general idea of the changes is to prevent resetting event generation
to zero in the inode ctx, since event gen is something that should
follow 'causal order'.

Change #1:
For a read txn, in inode refresh cbk, if event_generation is
found zero, we are failing the read fop. This is not needed
because change in event gen is only a marker for the next inode refresh to
happen and should not be taken into account by the current read txn.

Change #2:
The event gen being zero above can happen if there is a racing lookup,
which resets event gen (in afr_lookup_done) if there are non-zero afr
xattrs. The resetting is done only to trigger an inode refresh and a
possible client side heal on the next lookup. That can be achieved by
setting the need_refresh flag in the inode ctx. So replaced all
occurrences of resetting event gen to zero with a call to
afr_inode_need_refresh_set().

Change #3:
In both lookup and discover path, we are doing an inode refresh which is
not required since all 3 essentially do the same thing- update the inode
ctx with the good/bad copies from the brick replies. Inode refresh also
triggers background heals, but I think it is okay to do it when we call
refresh during the read and write txns and not in the lookup path.

Change-Id: Id0600dd34b144b4ae7a3bf3c397551adf7e402f1
Signed-off-by: Ravishankar N 
---
 ...ismatch-resolution-with-fav-child-policy.t |  8 +-
 xlators/cluster/afr/src/afr-common.c  | 92 ---
 xlators/cluster/afr/src/afr-dir-write.c   |  6 +-
 xlators/cluster/afr/src/afr.h |  5 +-
 4 files changed, 29 insertions(+), 82 deletions(-)

diff --git a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
index f4aa351e4..12af0c854 100644
--- a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
+++ b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
@@ -168,8 +168,8 @@ TEST [ "$gfid_1" != "$gfid_2" ]
 #We know that second brick has the bigger size file
 BIGGER_FILE_MD5=$(md5sum $B0/${V0}1/f3 | cut -d\  -f1)
 
-TEST ls $M0/f3
-TEST cat $M0/f3
+TEST ls $M0 #Trigger entry heal via readdir inode refresh
+TEST cat $M0/f3 #Trigger data heal via readv inode refresh
 EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0
 
 #gfid split-brain should be resolved
@@ -215,8 +215,8 @@ TEST $CLI volume start $V0 force
 EXPECT_WITHIN $PROCESS_UP_TIMEOUT "1" brick_up_status $V0 $H0 $B0/${V0}2
 EXPECT_WITHIN $CHILD_UP_TIMEOUT "1" afr_child_up_status $V0 2
 
-TEST ls $M0/f4
-TEST cat $M0/f4
+TEST ls $M0 #Trigger entry heal via readdir inode refresh
+TEST cat $M0/f4  #Trigger data heal via readv inode refresh
 EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0
 
 #gfid split-brain should be resolved
diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
index 61f21795e..319665a14 100644
--- a/xlators/cluster/afr/src/afr-common.c
+++ b/xlators/cluster/afr/src/afr-common.c
@@ -282,7 +282,7 @@ __afr_set_in_flight_sb_status(xlator_t *this, afr_local_t *local,
 metadatamap |= (1 << index);
 }
 if (metadatamap_old != metadatamap) {
-event = 0;
+__afr_inode_need_refresh_set(inode, this);
 }
 break;
 
@@ -295,7 +295,7 @@ 

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-15 Thread Ravishankar N


On 10/04/20 2:06 am, Erik Jacobson wrote:

Once again thanks for sticking with us. Here is a reply from Scott
Titus. If you have something for us to try, we'd love it. The code had
your patch applied when gdb was run:


Here is the addr2line output for those addresses.  Very interesting command, of
which I was not aware.

[root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/cluster/
afr.so 0x6f735
afr_lookup_metadata_heal_check
afr-common.c:2803
[root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/cluster/
afr.so 0x6f0b9
afr_lookup_done
afr-common.c:2455
[root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/cluster/
afr.so 0x5c701
afr_inode_event_gen_reset
afr-common.c:755

Right, so afr_lookup_done() is resetting the event gen to zero. This 
looks like a race between lookup and inode refresh code paths. We made 
some changes to the event generation logic in AFR. Can you apply the 
attached patch and see if it fixes the split-brain issue? It should 
apply cleanly on glusterfs-7.4.


Thanks,
Ravi
>From 4389908252c886c22897d8c52c0ce027a511453f Mon Sep 17 00:00:00 2001
From: Ravishankar N 
Date: Mon, 24 Dec 2018 13:00:19 +0530
Subject: [PATCH] afr: mark pending xattrs as a part of metadata heal

...if pending xattrs are zero for all children.

Problem:
If there are no pending xattrs and a metadata heal needs to be
performed, it is possible that we end up with xattrs inadvertently
deleted from all bricks, as explained in the BZ.

Fix:
After picking one among the sources as the good copy, mark pending xattrs on
all sources to blame the sinks. Now even if this metadata heal fails midway,
a subsequent heal will still choose one of the valid sources that it
picked previously.

Fixes: #1067
Change-Id: If1b050b70b0ad911e162c04db4d89b263e2b8d7b
Signed-off-by: Ravishankar N 
---
 .../cluster/afr/src/afr-self-heal-metadata.c  | 62 ++-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/xlators/cluster/afr/src/afr-self-heal-metadata.c b/xlators/cluster/afr/src/afr-self-heal-metadata.c
index f4e31b65b..03f43bad1 100644
--- a/xlators/cluster/afr/src/afr-self-heal-metadata.c
+++ b/xlators/cluster/afr/src/afr-self-heal-metadata.c
@@ -190,6 +190,59 @@ out:
 return ret;
 }
 
+static int
+__afr_selfheal_metadata_mark_pending_xattrs(call_frame_t *frame, xlator_t *this,
+inode_t *inode,
+struct afr_reply *replies,
+unsigned char *sources)
+{
+int ret = 0;
+int i = 0;
+int m_idx = 0;
+afr_private_t *priv = NULL;
+int raw[AFR_NUM_CHANGE_LOGS] = {0};
+dict_t *xattr = NULL;
+
+priv = this->private;
+m_idx = afr_index_for_transaction_type(AFR_METADATA_TRANSACTION);
+raw[m_idx] = 1;
+
+xattr = dict_new();
+if (!xattr)
+return -ENOMEM;
+
+for (i = 0; i < priv->child_count; i++) {
+if (sources[i])
+continue;
+ret = dict_set_static_bin(xattr, priv->pending_key[i], raw,
+  sizeof(int) * AFR_NUM_CHANGE_LOGS);
+if (ret) {
+ret = -1;
+goto out;
+}
+}
+
+for (i = 0; i < priv->child_count; i++) {
+if (!sources[i])
+continue;
+ret = afr_selfheal_post_op(frame, this, inode, i, xattr, NULL);
+if (ret < 0) {
+gf_msg(this->name, GF_LOG_INFO, -ret, AFR_MSG_SELF_HEAL_INFO,
+   "Failed to set pending metadata xattr on child %d for %s", i,
+   uuid_utoa(inode->gfid));
+goto out;
+}
+}
+
+afr_replies_wipe(replies, priv->child_count);
+ret = afr_selfheal_unlocked_discover(frame, inode, inode->gfid, replies);
+
+out:
+if (xattr)
+dict_unref(xattr);
+return ret;
+}
+
 /*
  * Look for mismatching uid/gid or mode or user xattrs even if
  * AFR xattrs don't say so, and pick one arbitrarily as winner. */
@@ -210,6 +263,7 @@ __afr_selfheal_metadata_finalize_source(call_frame_t *frame, xlator_t *this,
 };
 int source = -1;
 int sources_count = 0;
+int ret = 0;
 
 priv = this->private;
 
@@ -300,7 +354,13 @@ __afr_selfheal_metadata_finalize_source(call_frame_t *frame, xlator_t *this,
 healed_sinks[i] = 1;
 }
 }
-
+if ((sources_count == priv->child_count) && (source > -1) &&
+(AFR_COUNT(healed_sinks, priv->child_count) != 0)) {
+ret = __afr_selfheal_metadata_mark_pending_xattrs(frame, this, inode,
+  replies, sources);
+if (ret < 0)
+return ret;
+}
 out:
 afr_mark_active_sinks(this, sources, locked_on, healed_sinks);
 return source;
-- 
2.25.1






Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-09 Thread Erik Jacobson
Once again thanks for sticking with us. Here is a reply from Scott
Titus. If you have something for us to try, we'd love it. The code had
your patch applied when gdb was run:


Here is the addr2line output for those addresses.  Very interesting command, of
which I was not aware.

[root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/
afr.so 0x6f735
afr_lookup_metadata_heal_check
afr-common.c:2803
[root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/
afr.so 0x6f0b9
afr_lookup_done
afr-common.c:2455
[root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/
afr.so 0x5c701
afr_inode_event_gen_reset
afr-common.c:755

Thanks
-Scott


On Thu, Apr 09, 2020 at 11:38:04AM +0530, Ravishankar N wrote:
> 
> On 08/04/20 9:55 pm, Erik Jacobson wrote:
> > 9439138:[2020-04-08 15:48:44.737590] E 
> > [afr-common.c:754:afr_inode_event_gen_reset]
> > (-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) 
> > [0x7fa4fb1cb735]
> > -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) 
> > [0x7fa4fb1cb0b9]
> > -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) 
> > [0x7fa4fb1b8701] )
> > 0-cm_shared-replicate-0: Resetting event gen for 
> > f2d7abf0-5444-48d6-863d-4b128502daf9
> > 
> Could you print the function/line no. of each of these 3 functions in the
> backtrace and see who calls afr_inode_event_gen_reset? `addr2line` should
> give you that info:
>  addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so
> 0x6f735
>  addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so
> 0x6f0b9
>  addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so
> 0x5c701
> 
> 
> I think it is likely called from afr_lookup_done, which I don't think is
> necessary. I will send a patch for review. Once reviews are over, I will
> share it with you and if it fixes the issue in your testing, we can merge it
> with confidence.
> 
> Thanks,
> Ravi







Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-09 Thread Ravishankar N



On 08/04/20 9:55 pm, Erik Jacobson wrote:

9439138:[2020-04-08 15:48:44.737590] E 
[afr-common.c:754:afr_inode_event_gen_reset]
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) 
[0x7fa4fb1cb735]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) 
[0x7fa4fb1cb0b9]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) 
[0x7fa4fb1b8701] )
0-cm_shared-replicate-0: Resetting event gen for 
f2d7abf0-5444-48d6-863d-4b128502daf9

Could you print the function/line no. of each of these 3 functions in 
the backtrace and see who calls afr_inode_event_gen_reset? `addr2line` 
should give you that info:
 addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so 
0x6f735
 addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so 
0x6f0b9
 addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so 
0x5c701



I think it is likely called from afr_lookup_done, which I don't think is 
necessary. I will send a patch for review. Once reviews are over, I will 
share it with you and if it fixes the issue in your testing, we can 
merge it with confidence.


Thanks,
Ravi







Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-08 Thread Erik Jacobson
Thank you again for the help so far. Here is what Scott Titus came up
with. Let us know if you have suggestions for next steps.



We never hit the "Event gen is zero" message, so it appears that 
afr_access() never has a zero event_gen to begin with.

However, the "Resetting event gen" message was just a bit chatty, 
growing our nfs.log to >2.4GB.  Many were against a gfid populated
with zeros.

Around each split brain log, we did find the "Resetting event gen" 
messages containing a matching gfid:

9439138:[2020-04-08 15:48:44.737590] E 
[afr-common.c:754:afr_inode_event_gen_reset] 
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) 
[0x7fa4fb1cb735] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) 
[0x7fa4fb1cb0b9] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) 
[0x7fa4fb1b8701] ) 0-cm_shared-replicate-0: Resetting event gen for 
f2d7abf0-5444-48d6-863d-4b128502daf9
9439139:[2020-04-08 15:48:44.737636] E 
[afr-common.c:754:afr_inode_event_gen_reset] 
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) 
[0x7fa4fb1cb735] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) 
[0x7fa4fb1cb0b9] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) 
[0x7fa4fb1b8701] ) 0-cm_shared-replicate-0: Resetting event gen for 
f2d7abf0-5444-48d6-863d-4b128502daf9
9439140:[2020-04-08 15:48:44.737663] E [MSGID: 108008] 
[afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: 
Failing ACCESS on gfid f2d7abf0-5444-48d6-863d-4b128502daf9: split-brain 
observed. [Input/output error]
9439143:[2020-04-08 15:48:44.737801] E 
[afr-common.c:754:afr_inode_event_gen_reset] 
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) 
[0x7fa4fb1cb735] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) 
[0x7fa4fb1cb0b9] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) 
[0x7fa4fb1b8701] ) 0-cm_shared-replicate-0: Resetting event gen for 
f2d7abf0-5444-48d6-863d-4b128502daf9
9439145:[2020-04-08 15:48:44.737861] E 
[afr-common.c:754:afr_inode_event_gen_reset] 
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) 
[0x7fa4fb1cb735] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) 
[0x7fa4fb1cb0b9] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) 
[0x7fa4fb1b8701] ) 0-cm_shared-replicate-0: Resetting event gen for 
f2d7abf0-5444-48d6-863d-4b128502daf9
9439148:[2020-04-08 15:48:44.738125] E 
[afr-common.c:754:afr_inode_event_gen_reset] 
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) 
[0x7fa4fb1cb735] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) 
[0x7fa4fb1cb0b9] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) 
[0x7fa4fb1b8701] ) 0-cm_shared-replicate-0: Resetting event gen for 
f2d7abf0-5444-48d6-863d-4b128502daf9
9439225:[2020-04-08 15:48:44.749920] E 
[afr-common.c:754:afr_inode_event_gen_reset] 
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) 
[0x7fa4fb1cb735] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) 
[0x7fa4fb1cb0b9] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) 
[0x7fa4fb1b8701] ) 0-cm_shared-replicate-0: Resetting event gen for 
f2d7abf0-5444-48d6-863d-4b128502daf9

Thanks,
-Scott

On 4/8/20 8:31 AM, Erik Jacobson wrote:
> Hi team -
>
> We got an update to try more stuff from the community.
>
> I feel like I've been "given an inch but am taking a mile" but if we
> do happen to have time on orbit41 again, we'll do the next round of
> debugging.
>
> Erik


On Wed, Apr 08, 2020 at 01:53:00PM +0530, Ravishankar N wrote:
> On 08/04/20 4:59 am, Erik Jacobson wrote:
> > Apologies for misinterpreting the backtrace.
> > 
> > #0  afr_read_txn_refresh_done (frame=0x7ffcf4146478,
> > this=0x7fff64013720, err=5) at afr-read-txn.c:312
> > #1  0x7fff68938d2b in afr_txn_refresh_done
> > (frame=frame@entry=0x7ffcf4146478, this=this@entry=0x7fff64013720,
> > err=5, err@entry=0)
> >       at afr-common.c:1222
> Sorry, I missed this too.
> > (gdb) print event_generation
> > $3 = 0
> > 
> > (gdb) print priv->fav_child_policy
> > $4 = AFR_FAV_CHILD_NONE
> > 
> > I am not sure what this signifies though.  It appears to be a read
> > transaction with no event generation and no favorite child policy.
> > 
> > Feel free to ask for clarification in case my thought process went awry
> > somewhere.
> 
> Favorite child policy is only for automatically resolving split-brains and
> is 0 unless that volume option is set. The problem is indeed that
> event_generation is zero. Could you try to apply this logging patch and see
> if afr_inode_event_gen_reset() for that gfid is hit or if afr_access() has a
> zero event_gen to begin with?
> 
> Thanks,
> 
> Ravi
> 
> 
> 
> 
> 

> diff --git a/xlators/cluster/afr/src/afr-common.c 
> b/xlators/cluster/afr/src/afr-common.c
> index 4bfaef9e8..61f21795e 100644
> --- 

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-08 Thread Ravishankar N

On 08/04/20 4:59 am, Erik Jacobson wrote:

Apologies for misinterpreting the backtrace.

#0  afr_read_txn_refresh_done (frame=0x7ffcf4146478,
this=0x7fff64013720, err=5) at afr-read-txn.c:312
#1  0x7fff68938d2b in afr_txn_refresh_done
(frame=frame@entry=0x7ffcf4146478, this=this@entry=0x7fff64013720,
err=5, err@entry=0)
      at afr-common.c:1222

Sorry, I missed this too.

(gdb) print event_generation
$3 = 0

(gdb) print priv->fav_child_policy
$4 = AFR_FAV_CHILD_NONE

I am not sure what this signifies though.  It appears to be a read
transaction with no event generation and no favorite child policy.

Feel free to ask for clarification in case my thought process went awry
somewhere.


Favorite child policy is only for automatically resolving split-brains 
and is 0 unless that volume option is set. The problem is indeed that 
event_generation is zero. Could you try to apply this logging patch and 
see if afr_inode_event_gen_reset() for that gfid is hit or if 
afr_access() has a zero event_gen to begin with?


Thanks,

Ravi





diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
index 4bfaef9e8..61f21795e 100644
--- a/xlators/cluster/afr/src/afr-common.c
+++ b/xlators/cluster/afr/src/afr-common.c
@@ -750,6 +750,8 @@ afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
 
 GF_VALIDATE_OR_GOTO(this->name, inode, out);
 
+gf_msg_callingfn(this->name, GF_LOG_ERROR, 0, AFR_MSG_SPLIT_BRAIN,
+ "Resetting event gen for %s", uuid_utoa(inode->gfid));
LOCK(&inode->lock);
 {
 ret = __afr_inode_event_gen_reset(inode, this);
diff --git a/xlators/cluster/afr/src/afr-inode-read.c b/xlators/cluster/afr/src/afr-inode-read.c
index 9204add5b..5ac83d6c8 100644
--- a/xlators/cluster/afr/src/afr-inode-read.c
+++ b/xlators/cluster/afr/src/afr-inode-read.c
@@ -172,6 +172,12 @@ afr_access(call_frame_t *frame, xlator_t *this, loc_t *loc, int mask,
 if (xdata)
 local->xdata_req = dict_ref(xdata);
 
+if (local->event_generation == 0)
+gf_msg(this->name, GF_LOG_ERROR, 0, AFR_MSG_SPLIT_BRAIN,
+   "Event gen is zero for %s(%s)", local->loc.name,
+   local->loc.inode->gfid ? uuid_utoa(local->loc.inode->gfid)
+  : "NULL");
+
 afr_read_txn(frame, this, loc->inode, afr_access_wind,
  AFR_METADATA_TRANSACTION);
 






Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-07 Thread Erik Jacobson
My co-worker prefers I keep driving the discussion since this isn't his
normal area. But he's far better at digging in to these low level calls
than I am. So I am passing along his analysis so far. We are wondering
if we have enough information yet to turn on any light bulbs in terms of
volume settings, system settings, or a code change... Or a suggested
path for further debug.

Recall from earlier in the thread, this is a 3-way replicated single-subvolume
gluster filesystem that gets split-brain errors under heavy gnfs load when
one of the three servers has gone down, representing a customer-reported
problem.

Scott's analysis is below. Next steps truly appreciated !!



Apologies for misinterpreting the backtrace.

#0  afr_read_txn_refresh_done (frame=0x7ffcf4146478, 
this=0x7fff64013720, err=5) at afr-read-txn.c:312
#1  0x7fff68938d2b in afr_txn_refresh_done 
(frame=frame@entry=0x7ffcf4146478, this=this@entry=0x7fff64013720, 
err=5, err@entry=0)
     at afr-common.c:1222
#2  0x7fff68939003 in afr_inode_refresh_done 
(frame=frame@entry=0x7ffcf4146478, this=this@entry=0x7fff64013720, error=0)
     at afr-common.c:1294

instead of the #1/#2 above calling the functions afr_txn_refresh_done 
and afr_inode_refresh_done respectively, they are calling a function 
within afr_txn_refresh_done and afr_inode_refresh_done respectively.

So, afr_txn_refresh_done (frame=frame@entry=0x7ffcf4146478, 
this=this@entry=0x7fff64013720, err=5, err@entry=0)at 
afr-common.c:1222calls a function at line number 1222 in aft-common.c 
within the function afr_txn_refresh_done:

1163: int
1164: afr_txn_refresh_done(call_frame_t *frame, xlator_t *this, int err)
1165: {
1166:     call_frame_t *heal_frame = NULL;
1167:     afr_local_t *heal_local = NULL;
1168:     afr_local_t *local = NULL;
1169:     afr_private_t *priv = NULL;
1170:     inode_t *inode = NULL;
1171:     int event_generation = 0;
1172:     int read_subvol = -1;
1173:     int ret = 0;
1174:
1175:     local = frame->local;
1176:     inode = local->inode;
1177:     priv = this->private;
1178:
1179:     if (err)
1180:     goto refresh_done;
1181:
1182:     if (local->op == GF_FOP_LOOKUP)
1183:     goto refresh_done;
1184:
1185:     ret = afr_inode_get_readable(frame, inode, this, local->readable,
1186: &event_generation, local->transaction.type);
1187:
1188:     if (ret == -EIO || (local->is_read_txn && !event_generation)) {
1189:     /* No readable subvolume even after refresh ==> splitbrain.*/
1190:     if (!priv->fav_child_policy) {
1191:         err = EIO;
1192:         goto refresh_done;
1193:     }
1194:     read_subvol = afr_sh_get_fav_by_policy(this, 
local->replies, inode,
1195: NULL);
1196:     if (read_subvol == -1) {
1197:     err = EIO;
1198:     goto refresh_done;
1199:     }
1200:
1201:     heal_frame = afr_frame_create(this, NULL);
1202:     if (!heal_frame) {
1203:     err = EIO;
1204:     goto refresh_done;
1205:     }
1206:     heal_local = heal_frame->local;
1207:     heal_local->xdata_req = dict_new();
1208:     if (!heal_local->xdata_req) {
1209:     err = EIO;
1210: AFR_STACK_DESTROY(heal_frame);
1211:     goto refresh_done;
1212:     }
1213:     heal_local->heal_frame = frame;
1214:     ret = synctask_new(this->ctx->env, 
afr_fav_child_reset_sink_xattrs,
1215: afr_fav_child_reset_sink_xattrs_cbk, heal_frame,
1216: heal_frame);
1217:     return 0;
1218:    }
1219:
1220: refresh_done:
1221:     afr_local_replies_wipe(local, this->private);
1222:     local->refreshfn(frame, this, err);
1223:
1224:     return 0;
1225: }

So, backtrace #1 represents the following function call 
local->refreshfn(frame=frame@entry=0x7ffcf4146478, 
this=this@entry=0x7fff64013720, err=5, err@entry=0)
This is the 1st example of EIO being set.

Setting a breakpoint at line 1190 (if (!priv->fav_child_policy) {)
reveals that ret is not set, but local->is_read_txn is set and
event_generation is zero (xlators/cluster/afr/src/afr.h:108), so the 
conditional at 1188 is true.  Furthermore, priv->fav_child_policy is set 
to AFR_FAV_CHILD_NONE which is zero, so we found where the error is set 
to EIO, line 1191.

The following is the gdb output:

(gdb) print ret
$1 = 0
(gdb) print local->is_read_txn
$2 = true
(gdb) print event_generation
$3 = 0

(gdb) print priv->fav_child_policy
$4 = AFR_FAV_CHILD_NONE

I am not sure what this signifies though.  It appears to be a read 
transaction with no event generation and no favorite child policy.

Feel free to ask for clarification in case my thought process went awry 
somewhere.

Thanks,
-Scott



On Thu, Apr 02, 2020 at 02:02:46AM -0500, Erik Jacobson wrote:
> > Hmm, afr_inode_refresh_done() is called with error=0 and by the time we
> > reach afr_txn_refresh_done(), it becomes 5(i.e. EIO).
> > So afr_inode_refresh_done() is changing it to 5. Maybe you can put
> > breakpoints/ log messages in 

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-05 Thread Ravishankar N
afr_final_errno() is called from many other places other than the inode 
refresh code path, so the 2 (ENOENT) could be from one of those (mostly 
afr_lookup_done) but it is puzzling that you are not seeing EIO even 
once when it is called from afr_inode_refresh_subvol_cbk() code path. 
Not sure what is happening here.


On 06/04/20 6:52 am, Erik Jacobson wrote:

During the problem case, near as I can tell, afr_final_errno(),
in the loop where tmp_errno = local->replies[i].op_errno is set,
the errno is always "2" when it gets to that point on server 3 (where
the NFS load is).

I never see a value other than 2.

I later simply put the print at the end of the function too, to double
verify non-zero exit codes. There are thousands of non-zero return
codes, all 2 when not zero. Here is an example flow right before a
split-brain. I do not wish to imply the split-brain is related, it's
just an example log snip:


[2020-04-06 00:54:21.125373] E [MSGID: 0] [afr-common.c:2546:afr_final_errno] 
0-erikj-afr_final_errno: erikj dbg afr_final_errno() errno from loop before 
afr_higher_errno was: 2
[2020-04-06 00:54:21.125374] E [MSGID: 0] [afr-common.c:2551:afr_final_errno] 
0-erikj-afr_final_errno: erikj dbg returning non-zero: 2
[2020-04-06 00:54:23.315397] E [MSGID: 0] 
[afr-read-txn.c:283:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: erikj dbg 
crapola 1st if in afr_read_txn_refresh_done() !priv->thin_arbiter_count -- goto 
to readfn
[2020-04-06 00:54:23.315432] E [MSGID: 108008] 
[afr-read-txn.c:314:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing 
READLINK on gfid 57f269ef-919d-40ec-b7fc-a7906fee648b: split-brain observed. 
[Input/output error]
[2020-04-06 00:54:23.315450] W [MSGID: 112199] 
[nfs3-helpers.c:3327:nfs3_log_readlink_res] 0-nfs-nfsv3: 
/image/images_ro_nfs/rhel8.0/usr/lib64/libmlx5.so.1 => (XID: 1fdba2bc, 
READLINK: NFS: 5(I/O error), POSIX: 5(Input/output error)) target: (null)


I am missing something. I will see if Scott and I can work together
tomorrow. Happy for any more ideas, Thank you!!


On Sun, Apr 05, 2020 at 06:49:56PM -0500, Erik Jacobson wrote:

First, it's possible our analysis is off somewhere. I never get to your
print message. I put a debug statement at the start of the function so I
know we get there (just to verify my print statements were taking
affect).

I put a print statement for the if (call_count == 0) { call there, right
after the if. I ran some tests.

I suspect that isn't a problem area. There were some interesting results
with an NFS stale file handle error going through that path. Otherwise
it's always errno=0 even in the heavy test case. I'm not concerned about
a stale NFS file handle this moment. That print was also hit heavily when
one server was down (which surprised me but I don't know the internals).

I'm trying to re-read and work through Scott's message to see if any
other print statements might be helpful.

Thank you for your help so far. I will reply back if I find something.
Otherwise suggestions welcome!

The MFG system I can access got smaller this weekend but is still large
enough to reproduce the error.

As you can tell, I work mostly at a level well above filesystem code so
thank you for staying with me as I struggle through this.

Erik


After we hear from all children, afr_inode_refresh_subvol_cbk() then calls 
afr_inode_refresh_done()-->afr_txn_refresh_done()-->afr_read_txn_refresh_done().
But you already know this flow now.
diff --git a/xlators/cluster/afr/src/afr-common.c 
b/xlators/cluster/afr/src/afr-common.c
index 4bfaef9e8..096ce06f0 100644
--- a/xlators/cluster/afr/src/afr-common.c
+++ b/xlators/cluster/afr/src/afr-common.c
@@ -1318,6 +1318,12 @@ afr_inode_refresh_subvol_cbk(call_frame_t *frame, void 
*cookie, xlator_t *this,
  if (xdata)
  local->replies[call_child].xdata = dict_ref(xdata);
  }
+if (op_ret == -1)
+gf_msg_callingfn(
+this->name, GF_LOG_ERROR, op_errno, AFR_MSG_SPLIT_BRAIN,
+"Inode refresh on child:%d failed with errno:%d for %s(%s) ",
+call_child, op_errno, local->loc.name,
+uuid_utoa(local->loc.inode->gfid));
  if (xdata) {
  ret = dict_get_int8(xdata, "link-count", &need_heal);
  local->replies[call_child].need_heal = need_heal;










Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-05 Thread Erik Jacobson
During the problem case, near as I can tell, afr_final_errno(),
in the loop where tmp_errno = local->replies[i].op_errno is set,
the errno is always "2" when it gets to that point on server 3 (where
the NFS load is).

I never see a value other than 2.

I later simply put the print at the end of the function too, to double
verify non-zero exit codes. There are thousands of non-zero return
codes, all 2 when not zero. Here is an example flow right before a
split-brain. I do not wish to imply the split-brain is related, it's
just an example log snip:


[2020-04-06 00:54:21.125373] E [MSGID: 0] [afr-common.c:2546:afr_final_errno] 
0-erikj-afr_final_errno: erikj dbg afr_final_errno() errno from loop before 
afr_higher_errno was: 2
[2020-04-06 00:54:21.125374] E [MSGID: 0] [afr-common.c:2551:afr_final_errno] 
0-erikj-afr_final_errno: erikj dbg returning non-zero: 2
[2020-04-06 00:54:23.315397] E [MSGID: 0] 
[afr-read-txn.c:283:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: erikj 
dbg crapola 1st if in afr_read_txn_refresh_done() !priv->thin_arbiter_count -- 
goto to readfn
[2020-04-06 00:54:23.315432] E [MSGID: 108008] 
[afr-read-txn.c:314:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing 
READLINK on gfid 57f269ef-919d-40ec-b7fc-a7906fee648b: split-brain observed. 
[Input/output error]
[2020-04-06 00:54:23.315450] W [MSGID: 112199] 
[nfs3-helpers.c:3327:nfs3_log_readlink_res] 0-nfs-nfsv3: 
/image/images_ro_nfs/rhel8.0/usr/lib64/libmlx5.so.1 => (XID: 1fdba2bc, 
READLINK: NFS: 5(I/O error), POSIX: 5(Input/output error)) target: (null)


I am missing something. I will see if Scott and I can work together
tomorrow. Happy for any more ideas, Thank you!!


On Sun, Apr 05, 2020 at 06:49:56PM -0500, Erik Jacobson wrote:
> First, it's possible our analysis is off somewhere. I never get to your
> print message. I put a debug statement at the start of the function so I
> know we get there (just to verify my print statements were taking
> affect).
> 
> I put a print statement for the if (call_count == 0) { call there, right
> after the if. I ran some tests.
> 
> I suspect that isn't a problem area. There were some interesting results
> with an NFS stale file handle error going through that path. Otherwise
> it's always errno=0 even in the heavy test case. I'm not concerned about
> a stale NFS file handle this moment. That print was also hit heavily when
> one server was down (which surprised me but I don't know the internals).
> 
> I'm trying to re-read and work through Scott's message to see if any
> other print statements might be helpful.
> 
> Thank you for your help so far. I will reply back if I find something.
> Otherwise suggestions welcome!
> 
> The MFG system I can access got smaller this weekend but is still large
> enough to reproduce the error.
> 
> As you can tell, I work mostly at a level well above filesystem code so
> thank you for staying with me as I struggle through this.
> 
> Erik
> 
> > After we hear from all children, afr_inode_refresh_subvol_cbk() then calls 
> > afr_inode_refresh_done()-->afr_txn_refresh_done()-->afr_read_txn_refresh_done().
> > But you already know this flow now.
> 
> > diff --git a/xlators/cluster/afr/src/afr-common.c 
> > b/xlators/cluster/afr/src/afr-common.c
> > index 4bfaef9e8..096ce06f0 100644
> > --- a/xlators/cluster/afr/src/afr-common.c
> > +++ b/xlators/cluster/afr/src/afr-common.c
> > @@ -1318,6 +1318,12 @@ afr_inode_refresh_subvol_cbk(call_frame_t *frame, 
> > void *cookie, xlator_t *this,
> >  if (xdata)
> >  local->replies[call_child].xdata = dict_ref(xdata);
> >  }
> > +if (op_ret == -1)
> > +gf_msg_callingfn(
> > +this->name, GF_LOG_ERROR, op_errno, AFR_MSG_SPLIT_BRAIN,
> > +"Inode refresh on child:%d failed with errno:%d for %s(%s) ",
> > +call_child, op_errno, local->loc.name,
> > +uuid_utoa(local->loc.inode->gfid));
> >  if (xdata) {
> >  ret = dict_get_int8(xdata, "link-count", &need_heal);
> >  local->replies[call_child].need_heal = need_heal;








Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-05 Thread Erik Jacobson
First, it's possible our analysis is off somewhere. I never get to your
print message. I put a debug statement at the start of the function so I
know we get there (just to verify my print statements were taking
affect).

I put a print statement for the if (call_count == 0) { call there, right
after the if. I ran some tests.

I suspect that isn't a problem area. There were some interesting results
with an NFS stale file handle error going through that path. Otherwise
it's always errno=0 even in the heavy test case. I'm not concerned about
a stale NFS file handle this moment. That print was also hit heavily when
one server was down (which surprised me but I don't know the internals).

I'm trying to re-read and work through Scott's message to see if any
other print statements might be helpful.

Thank you for your help so far. I will reply back if I find something.
Otherwise suggestions welcome!

The MFG system I can access got smaller this weekend but is still large
enough to reproduce the error.

As you can tell, I work mostly at a level well above filesystem code so
thank you for staying with me as I struggle through this.

Erik

> After we hear from all children, afr_inode_refresh_subvol_cbk() then calls 
> afr_inode_refresh_done()-->afr_txn_refresh_done()-->afr_read_txn_refresh_done().
> But you already know this flow now.

> diff --git a/xlators/cluster/afr/src/afr-common.c 
> b/xlators/cluster/afr/src/afr-common.c
> index 4bfaef9e8..096ce06f0 100644
> --- a/xlators/cluster/afr/src/afr-common.c
> +++ b/xlators/cluster/afr/src/afr-common.c
> @@ -1318,6 +1318,12 @@ afr_inode_refresh_subvol_cbk(call_frame_t *frame, void 
> *cookie, xlator_t *this,
>  if (xdata)
>  local->replies[call_child].xdata = dict_ref(xdata);
>  }
> +if (op_ret == -1)
> +gf_msg_callingfn(
> +this->name, GF_LOG_ERROR, op_errno, AFR_MSG_SPLIT_BRAIN,
> +"Inode refresh on child:%d failed with errno:%d for %s(%s) ",
> +call_child, op_errno, local->loc.name,
> +uuid_utoa(local->loc.inode->gfid));
>  if (xdata) {
>  ret = dict_get_int8(xdata, "link-count", &need_heal);
>  local->replies[call_child].need_heal = need_heal;







Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-05 Thread Ravishankar N

On 04/04/20 9:12 pm, Erik Jacobson wrote:

This leaves us with afr_final_errno() returning the error.

afr_final_errno() iterates through the 'children', looking for
valid errors within the replies for the transaction (refresh transaction?).
The function returns the highest valued error, which must be EIO (value of 5)
in this case.

I have not looked into how or what would set the error value in the
replies array,


The errror numbers that you see in the replies array in 
afr_final_errno() are set in afr_inode_refresh_subvol_cbk().


During inode refresh (which is essentially a lookup), AFR sends the the 
lookup request on all its connected children and the replies from each 
one of them are captured in afr_inode_refresh_subvol_cbk(). So adding a 
log here can identify if we got EIO from any of its children. See 
attached patch for an example.


After we hear from all children, afr_inode_refresh_subvol_cbk() then 
calls 
afr_inode_refresh_done()-->afr_txn_refresh_done()-->afr_read_txn_refresh_done(). 
But you already know this flow now.
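
To illustrate the shape of that flow, here is a minimal C sketch of the
fan-out/aggregate pattern (the names refresh_subvol_cbk/refresh_done and
the lock-free call_count handling are simplifications for illustration
only, not the real afr_inode_refresh_subvol_cbk()/afr_inode_refresh_done()):

#include <stdio.h>

#define CHILD_COUNT 3

typedef struct {
    int op_errno[CHILD_COUNT]; /* one reply slot per child, 0 == success */
    int call_count;            /* replies still outstanding */
} refresh_local_t;

static void refresh_done(refresh_local_t *local)
{
    /* only now can a final errno be chosen and the transaction resumed */
    for (int i = 0; i < CHILD_COUNT; i++)
        printf("child %d replied with errno %d\n", i, local->op_errno[i]);
}

/* Called once per child's lookup reply; the last reply triggers the
 * aggregation step. */
static void refresh_subvol_cbk(refresh_local_t *local, int child, int op_errno)
{
    local->op_errno[child] = op_errno;
    if (--local->call_count == 0)
        refresh_done(local);
}

int main(void)
{
    refresh_local_t local = {.call_count = CHILD_COUNT};

    refresh_subvol_cbk(&local, 0, 0);
    refresh_subvol_cbk(&local, 1, 107); /* e.g. the downed server: ENOTCONN */
    refresh_subvol_cbk(&local, 2, 0);
    return 0;
}

The point is that each child's reply only records its op_errno; nothing is
decided until the last reply arrives, which is why logging per-child
failures in the callback (as the attached patch does) is the right place
to catch a child handing back EIO.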
diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
index 4bfaef9e8..096ce06f0 100644
--- a/xlators/cluster/afr/src/afr-common.c
+++ b/xlators/cluster/afr/src/afr-common.c
@@ -1318,6 +1318,12 @@ afr_inode_refresh_subvol_cbk(call_frame_t *frame, void *cookie, xlator_t *this,
 if (xdata)
 local->replies[call_child].xdata = dict_ref(xdata);
 }
+if (op_ret == -1)
+gf_msg_callingfn(
+this->name, GF_LOG_ERROR, op_errno, AFR_MSG_SPLIT_BRAIN,
+"Inode refresh on child:%d failed with errno:%d for %s(%s) ",
+call_child, op_errno, local->loc.name,
+uuid_utoa(local->loc.inode->gfid));
 if (xdata) {
 ret = dict_get_int8(xdata, "link-count", &need_heal);
 local->replies[call_child].need_heal = need_heal;






Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-04 Thread Erik Jacobson
I had a co-worker look through this with me (Scott Titus). He has a more
analytical mind than I do. Here is what he said with some edits by me.
My edits were formatting and adjusting some words. So we were hoping
that, given this analysis, the community could let us know if it raises
any red flags that would lead to a solution to the problem (whether it
be setup, settings, or code). If needed, I can get Scott to work with me
and dig further but it was starting to get painful where Scott stopped.

Scott's words (edited):

(all backtraces match - at least up to the point I'm concerned with at this
time)

Error was passed from afr_inode_refresh_done() into afr_txn_refresh_done() as
afr_inode_refresh_done()'s call frame has 'error=0'
while afr_txn_refresh_done() has 'err=5' in the call frame.


#0  afr_read_txn_refresh_done (frame=0x7ffc949cf7c8, this=0x7fff640137b0,
err=5) at afr-read-txn.c:281
#1  0x7fff68901fdb in afr_txn_refresh_done (
frame=frame@entry=0x7ffc949cf7c8, this=this@entry=0x7fff640137b0,
err=5,
err@entry=0) at afr-common.c:1223
#2  0x7fff689022b3 in afr_inode_refresh_done (
frame=frame@entry=0x7ffc949cf7c8, this=this@entry=0x7fff640137b0,
error=0)
at afr-common.c:1295
#3  0x7fff6890f3fb in afr_inode_refresh_subvol_cbk (frame=0x7ffc949cf7c8,
cookie=, this=0x7fff640137b0, op_ret=,
op_errno=, buf=buf@entry=0x7ffd53ffdaa0,
xdata=0x7ffd3c6764f8, par=0x7ffd53ffdb40) at afr-common.c:1333


Within afr_inode_refresh_done(), the only two ways it can generate an error
are by setting it to EINVAL or by propagating a failure status from
afr_has_quorum().  Since EINVAL is 22, not 5, the quorum test failed.

Within the afr_has_quorum() conditional, an error could be set
from afr_final_errno() or afr_quorum_errno().  Digging reveals
afr_quorum_errno() just returns ENOTCONN, which is 107, so that is not it.
This leaves us with afr_final_errno() returning the error.

(Scott provided me with source code with pieces bolded but I don't think
you need that).

afr_final_errno() iterates through the 'children', looking for
valid errors within the replies for the transaction (refresh transaction?).
The function returns the highest valued error, which must be EIO (value of 5)
in this case.
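
As a rough illustration of that aggregation step (this is a toy model, not
the real afr_final_errno()/afr_higher_errno() code, which applies its own
precedence rules rather than a plain numeric comparison):

#include <errno.h>
#include <stdio.h>

/* Pick the "highest valued" non-zero errno from the per-child replies. */
static int final_errno(const int *op_errno, int count)
{
    int highest = 0;
    for (int i = 0; i < count; i++)
        if (op_errno[i] > highest)
            highest = op_errno[i];
    return highest;
}

int main(void)
{
    /* two children answered ENOENT (2), one answered EIO (5) */
    int replies[3] = {ENOENT, ENOENT, EIO};
    printf("aggregated errno: %d\n", final_errno(replies, 3)); /* 5 == EIO */
    return 0;
}

So a single EIO in the replies array is enough to dominate any number of
ENOENTs, which is why finding which child contributed the EIO matters.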

I have not looked into how or what would set the error value in the
replies array; since this is a distributed system, the error could have been
generated on another server. Unless this path needs to be investigated, I'd
rather not get mired in finding which iteration (value of 'i') has the error
and which system or thread added the error to the reply.



Any suggested next steps?

> 
> On 01/04/20 8:57 am, Erik Jacobson wrote:
> > Here are some back traces. They make my head hurt. Maybe you can suggest
> > something else to try next? In the morning I'll try to unwind this
> > myself too in the source code but I suspect it will be tough for me.
> > 
> > 
> > (gdb) break xlators/cluster/afr/src/afr-read-txn.c:280 if err == 5
> > Breakpoint 1 at 0x7fff688e057b: file afr-read-txn.c, line 281.
> > (gdb) continue
> > Continuing.
> > [Switching to Thread 0x7ffec700 (LWP 50175)]
> > 
> > Thread 15 "glfs_epoll007" hit Breakpoint 1, afr_read_txn_refresh_done (
> >  frame=0x7fff48325d78, this=0x7fff640137b0, err=5) at afr-read-txn.c:281
> > 281 if (err) {
> > (gdb) bt
> > #0  afr_read_txn_refresh_done (frame=0x7fff48325d78, this=0x7fff640137b0,
> >  err=5) at afr-read-txn.c:281
> > #1  0x7fff68901fdb in afr_txn_refresh_done (
> >  frame=frame@entry=0x7fff48325d78, this=this@entry=0x7fff640137b0, 
> > err=5,
> >  err@entry=0) at afr-common.c:1223
> > #2  0x7fff689022b3 in afr_inode_refresh_done (
> >  frame=frame@entry=0x7fff48325d78, this=this@entry=0x7fff640137b0, 
> > error=0)
> >  at afr-common.c:1295
> Hmm, afr_inode_refresh_done() is called with error=0 and by the time we
> reach afr_txn_refresh_done(), it becomes 5(i.e. EIO).
> So afr_inode_refresh_done() is changing it to 5. Maybe you can put
> breakpoints/ log messages in afr_inode_refresh_done() at the places where
> error is getting changed and see where the assignment happens.
> 
> 
> Regards,
> Ravi







Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-02 Thread Erik Jacobson
> Hmm, afr_inode_refresh_done() is called with error=0 and by the time we
> reach afr_txn_refresh_done(), it becomes 5(i.e. EIO).
> So afr_inode_refresh_done() is changing it to 5. Maybe you can put
> breakpoints/ log messages in afr_inode_refresh_done() at the places where
> error is getting changed and see where the assignment happens.

I had a lot of struggles tonight getting the system ready to go.  I had
seg11's in glusterfs(nfs) but I think it was related to not all brick
processes stopping with glusterd. It may also have been related to the
re-install and/or my print statements; I'm not sure. I'm not used to seeing that.

I put print statements everywhere I thought error could change and got
no printed log messages.

I put break points where error would change and we didn't hit them.

I then put a breakpoint at

break xlators/cluster/afr/src/afr-common.c:1298 if error != 0

---> refresh_done:
afr_txn_refresh_done(frame, this, error);

And it never triggered (despite split-brain messages and my crapola
message).

So I'm unable to explain this transition. I'm also not a gdb expert.
I still see the same back trace though.

#1  0x7fff68938d7b in afr_txn_refresh_done (
frame=frame@entry=0x7ffd540ed8e8, this=this@entry=0x7fff64013720, err=5,
err@entry=0) at afr-common.c:1222
#2  0x7fff689391f0 in afr_inode_refresh_done (
frame=frame@entry=0x7ffd540ed8e8, this=this@entry=0x7fff64013720, error=0)
at afr-common.c:1299

Is there other advice you might have for me to try?

I'm very eager to solve this problem, which is why I'm staying up late
to get machine time. I must go to bed now. I look forward to another
shot tomorrow night if you have more ideas to try.

Erik






Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-03-31 Thread Ravishankar N



On 01/04/20 8:57 am, Erik Jacobson wrote:

Here are some back traces. They make my head hurt. Maybe you can suggest
something else to try next? In the morning I'll try to unwind this
myself too in the source code but I suspect it will be tough for me.


(gdb) break xlators/cluster/afr/src/afr-read-txn.c:280 if err == 5
Breakpoint 1 at 0x7fff688e057b: file afr-read-txn.c, line 281.
(gdb) continue
Continuing.
[Switching to Thread 0x7ffec700 (LWP 50175)]

Thread 15 "glfs_epoll007" hit Breakpoint 1, afr_read_txn_refresh_done (
 frame=0x7fff48325d78, this=0x7fff640137b0, err=5) at afr-read-txn.c:281
281 if (err) {
(gdb) bt
#0  afr_read_txn_refresh_done (frame=0x7fff48325d78, this=0x7fff640137b0,
 err=5) at afr-read-txn.c:281
#1  0x7fff68901fdb in afr_txn_refresh_done (
 frame=frame@entry=0x7fff48325d78, this=this@entry=0x7fff640137b0, err=5,
 err@entry=0) at afr-common.c:1223
#2  0x7fff689022b3 in afr_inode_refresh_done (
 frame=frame@entry=0x7fff48325d78, this=this@entry=0x7fff640137b0, error=0)
 at afr-common.c:1295
Hmm, afr_inode_refresh_done() is called with error=0 and by the time we 
reach afr_txn_refresh_done(), it becomes 5(i.e. EIO).
So afr_inode_refresh_done() is changing it to 5. Maybe you can put 
breakpoints/ log messages in afr_inode_refresh_done() at the places 
where error is getting changed and see where the assignment happens.



Regards,
Ravi







Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-03-31 Thread Erik Jacobson
THANK YOU for the hints. Very happy to have the help.

I'll reply to a couple things then dig in:

On Tue, Mar 31, 2020 at 03:27:59PM +0530, Ravishankar N wrote:
> From your reply in the other thread, I'm assuming that the file/gfid in
> question is not in genuine split-brain or needing heal. i.e. for example

Right, they were not tagged split-brain either, just healing needed,
which is expected for those 76 files.

> with that 1 brick down and 2 bricks up test case, if you tried to read the
> file from say a temporary fuse mount (which is also now connected to only to
> 2 bricks since the 3rd one is down) it works fine and there is no EIO
> error...

Looking at the heal info, all files are the files I expected to have
write changes and, I think, are outside the scope of this issue. To
close the loop, I ran 'strings' on the top of one of the files from a
fuse mount to confirm, and had no trouble.

> ...which means that what you have observed is true, i.e.
> afr_read_txn_refresh_done() is called with err=EIO. You can add logs to see
> at what point it is EIO set. The call graph is like this: 
> afr_inode_refresh_done()-->afr_txn_refresh_done()-->afr_read_txn_refresh_done().
> 
> Maybe 
> https://github.com/gluster/glusterfs/blob/v7.4/xlators/cluster/afr/src/afr-common.c#L1188
> in afr_txn_refresh_done() is causing it either due to ret being -EIO or
> event_generation being zero.
> 
> If you are comfortable with gdb, you can put a conditional break point in
> afr_read_txn_refresh_done() at 
> https://github.com/gluster/glusterfs/blob/v7.4/xlators/cluster/afr/src/afr-read-txn.c#L283
> when err=EIO and then check the backtrace for who is setting err to EIO.

Ok so the main event! :)

I'm not a gdb expert but I think I figured it out well enough to paste
some back traces. However, I'm having trouble interpreting them exactly.
It looks to me to be the "event" case.

(I got permission to use this MFG system at night for a couple more
nights; avoiding the 24-hour-reserved internal larger system we have).

here is what I did, feel free to suggest something better.

- I am using an RPM build so I changed the spec file to create debuginfo
  packages. I'm on rhel8.1
- I installed the updated packages and debuginfo packages
- When glusterd started the nfs glusterfs, I killed it.
- I ran this:
gdb -d /root/rpmbuild/BUILD/glusterfs-7.2 -d 
/root/rpmbuild/BUILD/glusterfs-7.2/xlators/cluster/afr/src/ /usr/sbin/glusterfs

- Then from GDB, I ran this:
(gdb) run -s localhost --volfile-id gluster/nfs -p /var/run/gluster/nfs/nfs.pid 
-l /var/log/glusterfs/nfs.log -S /var/run/gluster/9ddb5561058ff543.socket -N

- I hit ctrl-c, then set the break point:
(gdb) break xlators/cluster/afr/src/afr-read-txn.c:280 if err == 5
- I have some debugging statements but gluster 72 line 280 is this:
-->  line 280 (I think gdb changed it to 281 internally)
if (err) {
if (!priv->thin_arbiter_count) {

- continue

- Then I ran the test case.


Here are some back traces. They make my head hurt. Maybe you can suggest
something else to try next? In the morning I'll try to unwind this
myself too in the source code but I suspect it will be tough for me.


(gdb) break xlators/cluster/afr/src/afr-read-txn.c:280 if err == 5
Breakpoint 1 at 0x7fff688e057b: file afr-read-txn.c, line 281.
(gdb) continue
Continuing.
[Switching to Thread 0x7ffec700 (LWP 50175)]

Thread 15 "glfs_epoll007" hit Breakpoint 1, afr_read_txn_refresh_done (
frame=0x7fff48325d78, this=0x7fff640137b0, err=5) at afr-read-txn.c:281
281 if (err) {
(gdb) bt
#0  afr_read_txn_refresh_done (frame=0x7fff48325d78, this=0x7fff640137b0, 
err=5) at afr-read-txn.c:281
#1  0x7fff68901fdb in afr_txn_refresh_done (
frame=frame@entry=0x7fff48325d78, this=this@entry=0x7fff640137b0, err=5, 
err@entry=0) at afr-common.c:1223
#2  0x7fff689022b3 in afr_inode_refresh_done (
frame=frame@entry=0x7fff48325d78, this=this@entry=0x7fff640137b0, error=0)
at afr-common.c:1295
#3  0x7fff6890f3fb in afr_inode_refresh_subvol_cbk (frame=0x7fff48325d78, 
cookie=, this=0x7fff640137b0, op_ret=, 
op_errno=, buf=buf@entry=0x7ffecfffdaa0, 
xdata=0x7ffeb806ef08, par=0x7ffecfffdb40) at afr-common.c:1333
#4  0x7fff6890f42a in afr_inode_refresh_subvol_with_lookup_cbk (
frame=, cookie=, this=, 
op_ret=, op_errno=, inode=, 
buf=0x7ffecfffdaa0, xdata=0x7ffeb806ef08, par=0x7ffecfffdb40)
at afr-common.c:1344
#5  0x7fff68b8e96f in client4_0_lookup_cbk (req=, 
iov=, count=, myframe=0x7fff483147b8)
at client-rpc-fops_v2.c:2640
#6  0x7fffed293115 in rpc_clnt_handle_reply (
clnt=clnt@entry=0x7fff640671b0, pollin=pollin@entry=0x7ffeb81aa110)
at rpc-clnt.c:764
#7  0x7fffed2934b3 in rpc_clnt_notify (trans=0x7fff64067540, 
mydata=0x7fff640671e0, event=, data=0x7ffeb81aa110)
at rpc-clnt.c:931
#8  0x7fffed29007b in rpc_transport_notify (
this=this@entry=0x7fff64067540, 

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-03-31 Thread Ravishankar N
From your reply in the other thread, I'm assuming that the file/gfid in 
question is not in genuine split-brain or needing heal. i.e. for example 
with that 1 brick down and 2 bricks up test case, if you tried to read 
the file from say a temporary fuse mount (which is also now connected to 
only to 2 bricks since the 3rd one is down) it works fine and there is 
no EIO error...


...which means that what you have observed is true, i.e. 
afr_read_txn_refresh_done() is called with err=EIO. You can add logs to 
see at what point it is EIO set. The call graph is like this: 
afr_inode_refresh_done()-->afr_txn_refresh_done()-->afr_read_txn_refresh_done().


Maybe 
https://github.com/gluster/glusterfs/blob/v7.4/xlators/cluster/afr/src/afr-common.c#L1188 
in afr_txn_refresh_done() is causing it either due to ret being -EIO or 
event_generation being zero.


If you are comfortable with gdb, you can put a conditional break point in
afr_read_txn_refresh_done() at 
https://github.com/gluster/glusterfs/blob/v7.4/xlators/cluster/afr/src/afr-read-txn.c#L283 
when err=EIO and then check the backtrace for who is setting err to EIO.


Regards,
Ravi
On 31/03/20 12:20 pm, Erik Jacobson wrote:

I note that this part of  afr_read_txn() gets triggered a lot.

 if (afr_is_inode_refresh_reqd(inode, this, local->event_generation,
   event_generation)) {

Maybe that's normal when one of the three servers is down (but why
isn't it using its local copy by default?)

The comment in that if block is:
 /* servers have disconnected / reconnected, and possibly
rebooted, very likely changing the state of freshness
of copies */

But we have one server consistently down, not a changing situation.

digging digging digging seemed to show this is related to cache
invalidation, because the paths seemed to suggest the inode needed
refreshing, and that seems to be handled by a case statement named
GF_UPCALL_CACHE_INVALIDATION.

However, that must have been a wrong turn since turning off
cache invalidation didn't help.

I'm struggling to wrap my head around the code base and without the
background in these concepts it's a tough hill to climb.

I am going to have to try this again some day with fresh eyes and go to
bed; the machine I have easy access to is going away in the morning.
Now I'll have to reserve time on a contended one but I will do that and
continue digging.

Any suggestions would be greatly appreciated as I think I'm starting to
tip over here on this one.


On Mon, Mar 30, 2020 at 04:04:39PM -0500, Erik Jacobson wrote:

Sadly I am not a  developer,  so I can't answer your questions.

I'm not a FS or network developer either. I think there is a joke about
playing one on TV but maybe it's netflix now.

Enabling certain debug options made too much information for me to watch
personally (but an expert could probably get through it).

So I started putting targeted 'print' (gf_msg) statements in the code to
see how it got its way to split-brain. Maybe this will ring a bell
for someone.

I can tell the only way we enter the split-brain path is through the
first if statement of afr_read_txn_refresh_done().

This means afr_read_txn_refresh_done() itself was passed "err" and
that it appears thin_arbiter_count was not set (which makes sense,
I'm using 1x3, not a thin arbiter).

So we jump to the readfn label, and read_subvol should still be -1.
If I read right, it must mean that this if didn't return true because
my print statement didn't appear:
if ((ret == 0) && spb_choice >= 0) {

So we're still with the original read_subvol == -1,
which gets us to the split_brain message.

So now I will try to learn why afr_read_txn_refresh_done() would have
'err' set in the first place. I will also learn about
afr_inode_split_brain_choice_get(). Those seem to be the two methods to
have avoided falling into the split-brain hole here.


I put debug statements in these locations. I will mark with !! what
I see:



diff -Narup glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c 
glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c
--- glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c   2020-01-15 
11:43:53.887894293 -0600
+++ glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c2020-03-30 
15:45:02.917104321 -0500
@@ -279,10 +279,14 @@ afr_read_txn_refresh_done(call_frame_t *
  priv = this->private;

  if (err) {
-if (!priv->thin_arbiter_count)
+if (!priv->thin_arbiter_count) {
+gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg crapola 1st if in 
afr_read_txn_refresh_done() !priv->thin_arbiter_count -- goto to readfn");
!!
We hit this error condition and jump to readfn below
!!!
  goto readfn;
-if (err != EINVAL)
+}
+if (err != EINVAL) {
+gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj 2nd if in 
afr_read_txn_refresh_done() err != EINVAL, goto readfn");
  

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-03-31 Thread Erik Jacobson
I note that this part of  afr_read_txn() gets triggered a lot.

if (afr_is_inode_refresh_reqd(inode, this, local->event_generation,
  event_generation)) {

Maybe that's normal when one of the three servers is down (but why
isn't it using its local copy by default?)

The comment in that if block is:
/* servers have disconnected / reconnected, and possibly
   rebooted, very likely changing the state of freshness
   of copies */

But we have one server consistently down, not a changing situation.
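
If I read it right, the refresh decision reduces to roughly this (my own 
paraphrase, not the real afr_is_inode_refresh_reqd() source; the helper 
name and parameters are made up):

    #include <stdbool.h>

    /* rough paraphrase of the refresh decision as I currently read it */
    bool refresh_reqd_sketch(int txn_event_gen, int inode_event_gen,
                             bool inode_flagged_for_refresh)
    {
        /* a brick disconnecting or reconnecting changes the event
         * generation, so a read transaction that started under an older
         * generation gets pushed through the refresh path even though
         * the file itself did not change */
        if (txn_event_gen != inode_event_gen)
            return true;

        /* upcall/cache-invalidation can also flag the inode directly */
        return inode_flagged_for_refresh;
    }

If that is accurate, the question becomes why the generations would keep 
differing when the down server stays down.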

digging digging digging seemed to show this is related to cache
invalidation, because the paths seemed to suggest the inode needed
refreshing, and that seems to be handled by a case statement named
GF_UPCALL_CACHE_INVALIDATION.

However, that must have been a wrong turn since turning off
cache invalidation didn't help.

I'm struggling to wrap my head around the code base and without the
background in these concepts it's a tough hill to climb.

I am going to have to try this again some day with fresh eyes and go to
bed; the machine I have easy access to is going away in the morning.
Now I'll have to reserve time on a contended one but I will do that and
continue digging.

Any suggestions would be greatly appreciated as I think I'm starting to
tip over here on this one.


On Mon, Mar 30, 2020 at 04:04:39PM -0500, Erik Jacobson wrote:
> > Sadly I am not a  developer,  so I can't answer your questions.
> 
> I'm not a FS or network developer either. I think there is a joke about
> playing one on TV but maybe it's netflix now.
> 
> Enabling certain debug options made too much information for me to watch
> personally (but an expert could probably get through it).
> 
> So I started putting targeted 'print' (gf_msg) statements in the code to
> see how it got its way to split-brain. Maybe this will ring a bell
> for someone.
> 
> I can tell the only way we enter the split-brain path is through the
> first if statement of afr_read_txn_refresh_done().
> 
> This means afr_read_txn_refresh_done() itself was passed "err" and
> that it appears thin_arbiter_count was not set (which makes sense,
> I'm using 1x3, not a thin arbiter).
> 
> So we jump to the readfn label, and read_subvol should still be -1.
> If I read right, it must mean that this if didn't return true because
> my print statement didn't appear:
> if ((ret == 0) && spb_choice >= 0) {
> 
> So we're still with the original read_subvol == -1,
> which gets us to the split_brain message.
> 
> So now I will try to learn why afr_read_txn_refresh_done() would have
> 'err' set in the first place. I will also learn about
> afr_inode_split_brain_choice_get(). Those seem to be the two methods to
> have avoided falling into the split-brain hole here.
> 
> 
> I put debug statements in these locations. I will mark with !! what
> I see:
> 
> 
> 
> diff -Narup glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c 
> glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c
> --- glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c 2020-01-15 
> 11:43:53.887894293 -0600
> +++ glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c  2020-03-30 
> 15:45:02.917104321 -0500
> @@ -279,10 +279,14 @@ afr_read_txn_refresh_done(call_frame_t *
>  priv = this->private;
> 
>  if (err) {
> -if (!priv->thin_arbiter_count)
> +if (!priv->thin_arbiter_count) {
> +gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg crapola 1st if in 
> afr_read_txn_refresh_done() !priv->thin_arbiter_count -- goto to readfn");
> !!
> We hit this error condition and jump to readfn below
> !!!
>  goto readfn;
> -if (err != EINVAL)
> +}
> +if (err != EINVAL) {
> +gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj 2nd if in 
> afr_read_txn_refresh_done() err != EINVAL, goto readfn");
>  goto readfn;
> +}
>  /* We need to query the good bricks and/or thin-arbiter.*/
>  afr_ta_read_txn_synctask(frame, this);
>  return 0;
> @@ -291,6 +295,8 @@ afr_read_txn_refresh_done(call_frame_t *
>  read_subvol = afr_read_subvol_select_by_policy(inode, this, 
> local->readable,
> NULL);
>  if (read_subvol == -1) {
> +gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg whoops read_subvol 
> returned -1, going to readfn");
> +
>  err = EIO;
>  goto readfn;
>  }
> @@ -304,11 +310,15 @@ afr_read_txn_refresh_done(call_frame_t *
>  readfn:
>  if (read_subvol == -1) {
>  ret = afr_inode_split_brain_choice_get(inode, this, _choice);
> -if ((ret == 0) && spb_choice >= 0)
> +if ((ret == 0) && spb_choice >= 0) {
> !!
> We never get here, afr_inode_split_brain_choice_get() must not have
> returned what was needed to enter.
> !!
> +gf_msg(this->name, 

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-03-30 Thread Erik Jacobson
> Sadly I am not a  developer,  so I can't answer your questions.

I'm not a FS or network developer either. I think there is a joke about
playing one on TV but maybe it's netflix now.

Enabling certain debug options made too much information for me to watch
personally (but an expert could probably get through it).

So I started putting targeted 'print' (gf_msg) statements in the code to
see how it got its way to split-brain. Maybe this will ring a bell
for someone.

I can tell the only way we enter the split-brain path is through the
first if statement of afr_read_txn_refresh_done().

This means afr_read_txn_refresh_done() itself was passed "err" and
that it appears thin_arbiter_count was not set (which makes sense,
I'm using 1x3, not a thin arbiter).

So we jump to the readfn label, and read_subvol should still be -1.
If I read right, it must mean that this if didn't return true because
my print statement didn't appear:
if ((ret == 0) && spb_choice >= 0) {

So we're still with the original read_subvol == -1,
which gets us to the split_brain message.

So now I will try to learn why afr_read_txn_refresh_done() would have
'err' set in the first place. I will also learn about
afr_inode_split_brain_choice_get(). Those seem to be the two methods to
have avoided falling into the split-brain hole here.


I put debug statements in these locations. I will mark with !! what
I see:



diff -Narup glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c 
glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c
--- glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c   2020-01-15 
11:43:53.887894293 -0600
+++ glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c2020-03-30 
15:45:02.917104321 -0500
@@ -279,10 +279,14 @@ afr_read_txn_refresh_done(call_frame_t *
 priv = this->private;

 if (err) {
-if (!priv->thin_arbiter_count)
+if (!priv->thin_arbiter_count) {
+gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg crapola 1st if in 
afr_read_txn_refresh_done() !priv->thin_arbiter_count -- goto to readfn");
!!
We hit this error condition and jump to readfn below
!!!
 goto readfn;
-if (err != EINVAL)
+}
+if (err != EINVAL) {
+gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj 2nd if in 
afr_read_txn_refresh_done() err != EINVAL, goto readfn");
 goto readfn;
+}
 /* We need to query the good bricks and/or thin-arbiter.*/
 afr_ta_read_txn_synctask(frame, this);
 return 0;
@@ -291,6 +295,8 @@ afr_read_txn_refresh_done(call_frame_t *
 read_subvol = afr_read_subvol_select_by_policy(inode, this, 
local->readable,
NULL);
 if (read_subvol == -1) {
+gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg whoops read_subvol 
returned -1, going to readfn");
+
 err = EIO;
 goto readfn;
 }
@@ -304,11 +310,15 @@ afr_read_txn_refresh_done(call_frame_t *
 readfn:
 if (read_subvol == -1) {
 ret = afr_inode_split_brain_choice_get(inode, this, _choice);
-if ((ret == 0) && spb_choice >= 0)
+if ((ret == 0) && spb_choice >= 0) {
!!
We never get here, afr_inode_split_brain_choice_get() must not have
returned what was needed to enter.
!!
+gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg read_subvol was -1 
to begin with split brain choice found: %d", spb_choice);
 read_subvol = spb_choice;
+}
 }

 if (read_subvol == -1) {
+   gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg verify this shows up 
above split-brain error");
!!
We hit here. Game over player.
!!
+
 AFR_SET_ERROR_AND_CHECK_SPLIT_BRAIN(-1, err);
 }
 afr_read_txn_wind(frame, this, read_subvol);
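
For reference, my reading of the macro we land in is roughly this (a 
paraphrase of AFR_SET_ERROR_AND_CHECK_SPLIT_BRAIN, not a verbatim copy; 
the _SKETCH name is mine):

    /* paraphrase: record the error on the local and, if the errno is
     * EIO, log the "split-brain observed" message -- whether or not the
     * copies genuinely differ */
    #define AFR_SET_ERROR_AND_CHECK_SPLIT_BRAIN_SKETCH(opret, errnum)       \
        do {                                                                \
            local->op_ret = (opret);                                        \
            local->op_errno = (errnum);                                     \
            if (local->op_errno == EIO)                                     \
                gf_msg(this->name, GF_LOG_ERROR, EIO, AFR_MSG_SPLIT_BRAIN,  \
                       "Failing %s on gfid %s: split-brain observed.",      \
                       gf_fop_list[local->op],                              \
                       uuid_utoa(local->inode->gfid));                      \
        } while (0)

If that reading is right, the "split-brain observed" text in nfs.log really 
means "no readable copy and err is EIO at this point", not proof that the 
copies differ.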







Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-03-30 Thread Erik Jacobson
> Hi Erik,
> Sadly I didn't have the time to take a look in your logs, but I would like to 
> ask you whether you have statistics of the network bandwidth usage.
> Could it be possible that the gNFS server is  starved for bandwidth and fails 
> to reach all bricks  leading to 'split-brain' errors ?
> 

I understand. I doubt there is a bandwidth issue but I'll add this to my
checks. We have 288 nodes per server normally and they run fine with all
servers up. The 76 number is just what we happened to have access to on
an internal system.

Question: what you mentioned above matches a feeling I have personally --
is the split-brain error actually a generic catch-all error for
not being able to get access to a file? So when it says "split-brain"
could it really mean any type of access error? Could it also be given
when there is an IO timeout or something?

I'm starting to break open the source code to look around but I think my
head will explode before I understand it enough. I will still give it a
shot.

I have access to this system until later tonight. Then it goes away. We
have duplicated it on another system that stays, but the machine
internally is so contended for that I wouldn't get a time slot until
later in the week anyway. Trying to make as much use of this "gift"
machine as I can :) :)

Thanks again for the replies so far.

Erik






Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-03-30 Thread Erik Jacobson
Thank you so much for replying --

> > [2020-03-29 03:42:52.295532] E [MSGID: 108008] 
> > [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: 
> > Failing ACCESS on gfid 8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1: split-brain 
> > observed. [Input/output error]


> Since you say that the errors go away when all 3 bricks (which I guess is
> what you refer to as 'leaders') of the replica are up, it could be possible

Yes leaders == gluster+gnfs server for this. We use 'leader' internally
to mean servers that help manage compute nodes. I try to convert it to
'server' in my writing but 'leader' slips out sometimes.

> that the brick you brought down had the only good copy. In such cases, even
> though you have the other 2 bricks of the replica up, they both are bad

I think all 3 copies are good. That is because the same exact files are
accessed the same way when nodes boot. With one server down, 76 nodes
normally boot with no errors. Once in a while one fails with split brain
errors in the log. The more load I put in, the more likely a split brain
when one server is down. So that's why my test case is so weird looking.
It has to generate a bunch of extra load and then try to access root
filesystem files using our tools to trigger the split brain. The test
is good in that it produces at least a couple split-brain errors every
time. I'm actually very happy to have a test case. We've been dealing
with reports for some time.

The healing errors seen are explained by the writable XFS image files in
gluster -- one per node -- that the nodes use for their /etc, /var, and
so on. So the 76 healing messages were expected. If it would help to
reduce confusion, I can repeat the test with using TMPFS for the
writable areas so that the healing list is clear.

> copies waiting to be healed and hence all operations on those files will
> fail with EIO. Since you say this occurs under high load only. I suspect

To be clear, with one server down, operations work like 99.9% of the time.
Same operations on every node. It's only when we bring the load up
(maybe heavy metadata related?) do we get split-brain errors with one
server down.

It is a strange problem but I don't believe there is a problem with any
copy of any file. Never say never and nothing would make me happier than
being wrong and solving the problem.

I want to thank you so much for writing back. I'm willing to try any
suggestions we come up with.

Erik






Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-03-29 Thread Ravishankar N



On 29/03/20 9:40 am, Erik Jacobson wrote:

Hello all,

I am getting split-brain errors in the gnfs nfs.log when 1 gluster
server is down in a 3-brick/3-node gluster volume. It only happens under
intense load.

In the lab, I have a test case that can repeat the problem on a single
subvolume cluster.

  If all leaders are up, we see no errors.


Here are example nfs.log errors:


[2020-03-29 03:42:52.295532] E [MSGID: 108008] 
[afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing 
ACCESS on gfid 8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1: split-brain observed. 
[Input/output error]

Since you say that the errors go away when all 3 bricks (which I guess 
is what you refer to as 'leaders') of the replica are up, it could be 
possible that the brick you brought down had the only good copy. In such 
cases, even though you have the other 2 bricks of the replica up, they 
both are bad copies waiting to be healed and hence all operations on 
those files will fail with EIO. Since you say this occurs under high 
load only, I suspect this is the case: heal hasn't had the time to 
catch up with the nodes going up and down.


If you see the split-brain errors despite all 3 replica bricks being 
online and the gnfs server being able to connect to all of them, then it 
could be a genuine split-brain problem. But I don't think that is the 
case here.


Regards,
Ravi







Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-03-29 Thread Erik Jacobson
Thank you for replying!! Responses below...

I have attached the volume def (meant to before).
I have attached a couple logs from one of the leaders.

> That's  odd.
> As far as I know, the clients are accessing one of the gluster nodes
> that serves as NFS server and then syncs data across the peers, right?

Correct, although in this case, with a 1x3, all of them should have
local copies. Our first reports came in from 3x3 (9 server) systems but
we have been able to duplicate on 1x3 thankfully in house. This is a
huge step forward as I had no reproducer previously.

> What happens when the virtual IP(s) are  failed  over to the other gluster 
> node? Is the issue resolved?

While we do use CTDB for managing the IP aliases, I don't start the test until
the IP is stabilized. I put all 76 nodes on one IP alias to make a more
similar load to what we have in the field.

I think it is important to point out that if I reduce the load, all is
well. For example, if the test were just booting -- where the initial
reports were seen -- just 1 or 2 nodes out of 1,000 would have an issue
each cycle. They all boot the same way and are all using the same IP
alias for NFS in my test case. So I think the split-brain messages are maybe
a symptom of some sort of timeout ??? (making stuff up here).

> Also, what kind of  load balancing are you using ?
[I moved this question up because the below answer has too much
output]

We are doing very simple balancing - manual balancing. As we add compute
nodes to the cluster, a couple racks are assigned to IP alias #1, the
next couple to IP alias #2, and so on. I'm happy to not have the
complexity of a real load balancer right now.


> Do you get any split brain entries via 'gluster volume heal  info' ?

I ran three trials of the 'gluster volume heal ...' command:

Trial 1 - with all 3 servers up and while running the load:
[root@leader2 ~]# gluster volume heal cm_shared info
Brick 172.23.0.4:/data/brick_cm_shared
Status: Connected
Number of entries: 0

Brick 172.23.0.5:/data/brick_cm_shared
Status: Connected
Number of entries: 0

Brick 172.23.0.6:/data/brick_cm_shared
Status: Connected
Number of entries: 0


Trial 2 - with 1 server down (stopped glusterd on 1 server) - and
without doing any testing yet -- I see this. Let me explain, though:
while not in the error path, I am using RW NFS filesystem image blobs on
this same volume for the writable areas of the nodes. In the field, we
duplicate the problem even when using TMPFS for that writable area. I am
happy to re-do the test with RO NFS and TMPFS for the writable area; my
GUESS is that the healing messages would go away. Would that help?
If you look at the heal count -- 76 -- that equals the node count, i.e. the
number of writable XFS image files used for writing, one per node.

[root@leader2 ~]# gluster volume heal cm_shared info
Brick 172.23.0.4:/data/brick_cm_shared
Status: Transport endpoint is not connected
Number of entries: -

Brick 172.23.0.5:/data/brick_cm_shared
[entries not shown]
Status: Connected
Number of entries: 8

Brick 172.23.0.6:/data/brick_cm_shared
[entries not shown]
Status: Connected
Number of entries: 8



Trial 3 - ran the heal command around the time the split-brain errors
were being reported


[root@leader2 glusterfs]# gluster volume heal cm_shared info
Brick 172.23.0.4:/data/brick_cm_shared
Status: Transport endpoint is not connected
Number of entries: -

Brick 172.23.0.5:/data/brick_cm_shared
[entries not shown]
Status: Connected
Number of entries: 76

Brick 172.23.0.6:/data/brick_cm_shared
[entries not shown]
Status: Connected
Number of entries: 76

 
Volume Name: cm_shared
Type: Replicate
Volume ID: f6175f56-8422-4056-9891-f9ba84756b87
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.23.0.4:/data/brick_cm_shared
Brick2: 172.23.0.5:/data/brick_cm_shared
Brick3: 172.23.0.6:/data/brick_cm_shared
Options Reconfigured:
nfs.event-threads: 3
config.brick-threads: 16
config.client-threads: 16
performance.iot-pass-through: false
config.global-threading: off
performance.client-io-threads: on
nfs.disable: off
storage.fips-mode-rchecksum: on
transport.address-family: inet
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
cluster.lookup-optimize: on
client.event-threads: 32
server.event-threads: 32
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 100
performance.io-thread-count: 32
performance.cache-size: 8GB
performance.parallel-readdir: on
cluster.lookup-unhashed: auto
performance.flush-behind: on
performance.aggregate-size: 2048KB
performance.write-behind-trickling-writes: off
transport.listen-backlog: 16384
performance.write-behind-window-size: 1024MB
server.outstanding-rpc-limit: 1024
nfs.outstanding-rpc-limit: 1024
nfs.acl: on
storage.max-hardlinks: 0
performance.cache-refresh-timeout: 60