Re: [Gluster-users] brick is down but gluster volume status says it's fine

2017-10-24 Thread Alastair Neil
It looks like this is to do with the stale port issue.

I think it's pretty clear from the output below that volume status shows the
digitalcorpora brick process on gluster-2 with the same TCP port, 49156, as
the public volume's brick on that node, but that it is actually listening on
49154.  So although the brick process is technically up, nothing is talking
to it.  I am surprised I don't see more errors in the brick log for
brick8/public.  It also explains the whack-a-mole problem: every time I kill
and restart the daemon it must be grabbing the port of another brick, and
then that volume's brick goes silent.
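
In case it helps anyone else chasing this, here is a quick-and-dirty check
along the lines of what I did by hand (a rough sketch, not an official tool;
it assumes bash, pgrep and netstat are available on the node and reads
/proc/<pid>/cmdline).  It prints each brick process's volfile-id next to the
port it is really listening on, for comparison against the TCP Port column
of gluster volume status:

#!/bin/bash
# Rough sketch: for every glusterfsd on this node, print the volfile-id from
# its command line and the TCP port it is actually listening on.
for pid in $(pgrep -x glusterfsd); do
    vol=$(tr '\0' '\n' < "/proc/$pid/cmdline" | grep -A1 -x -e '--volfile-id' | tail -n 1)
    port=$(netstat -ltnp 2>/dev/null | awk -v p="$pid" '$NF == p"/glusterfsd" {split($4, a, ":"); print a[2]}')
    echo "pid=$pid volfile-id=$vol listening=$port"
done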

I killed all the brick processes and restarted glusterd and everything came
up ok.
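
Roughly, the recovery on this node amounted to the following (a sketch, not
a recipe: I checked that the other two replicas were healthy first, since
this restarts every brick process on the node):

pkill glusterfsd               # stop every brick process on this node
systemctl restart glusterd     # glusterd respawns the bricks with freshly assigned ports

I believe "gluster volume start <volname> force" can also be used to respawn
only the missing bricks of a single volume, if you want to avoid touching the
others.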


[root@gluster-2 ~]# glv status digitalcorpora | grep -v ^Self
Status of volume: digitalcorpora
Gluster process                                             TCP Port  RDMA Port  Online  Pid
---------------------------------------------------------------------------------------------
Brick gluster-2:/export/brick7/digitalcorpora               49156     0          Y       125708
Brick gluster1.vsnet.gmu.edu:/export/brick7/digitalcorpora  49152     0          Y       12345
Brick gluster0:/export/brick7/digitalcorpora                49152     0          Y       16098

Task Status of Volume digitalcorpora
---------------------------------------------------------------------------------------------
There are no active volume tasks

[root@gluster-2 ~]# glv status public  | grep -v ^Self
Status of volume: public
Gluster process                                             TCP Port  RDMA Port  Online  Pid
---------------------------------------------------------------------------------------------
Brick gluster1:/export/brick8/public                        49156     0          Y       3519
Brick gluster2:/export/brick8/public                        49156     0          Y       8578
Brick gluster0:/export/brick8/public                        49156     0          Y       3176

Task Status of Volume public
---------------------------------------------------------------------------------------------
There are no active volume tasks

[root@gluster-2 ~]# netstat -pant | grep 8578 | grep 0.0.0.0
tcp        0      0 0.0.0.0:49156      0.0.0.0:*       LISTEN      8578/glusterfsd
[root@gluster-2 ~]# netstat -pant | grep 125708 | grep 0.0.0.0
tcp        0      0 0.0.0.0:49154      0.0.0.0:*       LISTEN      125708/glusterfsd
[root@gluster-2 ~]# ps -c  --pid  125708 8578
   PID CLS PRI TTY      STAT   TIME COMMAND
  8578 TS   19 ?        Ssl  224:20 /usr/sbin/glusterfsd -s gluster2 --volfile-id public.gluster2.export-brick8-public -p /var/lib/glusterd/vols/public/run/gluster2-export-bric
125708 TS   19 ?        Ssl    0:08 /usr/sbin/glusterfsd -s gluster-2 --volfile-id digitalcorpora.gluster-2.export-brick7-digitalcorpora -p /var/lib/glusterd/vols/digitalcorpor
[root@gluster-2 ~]#


On 24 October 2017 at 13:56, Atin Mukherjee  wrote:

>
>
> On Tue, Oct 24, 2017 at 11:13 PM, Alastair Neil 
> wrote:
>
>> gluster version 3.10.6, replica 3 volume, daemon is present but does not
>> appear to be functioning
>>
>> Peculiar behaviour: if I kill the glusterfs brick daemon and restart
>> glusterd then the brick becomes available, but one of my other volumes'
>> bricks on the same server goes down in the same way; it's like whack-a-mole.
>>
>> any ideas?
>>
>
> The subject and the data look contradictory to me. The brick log (what
> you shared) doesn't have a cleanup_and_exit () trigger for a shutdown. Are
> you sure the brick is down? OTOH, I see a port mismatch for
> brick7/digitalcorpora, where the brick process has 49154 but gluster volume
> status shows 49156. There is a known issue with stale ports which we're
> trying to address through https://review.gluster.org/18541 . But could you
> specify what exactly the problem is? Is it the stale port, or the conflict
> between volume status output and actual brick health? If it's the latter,
> I'd need further information, such as the output of the "gluster get-state"
> command from the same node.
>
>
>>
>> [root@gluster-2 bricks]# glv status digitalcorpora
>>
>>> Status of volume: digitalcorpora
>>> Gluster process                                             TCP Port  RDMA Port  Online  Pid
>>> ---------------------------------------------------------------------------------------------
>>> Brick gluster-2:/export/brick7/digitalcorpora               49156     0          Y       125708
>>> Brick gluster1.vsnet.gmu.edu:/export/brick7/digitalcorpora  49152     0          Y       12345
>>> Brick gluster0:/export/brick7/digitalcorpora                49152     0          Y       16098
>>> Self-heal Daemon on localhost                               N/A       N/A        Y       126625
>>> Self-heal Daemon on gluster1                                N/A       N/A        Y       15405
>>> Self-heal Daemon on gluster0                                N/A       N/A        Y       18584
>>>
>>> Task Status of Volume digitalcorpora
>>> ---------------------------------------------------------------------------------------------

Re: [Gluster-users] brick is down but gluster volume status says it's fine

2017-10-24 Thread Atin Mukherjee
On Tue, Oct 24, 2017 at 11:13 PM, Alastair Neil 
wrote:

> gluster version 3.10.6, replica 3 volume, daemon is present but does not
> appear to be functioning
>
> Peculiar behaviour: if I kill the glusterfs brick daemon and restart
> glusterd then the brick becomes available, but one of my other volumes'
> bricks on the same server goes down in the same way; it's like whack-a-mole.
>
> any ideas?
>

The subject and the data look contradictory to me. The brick log (what
you shared) doesn't have a cleanup_and_exit () trigger for a shutdown. Are
you sure the brick is down? OTOH, I see a port mismatch for
brick7/digitalcorpora, where the brick process has 49154 but gluster volume
status shows 49156. There is a known issue with stale ports which we're
trying to address through https://review.gluster.org/18541 . But could you
specify what exactly the problem is? Is it the stale port, or the conflict
between volume status output and actual brick health? If it's the latter,
I'd need further information, such as the output of the "gluster get-state"
command from the same node.
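
(Something along these lines should produce that dump on the node; the odir
and file arguments are optional, and if I recall correctly the state file is
written under /var/run/gluster/ by default, so treat the path below only as
an example:

gluster get-state glusterd odir /tmp file gluster-2-state

and then attach the resulting /tmp/gluster-2-state here.)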


>
> [root@gluster-2 bricks]# glv status digitalcorpora
>
>> Status of volume: digitalcorpora
>> Gluster process                                             TCP Port  RDMA Port  Online  Pid
>> ---------------------------------------------------------------------------------------------
>> Brick gluster-2:/export/brick7/digitalcorpora               49156     0          Y       125708
>> Brick gluster1.vsnet.gmu.edu:/export/brick7/digitalcorpora  49152     0          Y       12345
>> Brick gluster0:/export/brick7/digitalcorpora                49152     0          Y       16098
>> Self-heal Daemon on localhost                               N/A       N/A        Y       126625
>> Self-heal Daemon on gluster1                                N/A       N/A        Y       15405
>> Self-heal Daemon on gluster0                                N/A       N/A        Y       18584
>>
>> Task Status of Volume digitalcorpora
>> ---------------------------------------------------------------------------------------------
>> There are no active volume tasks
>>
>> [root@gluster-2 bricks]# glv heal digitalcorpora info
>> Brick gluster-2:/export/brick7/digitalcorpora
>> Status: Transport endpoint is not connected
>> Number of entries: -
>>
>> Brick gluster1.vsnet.gmu.edu:/export/brick7/digitalcorpora
>> /.trashcan
>> /DigitalCorpora/hello2.txt
>> /DigitalCorpora
>> Status: Connected
>> Number of entries: 3
>>
>> Brick gluster0:/export/brick7/digitalcorpora
>> /.trashcan
>> /DigitalCorpora/hello2.txt
>> /DigitalCorpora
>> Status: Connected
>> Number of entries: 3
>>
>> [2017-10-24 17:18:48.288505] W [glusterfsd.c:1360:cleanup_and_exit]
>> (-->/lib64/libpthread.so.0(+0x7e25) [0x7f6f83c9de25]
>> -->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x55a148eeb135]
>> -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x55a148eeaf5b] ) 0-:
>> received signum (15), shutting down
>> [2017-10-24 17:18:59.270384] I [MSGID: 100030] [glusterfsd.c:2503:main]
>> 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.10.6
>> (args: /usr/sbin/glusterfsd -s gluster-2 --volfile-id
>> digitalcorpora.gluster-2.export-brick7-digitalcorpora -p
>> /var/lib/glusterd/vols/digitalcorpora/run/gluster-2-
>> export-brick7-digitalcorpora.pid -S /var/run/gluster/
>> f8e0b3393e47dc51a07c6609f9b40841.socket --brick-name
>> /export/brick7/digitalcorpora -l /var/log/glusterfs/bricks/
>> export-brick7-digitalcorpora.log --xlator-option *-posix.glusterd-uuid=
>> 032c17f5-8cc9-445f-aa45-897b5a066b43 --brick-port 49154 --xlator-option
>> digitalcorpora-server.listen-port=49154)
>> [2017-10-24 17:18:59.285279] I [MSGID: 101190] 
>> [event-epoll.c:629:event_dispatch_epoll_worker]
>> 0-epoll: Started thread with index 1
>> [2017-10-24 17:19:04.611723] I 
>> [rpcsvc.c:2237:rpcsvc_set_outstanding_rpc_limit]
>> 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
>> [2017-10-24 17:19:04.611815] W [MSGID: 101002] 
>> [options.c:954:xl_opt_validate]
>> 0-digitalcorpora-server: option 'listen-port' is deprecated, preferred is
>> 'transport.socket.listen-port', continuing with correction
>> [2017-10-24 17:19:04.615974] W [MSGID: 101174]
>> [graph.c:361:_log_if_unknown_option] 0-digitalcorpora-server: option
>> 'rpc-auth.auth-glusterfs' is not recognized
>> [2017-10-24 17:19:04.616033] W [MSGID: 101174]
>> [graph.c:361:_log_if_unknown_option] 0-digitalcorpora-server: option
>> 'rpc-auth.auth-unix' is not recognized
>> [2017-10-24 17:19:04.616070] W [MSGID: 101174]
>> [graph.c:361:_log_if_unknown_option] 0-digitalcorpora-server: option
>> 'rpc-auth.auth-null' is not recognized
>> [2017-10-24 17:19:04.616134] W [MSGID: 101174]
>> [graph.c:361:_log_if_unknown_option] 0-digitalcorpora-server: option
>> 'auth-path' is not recognized
>> [2017-10-24 17:19:04.616177] W [MSGID: 101174]
>> [graph.c:361:_log_if_unknown_option] 0-digitalcorpora-server: option
>> 'ping-timeout' is not recognized
>> 

[Gluster-users] brick is down but gluster volume status says it's fine

2017-10-24 Thread Alastair Neil
gluster version 3.10.6, replica 3 volume, daemon is present but does not
appear to be functioning

Peculiar behaviour: if I kill the glusterfs brick daemon and restart
glusterd then the brick becomes available, but one of my other volumes'
bricks on the same server goes down in the same way; it's like whack-a-mole.

any ideas?


[root@gluster-2 bricks]# glv status digitalcorpora

> Status of volume: digitalcorpora
> Gluster process                                             TCP Port  RDMA Port  Online  Pid
> ---------------------------------------------------------------------------------------------
> Brick gluster-2:/export/brick7/digitalcorpora               49156     0          Y       125708
> Brick gluster1.vsnet.gmu.edu:/export/brick7/digitalcorpora  49152     0          Y       12345
> Brick gluster0:/export/brick7/digitalcorpora                49152     0          Y       16098
> Self-heal Daemon on localhost                               N/A       N/A        Y       126625
> Self-heal Daemon on gluster1                                N/A       N/A        Y       15405
> Self-heal Daemon on gluster0                                N/A       N/A        Y       18584
>
> Task Status of Volume digitalcorpora
> ---------------------------------------------------------------------------------------------
> There are no active volume tasks
>
> [root@gluster-2 bricks]# glv heal digitalcorpora info
> Brick gluster-2:/export/brick7/digitalcorpora
> Status: Transport endpoint is not connected
> Number of entries: -
>
> Brick gluster1.vsnet.gmu.edu:/export/brick7/digitalcorpora
> /.trashcan
> /DigitalCorpora/hello2.txt
> /DigitalCorpora
> Status: Connected
> Number of entries: 3
>
> Brick gluster0:/export/brick7/digitalcorpora
> /.trashcan
> /DigitalCorpora/hello2.txt
> /DigitalCorpora
> Status: Connected
> Number of entries: 3
>
> [2017-10-24 17:18:48.288505] W [glusterfsd.c:1360:cleanup_and_exit]
> (-->/lib64/libpthread.so.0(+0x7e25) [0x7f6f83c9de25]
> -->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x55a148eeb135]
> -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x55a148eeaf5b] ) 0-:
> received signum (15), shutting down
> [2017-10-24 17:18:59.270384] I [MSGID: 100030] [glusterfsd.c:2503:main]
> 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.10.6
> (args: /usr/sbin/glusterfsd -s gluster-2 --volfile-id
> digitalcorpora.gluster-2.export-brick7-digitalcorpora -p
> /var/lib/glusterd/vols/digitalcorpora/run/gluster-2-export-brick7-digitalcorpora.pid
> -S /var/run/gluster/f8e0b3393e47dc51a07c6609f9b40841.socket --brick-name
> /export/brick7/digitalcorpora -l
> /var/log/glusterfs/bricks/export-brick7-digitalcorpora.log --xlator-option
> *-posix.glusterd-uuid=032c17f5-8cc9-445f-aa45-897b5a066b43 --brick-port
> 49154 --xlator-option digitalcorpora-server.listen-port=49154)
> [2017-10-24 17:18:59.285279] I [MSGID: 101190]
> [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread
> with index 1
> [2017-10-24 17:19:04.611723] I
> [rpcsvc.c:2237:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured
> rpc.outstanding-rpc-limit with value 64
> [2017-10-24 17:19:04.611815] W [MSGID: 101002]
> [options.c:954:xl_opt_validate] 0-digitalcorpora-server: option
> 'listen-port' is deprecated, preferred is 'transport.socket.listen-port',
> continuing with correction
> [2017-10-24 17:19:04.615974] W [MSGID: 101174]
> [graph.c:361:_log_if_unknown_option] 0-digitalcorpora-server: option
> 'rpc-auth.auth-glusterfs' is not recognized
> [2017-10-24 17:19:04.616033] W [MSGID: 101174]
> [graph.c:361:_log_if_unknown_option] 0-digitalcorpora-server: option
> 'rpc-auth.auth-unix' is not recognized
> [2017-10-24 17:19:04.616070] W [MSGID: 101174]
> [graph.c:361:_log_if_unknown_option] 0-digitalcorpora-server: option
> 'rpc-auth.auth-null' is not recognized
> [2017-10-24 17:19:04.616134] W [MSGID: 101174]
> [graph.c:361:_log_if_unknown_option] 0-digitalcorpora-server: option
> 'auth-path' is not recognized
> [2017-10-24 17:19:04.616177] W [MSGID: 101174]
> [graph.c:361:_log_if_unknown_option] 0-digitalcorpora-server: option
> 'ping-timeout' is not recognized
> [2017-10-24 17:19:04.616203] W [MSGID: 101174]
> [graph.c:361:_log_if_unknown_option] 0-/export/brick7/digitalcorpora:
> option 'rpc-auth-allow-insecure' is not recognized
> [2017-10-24 17:19:04.616215] W [MSGID: 101174]
> [graph.c:361:_log_if_unknown_option] 0-/export/brick7/digitalcorpora:
> option 'auth.addr./export/brick7/digitalcorpora.allow' is not recognized
> [2017-10-24 17:19:04.616226] W [MSGID: 101174]
> [graph.c:361:_log_if_unknown_option] 0-/export/brick7/digitalcorpora:
> option 'auth-path' is not recognized
> [2017-10-24 17:19:04.616237] W [MSGID: 101174]
> [graph.c:361:_log_if_unknown_option] 0-/export/brick7/digitalcorpora:
> option 'auth.login.b17f2513-7d9c-4174-a0c5-de4a752d46ca.password' is not
> recognized
> [2017-10-24 17:19:04.616248] W [MSGID: 101174]
> [graph.c:361:_log_if_unknown_option] 0-/export/brick7/digitalcorpora:
> option