Hi,

Given that this is an 'OSError', I should probably have said that we run Red Hat 6.6 (64-bit) with XFS as the brick filesystem.
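Since the slave-side traceback ends in lsetxattr(), I have been sanity-checking that extended attributes work at all on the brick filesystem. A minimal probe along these lines (the helper name and attribute name are just placeholders; it uses the user.* namespace via Python 3's os.setxattr, whereas Gluster itself sets trusted.* xattrs as root — on the Python 2.6 that ships with RHEL 6 you would use setfattr/getfattr instead):

```python
import os
import tempfile

def probe_xattr(directory):
    """Round-trip a user.* extended attribute on a scratch file in
    `directory`.  Returns ("ok", value) or ("failed", errno); errno 5
    (EIO) here would mirror what geo-replication hits in lsetxattr."""
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        try:
            # follow_symlinks=False makes this the lsetxattr variant,
            # the same call the slave worker dies in.
            os.setxattr(f.name, b"user.georep.probe", b"1",
                        follow_symlinks=False)
            value = os.getxattr(f.name, b"user.georep.probe",
                                follow_symlinks=False)
            return ("ok", value)
        except OSError as e:
            return ("failed", e.errno)

# Point this at the actual brick directory,
# e.g. /gluster/homegs-brick-1/brick on the slave:
print(probe_xattr("/tmp"))
```

(Note that /tmp may be tmpfs, which does not always support user.* xattrs, so run it against the brick path itself to get a meaningful answer.)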
Has anyone any input on how to troubleshoot this?

Thanks,
Thibault.

On 24 Aug 2015 4:03 pm, "Thibault Godouet" <[email protected]> wrote:
> I have had multiple issues with geo-replication. It seems to work OK
> initially, the replica gets up to date, and not long after (e.g. a couple
> of days) the replication goes into a faulty state and won't get out of it.
>
> I have tried a few times now, and on the last attempt I re-created the
> slave volume and set up the replication again. Same symptoms again.
>
> I use Gluster 3.7.3, and you will find my setup and log messages at the
> bottom of the email.
>
> Any idea what could cause this and how to fix it?
>
> Thanks,
> Thibault.
>
> PS: my setup and log messages:
>
> Master:
>
> Volume Name: home
> Type: Replicate
> Volume ID: 2299a204-a1dc-449d-8556-bc65197373c7
> Status: Started
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: server4.uberit.net:/gluster/home-brick-1
> Brick2: server5.uberit.net:/gluster/home-brick-1
> Options Reconfigured:
> performance.readdir-ahead: on
> geo-replication.indexing: on
> geo-replication.ignore-pid-check: on
> changelog.changelog: on
>
> Slave:
>
> Volume Name: homegs
> Type: Distribute
> Volume ID: 746dfdc3-650d-4468-9fdd-d621dd215b94
> Status: Started
> Number of Bricks: 1
> Transport-type: tcp
> Bricks:
> Brick1: remoteserver1.uberit.net:/gluster/homegs-brick-1/brick
> Options Reconfigured:
> performance.readdir-ahead: on
>
> The geo-replication status and config (I think I ended up with only
> default values) are:
>
> # gluster volume geo-replication home ssh://remoteserver1::homegs status
>
> MASTER NODE  MASTER VOL  MASTER BRICK           SLAVE USER  SLAVE                        SLAVE NODE  STATUS  CRAWL STATUS  LAST_SYNCED
> --------------------------------------------------------------------------------------------------------------------------------------
> server5      home        /gluster/home-brick-1  root        ssh://remoteserver1::homegs  N/A         Faulty  N/A           N/A
> server4      home        /gluster/home-brick-1  root        ssh://remoteserver1::homegs  N/A         Faulty  N/A           N/A
>
> # gluster volume geo-replication home ssh://remoteserver1::homegs config
> special_sync_mode: partial
> state_socket_unencoded: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.socket
> gluster_log_file: /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.gluster.log
> ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem
> ignore_deletes: false
> change_detector: changelog
> gluster_command_dir: /usr/sbin/
> georep_session_working_dir: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/
> state_file: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.status
> remote_gsyncd: /nonexistent/gsyncd
> session_owner: 2299a204-a1dc-449d-8556-bc65197373c7
> changelog_log_file: /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs-changes.log
> socketdir: /var/run/gluster
> working_dir: /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs
> state_detail_file: /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs-detail.status
> ssh_command_tar: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem
> pid_file: /var/lib/glusterd/geo-replication/home_remoteserver_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.pid
> log_file: /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.log
> gluster_params: aux-gfid-mount acl
> volume_id: 2299a204-a1dc-449d-8556-bc65197373c7
>
> The logs look like this on the master, on server1:
>
> [2015-08-24 15:21:07.955600] I [monitor(monitor):221:monitor] Monitor: ------------------------------------------------------------
> [2015-08-24 15:21:07.955883] I [monitor(monitor):222:monitor] Monitor: starting gsyncd worker
> [2015-08-24 15:21:08.69528] I [gsyncd(/gluster/home-brick-1):649:main_i] <top>: syncing: gluster://localhost:home -> ssh://[email protected]:gluster://localhost:homegs
> [2015-08-24 15:21:08.70938] I [changelogagent(agent):75:__init__] ChangelogAgent: Agent listining...
> [2015-08-24 15:21:11.255237] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up xsync change detection mode
> [2015-08-24 15:21:11.255532] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
> [2015-08-24 15:21:11.256570] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changelog change detection mode
> [2015-08-24 15:21:11.256726] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
> [2015-08-24 15:21:11.257345] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changeloghistory change detection mode
> [2015-08-24 15:21:11.257534] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
> [2015-08-24 15:21:13.333628] I [master(/gluster/home-brick-1):1212:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40172.18.0.169%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs/62d98e8cc00a34eb85b4fe6d6fd3ba33/xsync
> [2015-08-24 15:21:13.333870] I [resource(/gluster/home-brick-1):1432:service_loop] GLUSTER: Register time: 1440426073
> [2015-08-24 15:21:13.401132] I [master(/gluster/home-brick-1):523:crawlwrap] _GMaster: primary master with volume id 2299a204-a1dc-449d-8556-bc65197373c7 ...
> [2015-08-24 15:21:13.412795] I [master(/gluster/home-brick-1):532:crawlwrap] _GMaster: crawl interval: 1 seconds
> [2015-08-24 15:21:13.427340] I [master(/gluster/home-brick-1):1127:crawl] _GMaster: starting history crawl... turns: 1, stime: (1440411353, 0)
> [2015-08-24 15:21:14.432327] I [master(/gluster/home-brick-1):1156:crawl] _GMaster: slave's time: (1440411353, 0)
> [2015-08-24 15:21:14.890889] E [repce(/gluster/home-brick-1):207:__call__] RepceClient: call 20960:140215190427392:1440426074.56 (entry_ops) failed on peer with OSError
> [2015-08-24 15:21:14.891124] E [syncdutils(/gluster/home-brick-1):276:log_raise_exception] <top>: FAIL:
> Traceback (most recent call last):
>   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in main
>     main_i()
>   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, in main_i
>     local.service_loop(*[r for r in [remote] if r])
>   File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1438, in service_loop
>     g3.crawlwrap(oneshot=True)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 584, in crawlwrap
>     self.crawl()
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1165, in crawl
>     self.changelogs_batch_process(changes)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1074, in changelogs_batch_process
>     self.process(batch)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 952, in process
>     self.process_change(change, done, retry)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 907, in process_change
>     failures = self.slave.server.entry_ops(entries)
>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
>     return self.ins(self.meth, *a)
>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
>     raise res
> OSError: [Errno 5] Input/output error
> [2015-08-24 15:21:14.892291] I [syncdutils(/gluster/home-brick-1):220:finalize] <top>: exiting.
> [2015-08-24 15:21:14.893665] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
> [2015-08-24 15:21:14.893879] I [syncdutils(agent):220:finalize] <top>: exiting.
> [2015-08-24 15:21:15.259360] I [monitor(monitor):282:monitor] Monitor: worker(/gluster/home-brick-1) died in startup phase
>
> and on master server2:
>
> [2015-08-24 15:21:07.650707] I [monitor(monitor):221:monitor] Monitor: ------------------------------------------------------------
> [2015-08-24 15:21:07.651144] I [monitor(monitor):222:monitor] Monitor: starting gsyncd worker
> [2015-08-24 15:21:07.764817] I [gsyncd(/gluster/home-brick-1):649:main_i] <top>: syncing: gluster://localhost:home -> ssh://[email protected]:gluster://localhost:homegs
> [2015-08-24 15:21:07.768552] I [changelogagent(agent):75:__init__] ChangelogAgent: Agent listining...
> [2015-08-24 15:21:11.9820] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up xsync change detection mode
> [2015-08-24 15:21:11.10199] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
> [2015-08-24 15:21:11.10946] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changelog change detection mode
> [2015-08-24 15:21:11.11115] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
> [2015-08-24 15:21:11.11744] I [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up changeloghistory change detection mode
> [2015-08-24 15:21:11.11933] I [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the sync engine
> [2015-08-24 15:21:13.59192] I [master(/gluster/home-brick-1):1212:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs/62d98e8cc00a34eb85b4fe6d6fd3ba33/xsync
> [2015-08-24 15:21:13.59454] I [resource(/gluster/home-brick-1):1432:service_loop] GLUSTER: Register time: 1440426073
> [2015-08-24 15:21:13.113203] I [master(/gluster/home-brick-1):523:crawlwrap] _GMaster: primary master with volume id 2299a204-a1dc-449d-8556-bc65197373c7 ...
> [2015-08-24 15:21:13.122018] I [master(/gluster/home-brick-1):532:crawlwrap] _GMaster: crawl interval: 1 seconds
> [2015-08-24 15:21:23.209912] E [repce(/gluster/home-brick-1):207:__call__] RepceClient: call 1561:140164806457088:1440426083.11 (keep_alive) failed on peer with OSError
> [2015-08-24 15:21:23.210119] E [syncdutils(/gluster/home-brick-1):276:log_raise_exception] <top>: FAIL:
> Traceback (most recent call last):
>   File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
>     tf(*aa)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 438, in keep_alive
>     cls.slave.server.keep_alive(vi)
>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
>     return self.ins(self.meth, *a)
>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
>     raise res
> OSError: [Errno 22] Invalid argument
> [2015-08-24 15:21:23.210975] I [syncdutils(/gluster/home-brick-1):220:finalize] <top>: exiting.
> [2015-08-24 15:21:23.212455] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
> [2015-08-24 15:21:23.212707] I [syncdutils(agent):220:finalize] <top>: exiting.
> [2015-08-24 15:21:24.23336] I [monitor(monitor):282:monitor] Monitor: worker(/gluster/home-brick-1) died in startup phase
>
> and on the slave (in a different timezone, one hour behind):
>
> [2015-08-24 14:22:02.923098] I [resource(slave):844:service_loop] GLUSTER: slave listening
> [2015-08-24 14:22:07.606774] E [repce(slave):117:worker] <top>: call failed:
> Traceback (most recent call last):
>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
>     res = getattr(self.obj, rmeth)(*in_data[2:])
>   File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 731, in entry_ops
>     [ESTALE, EINVAL])
>   File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
>     return call(*arg)
>   File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 78, in lsetxattr
>     cls.raise_oserr()
>   File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
>     raise OSError(errn, os.strerror(errn))
> OSError: [Errno 5] Input/output error
> [2015-08-24 14:22:07.652092] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
> [2015-08-24 14:22:07.652364] I [syncdutils(slave):220:finalize] <top>: exiting.
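Looking at that slave traceback, entry_ops calls through an errno_wrap() helper that is passed [ESTALE, EINVAL], so my reading is that those two errnos are tolerated while anything else, like the EIO here, is raised back to the master as "failed on peer with OSError". A simplified sketch of that behaviour as I understand it (not the actual gsyncd code; names and the failing operation are illustrative, and the real helper also retries on some errnos):

```python
import errno
import os

def errno_wrap(call, args, ignore_errnos):
    """Simplified sketch of the wrapper seen in the slave traceback:
    errnos listed in `ignore_errnos` are treated as benign, anything
    else propagates back to the master."""
    try:
        return call(*args)
    except OSError as e:
        if e.errno in ignore_errnos:
            return e.errno  # swallowed, replication carries on
        raise               # surfaces as "failed on peer with OSError"

# Hypothetical failing operation, standing in for lsetxattr:
def fails_with(err):
    raise OSError(err, os.strerror(err))

# ESTALE is tolerated and simply returned...
print(errno_wrap(lambda: fails_with(errno.ESTALE), [],
                 [errno.ESTALE, errno.EINVAL]))

# ...but EIO (Errno 5) is re-raised, which matches what the logs show:
try:
    errno_wrap(lambda: fails_with(errno.EIO), [],
               [errno.ESTALE, errno.EINVAL])
except OSError as e:
    print("re-raised:", e.errno, e.strerror)
```

If that reading is right, the interesting question is why lsetxattr on the slave brick returns EIO in the first place, which is why I mentioned the XFS/Red Hat details above.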
_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
