Hey Rob,

We’re seeing the same issue on our third volume. Have a look at the
logs from just now (below).

Question: you removed the htime files and the old changelogs. Did you
just rm the files, or is there anything else to pay attention to before
removing the changelog files and the htime file?
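
For reference, this is what I had planned to run (a sketch with
placeholder volume/host names, assuming the default changelog layout
under the brick; please correct me if the order is wrong or a step is
missing):

# stop the session and switch changelogging off first
gluster volume geo-replication <mastervol> <slavehost>::<slavevol> stop
gluster volume set <mastervol> geo-replication.indexing off
gluster volume set <mastervol> changelog.changelog off

# then, on each brick, remove the old changelogs and the htime files
rm -f /gluster/vg00/dispersed_fuse1024/brick/.glusterfs/changelogs/CHANGELOG.*
rm -f /gluster/vg00/dispersed_fuse1024/brick/.glusterfs/changelogs/htime/HTIME.*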

Regards,

Felix

[2020-06-25 07:51:53.795430] I [resource(worker
/gluster/vg00/dispersed_fuse1024/brick):1435:connect_remote] SSH: SSH
connection between master and slave established. duration=1.2341
[2020-06-25 07:51:53.795639] I [resource(worker
/gluster/vg00/dispersed_fuse1024/brick):1105:connect] GLUSTER: Mounting
gluster volume locally...
[2020-06-25 07:51:54.520601] I [monitor(monitor):280:monitor] Monitor:
worker died in startup phase brick=/gluster/vg01/dispersed_fuse1024/brick
[2020-06-25 07:51:54.535809] I
[gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker
Status Change    status=Faulty
[2020-06-25 07:51:54.882143] I [resource(worker
/gluster/vg00/dispersed_fuse1024/brick):1128:connect] GLUSTER: Mounted
gluster volume    duration=1.0864
[2020-06-25 07:51:54.882388] I [subcmds(worker
/gluster/vg00/dispersed_fuse1024/brick):84:subcmd_worker] <top>: Worker
spawn successful. Acknowledging back to monitor
[2020-06-25 07:51:56.911412] E [repce(agent
/gluster/vg00/dispersed_fuse1024/brick):121:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 117,
in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py",
line 40, in register
    return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level, retries)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py",
line 46, in cl_register
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py",
line 30, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2020-06-25 07:51:56.912056] E [repce(worker
/gluster/vg00/dispersed_fuse1024/brick):213:__call__] RepceClient: call
failed    call=75086:140098349655872:1593071514.91 method=register   
error=ChangelogException
[2020-06-25 07:51:56.912396] E [resource(worker
/gluster/vg00/dispersed_fuse1024/brick):1286:service_loop] GLUSTER:
Changelog register failed    error=[Errno 2] No such file or directory
[2020-06-25 07:51:56.928031] I [repce(agent
/gluster/vg00/dispersed_fuse1024/brick):96:service_loop] RepceServer:
terminating on reaching EOF.
[2020-06-25 07:51:57.886126] I [monitor(monitor):280:monitor] Monitor:
worker died in startup phase brick=/gluster/vg00/dispersed_fuse1024/brick
[2020-06-25 07:51:57.895920] I
[gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker
Status Change    status=Faulty
[2020-06-25 07:51:58.607405] I [gsyncdstatus(worker
/gluster/vg00/dispersed_fuse1024/brick):287:set_passive] GeorepStatus:
Worker Status Change    status=Passive
[2020-06-25 07:51:58.607768] I [gsyncdstatus(worker
/gluster/vg01/dispersed_fuse1024/brick):287:set_passive] GeorepStatus:
Worker Status Change    status=Passive
[2020-06-25 07:51:58.608004] I [gsyncdstatus(worker
/gluster/vg00/dispersed_fuse1024/brick):281:set_active] GeorepStatus:
Worker Status Change    status=Active


On 25/06/2020 09:15, rob.quaglio...@rabobank.com wrote:

Hi All,

We’ve got two six-node RHEL 7.8 clusters, and geo-replication appears
to be completely broken between them. I’ve deleted the session, removed
and recreated the pem files, and removed the old changelogs/htime
(after removing the relevant options from the volume), then set geo-rep
up again from scratch, but the new session comes up as Initializing,
then goes Faulty and starts looping. The volume (on both sides) is a
4 x 2 disperse running Gluster v6 (RH latest).
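Roughly, the teardown/rebuild cycle we ran was the following (a sketch
from memory, not a transcript; the slave host is a placeholder and the
brick path is just one of ours):

# tear down the session and its changelog state
gluster volume geo-replication prd_mx_intvol <slavehost>::prd_mx_intvol stop force
gluster volume geo-replication prd_mx_intvol <slavehost>::prd_mx_intvol delete
gluster volume set prd_mx_intvol geo-replication.indexing off
gluster volume set prd_mx_intvol changelog.changelog off

# on every brick: remove old changelogs and htime files
rm -f /rhgs/brick20/brick/.glusterfs/changelogs/CHANGELOG.*
rm -f /rhgs/brick20/brick/.glusterfs/changelogs/htime/HTIME.*

# regenerate pem files and recreate the session from scratch
gluster system:: execute gsec_create
gluster volume geo-replication prd_mx_intvol <slavehost>::prd_mx_intvol create push-pem force
gluster volume geo-replication prd_mx_intvol <slavehost>::prd_mx_intvol start

After all of that, gsyncd reports: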

[2020-06-25 07:07:14.701423] I
[gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker
Status Change status=Initializing...

[2020-06-25 07:07:14.701744] I [monitor(monitor):159:monitor] Monitor:
starting gsyncd worker   brick=/rhgs/brick20/brick
slave_node=bxts470194.eu.rabonet.com

[2020-06-25 07:07:14.707997] D [monitor(monitor):230:monitor] Monitor:
Worker would mount volume privately

[2020-06-25 07:07:14.757181] I [gsyncd(agent
/rhgs/brick20/brick):318:main] <top>: Using session config file
path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf

[2020-06-25 07:07:14.758126] D [subcmds(agent
/rhgs/brick20/brick):107:subcmd_agent] <top>: RPC FD     
rpc_fd='5,12,11,10'

[2020-06-25 07:07:14.758627] I [changelogagent(agent
/rhgs/brick20/brick):72:__init__] ChangelogAgent: Agent listining...

[2020-06-25 07:07:14.764234] I [gsyncd(worker
/rhgs/brick20/brick):318:main] <top>: Using session config file
path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf

[2020-06-25 07:07:14.779409] I [resource(worker
/rhgs/brick20/brick):1386:connect_remote] SSH: Initializing SSH
connection between master and slave...

[2020-06-25 07:07:14.841793] D [repce(worker
/rhgs/brick20/brick):195:push] RepceClient: call
6799:140380783982400:1593068834.84 __repce_version__() ...

[2020-06-25 07:07:16.148725] D [repce(worker
/rhgs/brick20/brick):215:__call__] RepceClient: call
6799:140380783982400:1593068834.84 __repce_version__ -> 1.0

[2020-06-25 07:07:16.148911] D [repce(worker
/rhgs/brick20/brick):195:push] RepceClient: call
6799:140380783982400:1593068836.15 version() ...

[2020-06-25 07:07:16.149574] D [repce(worker
/rhgs/brick20/brick):215:__call__] RepceClient: call
6799:140380783982400:1593068836.15 version -> 1.0

[2020-06-25 07:07:16.149735] D [repce(worker
/rhgs/brick20/brick):195:push] RepceClient: call
6799:140380783982400:1593068836.15 pid() ...

[2020-06-25 07:07:16.150588] D [repce(worker
/rhgs/brick20/brick):215:__call__] RepceClient: call
6799:140380783982400:1593068836.15 pid -> 30703

[2020-06-25 07:07:16.150747] I [resource(worker
/rhgs/brick20/brick):1435:connect_remote] SSH: SSH connection between
master and slave established. duration=1.3712

[2020-06-25 07:07:16.150819] I [resource(worker
/rhgs/brick20/brick):1105:connect] GLUSTER: Mounting gluster volume
locally...

[2020-06-25 07:07:16.265860] D [resource(worker
/rhgs/brick20/brick):879:inhibit] DirectMounter: auxiliary glusterfs
mount in place

[2020-06-25 07:07:17.272511] D [resource(worker
/rhgs/brick20/brick):953:inhibit] DirectMounter: auxiliary glusterfs
mount prepared

[2020-06-25 07:07:17.272708] I [resource(worker
/rhgs/brick20/brick):1128:connect] GLUSTER: Mounted gluster
volume      duration=1.1218

[2020-06-25 07:07:17.272794] I [subcmds(worker
/rhgs/brick20/brick):84:subcmd_worker] <top>: Worker spawn successful.
Acknowledging back to monitor

[2020-06-25 07:07:17.272973] D [master(worker
/rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change
detection mode mode=xsync

[2020-06-25 07:07:17.273063] D [monitor(monitor):273:monitor] Monitor:
worker(/rhgs/brick20/brick) connected

[2020-06-25 07:07:17.273678] D [master(worker
/rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change
detection mode mode=changelog

[2020-06-25 07:07:17.274224] D [master(worker
/rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change
detection mode mode=changeloghistory

[2020-06-25 07:07:17.276484] D [repce(worker
/rhgs/brick20/brick):195:push] RepceClient: call
6799:140380783982400:1593068837.28 version() ...

[2020-06-25 07:07:17.276916] D [repce(worker
/rhgs/brick20/brick):215:__call__] RepceClient: call
6799:140380783982400:1593068837.28 version -> 1.0

[2020-06-25 07:07:17.277009] D [master(worker
/rhgs/brick20/brick):777:setup_working_dir] _GMaster: changelog
working dir
/var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick

[2020-06-25 07:07:17.277098] D [repce(worker
/rhgs/brick20/brick):195:push] RepceClient: call
6799:140380783982400:1593068837.28 init() ...

[2020-06-25 07:07:17.292944] D [repce(worker
/rhgs/brick20/brick):215:__call__] RepceClient: call
6799:140380783982400:1593068837.28 init -> None

[2020-06-25 07:07:17.293097] D [repce(worker
/rhgs/brick20/brick):195:push] RepceClient: call
6799:140380783982400:1593068837.29 register('/rhgs/brick20/brick',
'/var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick',
'/var/log/glusterfs/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/changes-rhgs-brick20-brick.log',
8, 5) ...

[2020-06-25 07:07:19.296294] E [repce(agent
/rhgs/brick20/brick):121:worker] <top>: call failed:

Traceback (most recent call last):

  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 117,
in worker

    res = getattr(self.obj, rmeth)(*in_data[2:])

  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py",
line 40, in register

    return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level,
retries)

  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py",
line 46, in cl_register

    cls.raise_changelog_err()

  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py",
line 30, in raise_changelog_err

    raise ChangelogException(errn, os.strerror(errn))

ChangelogException: [Errno 2] No such file or directory

[2020-06-25 07:07:19.297161] E [repce(worker
/rhgs/brick20/brick):213:__call__] RepceClient: call failed       
call=6799:140380783982400:1593068837.29 method=register
error=ChangelogException

[2020-06-25 07:07:19.297338] E [resource(worker
/rhgs/brick20/brick):1286:service_loop] GLUSTER: Changelog register
failed      error=[Errno 2] No such file or directory

[2020-06-25 07:07:19.315074] I [repce(agent
/rhgs/brick20/brick):96:service_loop] RepceServer: terminating on
reaching EOF.

[2020-06-25 07:07:20.275701] I [monitor(monitor):280:monitor] Monitor:
worker died in startup phase     brick=/rhgs/brick20/brick

[2020-06-25 07:07:20.277383] I
[gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker
Status Change status=Faulty

We’ve done everything we can think of, including an “strace -f” on the
pid, and we can’t really find anything. I’m about to lose the last of
my hair over this, so does anyone have any ideas at all? We’ve even
removed the entire slave vol and rebuilt it.
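
(For the record, the trace was along these lines, with a hypothetical
worker pid, filtered to file-related syscalls since the error is an
ENOENT:

strace -f -p 12345 -e trace=file -o /tmp/gsyncd.strace

and we couldn’t spot anything obviously missing in the output.)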

Thanks

Rob

*Rob Quagliozzi*

*Specialised Application Support*





________



Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users