So the version of rsync is 3.1.0, but the bug mentioned only applies to
large files, whereas in my case the files are smaller than 1 MB.
I've started digging through the logs and found a bunch of these on the
slave:
[2015-11-20 11:40:46.730805] W [fuse-bridge.c:1978:fuse_create_cbk] 0-glusterfs-fuse: 1882288: /.gfid/31d66429-c700-4a10-bb32-35e1b36a479f => -1 (Operation not permitted)
[2015-11-20 12:39:59.269844] W [fuse-bridge.c:1978:fuse_create_cbk] 0-glusterfs-fuse: 1918306: /.gfid/6802a0c6-1f62-4213-a70d-7b46d9ff8f3a => -1 (Operation not permitted)
So something funky was happening for about an hour four days ago
(2015-11-20, roughly 11:40 to 12:40). Given the volume is on EBS, maybe
there was some glitch there.
I can also find the corresponding failures on the master:
[2015-11-20 11:40:14.93090] W [master(/data/media):803:log_failures] _GMaster: ENTRY FAILED: ({'uid': 33, 'gfid': '31d66429-c700-4a10-bb32-35e1b36a479f', 'gid': 33, 'mode': 33206, 'entry': '.gfid/b1dc6c6d-dac7-4da9-9577-4614942a72a0/official-nightmare-before-christmas-vampire-teddy-girls-dress-body-web.jpg', 'op': 'CREATE'}, 17, 'df0e67f5-f2ce-45c3-b4f1-224aa3059ec7')
[2015-11-20 11:40:14.265054] W [master(/data/media):803:log_failures] _GMaster: META FAILED: ({'go': '.gfid/31d66429-c700-4a10-bb32-35e1b36a479f', 'stat': {'atime': 1448019600.232466, 'gid': 33, 'mtime': 1448019600.316466, 'mode': 33279, 'uid': 33}, 'op': 'META'}, 2)
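As a rough aid to reading the two failure lines above, here is a small Python sketch that decodes the numeric fields. It assumes that the bare integers following each dict (17 and 2) are errno values and that 'mode' is a standard st_mode value; the logs themselves do not say so, and describe_failure is just a helper name made up for this example.

```python
import errno
import os
import stat


def describe_failure(mode, err):
    """Return (file type and permissions, errno name, errno message).

    Assumption: 'mode' is an st_mode value and 'err' is an errno,
    which is how the log lines above appear to use them.
    """
    return stat.filemode(mode), errno.errorcode.get(err, "?"), os.strerror(err)


# Values copied from the ENTRY FAILED and META FAILED lines.
print(describe_failure(33206, 17))  # CREATE entry: mode 33206, code 17
print(describe_failure(33279, 2))   # META entry: mode 33279, code 2
```

Under those assumptions, the CREATE would have failed because an entry with that name already existed on the slave (EEXIST), and the META would have failed because the target was missing (ENOENT).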
If I grep for SKIPPED GFID I get the following:
[2015-11-20 11:40:40.704817] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID =
192632af-28c5-4e03-a62d-458fe7f3b5f9,7ea8d7a8-524b-4dd0-b97a-dc7d3481f341,204f6112-0e8d-4f6d-855b-bf10f9c63b62,7e626e8f-edad-4f39-a6c6-547a1da34aa1,1f0d0208-1962-4eb1-91d4-cf7ed297d8e3,95d389c4-3258-4ca0-8fc4-26b8427b1eaf,425cedc6-6343-4326-8540-996d2d56dc9c,5955928b-2b8f-4cc9-a336-3eac4382789b,8932efcd-ba90-46ec-84c8-5e9e51cc84e9,2530275d-5f03-4143-9abf-d07cc79bf80a,73574466-86f3-4ab2-b5da-c31ac28c27c1,776e5e8f-5c6a-46b1-ad54-733e157d2097,008a69f3-217c-4dbc-a469-5a5bc8ecd589,dca8d8d9-03cf-4793-92e4-bfcfddd262f6,c85b7a29-73af-4f44-a07e-a44082d7a93a,6c1f56d6-4ea6-4910-9677-ea33edd35d28,0ea56588-87fa-4355-9403-e311525454fc,c8ce76c9-e21d-46ce-a2b5-14dfd0070f64,db9e6484-0e5e-4f6e-815b-3c2b273deee5,35d10752-43b5-4398-be5f-17cb9de73a6b,396e5faf-74a1-4849-97e3-009dbfb22836,d148e7d5-c2f3-4d06-8cd6-8588e6aac196,404d20c5-1c6c-4aad-98be-2c23930173b3,f1fae11c-db8e-4cd5-8e47-a3870316f89c,d8daa413-e57f-44fb-b907-b1a497f2dcfa,5f6ee8c2-84fb-432e-95cd-e428ab256e83,6bf54dcd-c3b4-4187-a390-eca841e46570,335c07ca-d339-4d3a-aa88-3b5753d24fbf,8fdbac00-6628-4f22-8fb4-b7a6524cae49,31d66429-c700-4a10-bb32-35e1b36a479f
[2015-11-20 11:41:35.907850] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID =
03069c7f-8eaa-45b0-92ed-50cb648cd912,788f5ed1-923e-4b86-9696-2a6de07ebb2e,43d12b40-b6e2-43c4-8883-85e89dc81321
[2015-11-20 12:11:55.492068] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID =
eb02369f-7ca8-480a-b00c-768964410ed8,17045ac9-27dd-4bf9-9f90-d7b146070dd5,265e3d9c-1657-45cb-bbf6-db439eb18ccf,553c420f-b3cc-47f2-8d5f-cfc2ffdd1a92
[2015-11-20 12:12:53.372432] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID =
66c5878e-8c00-4f7d-a3ad-4adec84a5e22,f4dc086d-9c2b-449c-9e31-bbae9ebcdea7,f99317b2-72e8-49e3-b676-647abad508b1
[2015-11-20 12:37:55.773813] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID =
4af54f1c-e8e1-4915-9328-a458d5d35d5d,acbe1f12-87e8-4192-b864-d90030269bba,7d27a795-da63-4742-9e91-abd8fa543612,8d4e642d-fd40-44d6-8419-8d3459df7ce3
[2015-11-20 12:39:28.852575] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID =
d90dc121-02e7-4a79-bc03-1bd8fddd9f48,54bb563f-ab44-4e91-a46b-764a122ce7fa,088141de-7545-40f9-b776-751738a89740,2dab3faf-4a6c-407a-88cd-cddef6f55299,d887806f-23b4-4389-a4dc-f9027702a2df,fc5a9bc8-ea62-4677-baed-16510541373a,33136ad2-c5b4-448c-991d-1e72fefef021,cf3e2675-e41b-4782-9478-91773eb0a4aa,6412d878-e0f1-4700-84df-05f4af35962f,ec3cf6e1-7f27-4650-b978-8a5a7f620389,d3651bb9-cd2d-4c5f-93e6-fe4fb1cdf5db,ecb0415e-1524-40f4-870e-1fd0f8371b1d,a118aaae-bd3e-4b19-a0e0-891aa9edb09a,7642d3f3-f1e5-4aca-bcfe-bdb3c44779a9,2e29f3f8-c460-48eb-9db5-b281b67cc2bf,e61db54b-3979-488a-8789-a5d0615c5a97,4212d840-9c22-4d9e-b61b-5e35271dfe80,dad1c60b-9da6-4e57-b014-daa1aca73ce3,93699a3d-40b8-4bbd-b78f-aabf965df57f,4fad7468-91f2-4deb-aaf7-6401068c9e6d,c9738295-46cc-4fe7-b359-dc94f5815ce9,91853c5c-4877-4c9e-9481-c86368942f78,59deed8e-d3d0-4ab7-854e-53a8dd455de0,20b86c13-7df1-4d13-bac1-7d628a00d6ce,b7b86a2d-7963-41a4-a423-14e25d1e78c4,3c17d7fe-bb7f-489c-a525-5c8b7bb93c3e,e230d207-7c68-4983-a958-f2dcfc1ce694,fa8bf3c0-abae-446c-83c5-45ef8bcaa4b8,14089102-8106-45d9-a3f1-d1446b568f4e,6802a0c6-1f62-4213-a70d-7b46d9ff8f3a,0a253bbc-ef98-4da0-951f-e17c5a7f5858,ef054b76-986b-4a89-b8e6-b4988221aaa2,48c0a153-708c-44ee-b186-cf255936a02b,fa2646a6-807c-4e9d-8f2b-a9cdf2674e0c,1ed4a563-4f6a-4b5a-9866-89025fe7afd5,0f293cf7-bc32-4f8a-87d5-388a4bffb4af,f4126726-667b-451d-8214-a18bb3f468cd,e23dc8b3-da1c-4d18-aec9-22e0aa174d81,40b9f10d-7304-4c0b-8498-bef23b305d03,15c25d1e-2a62-495e-887f-14d0cb0527b1,67371804-9084-4801-b664-44e88bea8ac3,4750fa3f-d1a4-4472-b10d-3f75d0b451dc
[2015-11-23 09:18:10.43391] W [master(/data/media):1014:process] _GMaster: SKIPPED GFID =
228843f3-62f0-4687-b5eb-6d1e21257ad0,b0078359-fbf0-4709-8f40-8383a11d7875,60cff4d5-8b5d-4f7f-8bc1-27081a011458,bedb6ac4-208d-47e1-812c-5547c84ab841,da6810d9-4883-45e1-b73e-55a7ff17b5e7,e03b5c03-b25c-49ba-86f0-8a709a9c2658,053673a0-c1cc-4057-83fa-f97740cb5d4f,dbd6ea84-8f24-4a47-ac41-22c3fd788ecf,43caa3e7-ca04-47ab-b950-105606b313a4,62d8b1d0-fc89-4fb1-a41a-957dcb34d325,4e8fe1fa-60cd-47fa-bad6-f617c312f53b,6c3d6cf3-62ae-4ab8-9dc3-7815552401fe,f79be814-7e78-4985-bcdd-688da23d1808,c4186455-0f06-4b5d-89be-3c5ccbdeb6f0,f9c4ccdb-2337-479d-845d-ee4d85b69ece,bcd14726-1bab-4d97-8915-ec8bbe8faf8c,cca82341-a430-4a59-a900-1af66dcf7bb8,b7043a8e-4286-4831-91ec-c146e40bc6be,995ffeb6-a906-4078-88c6-404a2b38aad4,227f9987-5057-4133-848a-2b22aca5dde1,90b35242-32db-4570-8070-cf9dd49322a5,c6863c8f-1914-4a2d-814b-6e5853134faf,e2d19b1a-fc07-441c-b110-ca816b46fc40,9a3d0c0b-7d84-416f-9f3e-21b32a11ba1d,d8163f6b-8c40-418c-9c06-b3743af24e4e,522d7247-a75b-4af9-acb2-52a99eeced89,4b56ea9d-413a-4e24-b44e-433f7603ad6d
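Rather than reading these lists by eye, the GFIDs can be pulled out of the master log with a short script. This is only a sketch: skipped_gfids is a hypothetical helper, and the sample text is one of the entries above pasted in. For a real run the contents of the geo-replication log file on the master would be passed in instead.

```python
import re

# Matches the comma-separated GFID list that follows "SKIPPED GFID =".
# \s* also tolerates a line break after "=", as in mail-wrapped copies.
SKIPPED = re.compile(r"SKIPPED GFID =\s*([0-9a-f,-]+)")


def skipped_gfids(log_text):
    """Return the unique GFIDs from all SKIPPED GFID entries, first-seen order."""
    seen = {}
    for match in SKIPPED.finditer(log_text):
        for gfid in match.group(1).split(","):
            if gfid:
                seen.setdefault(gfid, None)
    return list(seen)


# One entry copied from the log excerpt above.
sample = (
    "[2015-11-20 11:41:35.907850] W [master(/data/media):1014:process] "
    "_GMaster: SKIPPED GFID = "
    "03069c7f-8eaa-45b0-92ed-50cb648cd912,"
    "788f5ed1-923e-4b86-9696-2a6de07ebb2e,"
    "43d12b40-b6e2-43c4-8883-85e89dc81321\n"
)

for gfid in skipped_gfids(sample):
    print(gfid)
```

A dict is used instead of a set so that the GFIDs come out in the order they were first logged, which makes it easier to line them up with the timestamps.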
There are also the following lines on the master, which might be
relevant:
E [MSGID: 108008] [afr-read-txn.c:89:afr_read_txn_refresh_done] 0-media-replicate-0: Failing READ on gfid abdc7d5e-9187-4916-ae83-a8b615e32a17: split-brain observed. [Input/output error]
E [MSGID: 108008] [afr-read-txn.c:89:afr_read_txn_refresh_done] 0-media-replicate-0: Failing GETXATTR on gfid abdc7d5e-9187-4916-ae83-a8b615e32a17: split-brain observed. [Input/output error]
E [mem-pool.c:417:mem_get0] (-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x809a2) [0x7f79e436b9a2] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg+0x79f) [0x7f79e430cb1f] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(mem_get0+0x81) [0x7f79e433e4a1] ) 0-mem-pool: invalid argument [Invalid argument]
E [mem-pool.c:417:mem_get0] (-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(recursive_rmdir+0x192) [0x7f79e4329b32] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg+0x79f) [0x7f79e430cb1f] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(mem_get0+0x81) [0x7f79e433e4a1] ) 0-mem-pool: invalid argument [Invalid argument]
E [resource(/data/media):222:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-dpY5cI/8216bb7da58a00926f369bb7ac8c7e03.sock [email protected] /usr/lib/x86_64-linux-gnu/glusterfs/gsyncd --session-owner 6922055e-49a1-4afd-a3a0-a47960d6ba54 -N --listen --timeout 120 gluster://localhost:media" returned with 143, saying:
E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18 21:57:19.772896] I [cli.c:721:main] 0-cli: Started running /usr/sbin/gluster with version 3.7.5
E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18 21:57:19.772955] I [cli.c:608:cli_rpc_init] 0-cli: Connecting to remote glusterd at localhost
E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18 21:57:19.871930] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18 21:57:19.872018] I [socket.c:2355:socket_event_handler] 0-transport: disconnecting now
E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18 21:57:19.872898] I [cli-rpc-ops.c:6348:gf_cli_getwd_cbk] 0-cli: Received resp to getwd
E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18 21:57:19.872963] I [input.c:36:cli_batch] 0-: Exiting with: 0
Status detail shows the following:
root@eu-gluster-1:/var/log/glusterfs/geo-replication/media# gluster volume geo-replication media [email protected]::media status detail
MASTER NODE:                eu-gluster-1.websitewebsitewebs.com
MASTER VOL:                 media
MASTER BRICK:               /data/media
SLAVE USER:                 root
SLAVE:                      us-west-gluster.websitewebsitewebs.com::media
SLAVE NODE:                 us-west-gluster.websitewebsitewebs.com
STATUS:                     Active
CRAWL STATUS:               Changelog Crawl
LAST_SYNCED:                2015-11-24 20:59:25
ENTRY:                      0
DATA:                       0
META:                       0
FAILURES:                   633
CHECKPOINT TIME:            N/A
CHECKPOINT COMPLETED:       N/A
CHECKPOINT COMPLETION TIME: N/A

MASTER NODE:                eu-gluster-2.websitewebsitewebs.com
MASTER VOL:                 media
MASTER BRICK:               /data/media
SLAVE USER:                 root
SLAVE:                      us-west-gluster.websitewebsitewebs.com::media
SLAVE NODE:                 us-west-gluster.websitewebsitewebs.com
STATUS:                     Passive
CRAWL STATUS:               N/A
LAST_SYNCED:                N/A
ENTRY:                      N/A
DATA:                       N/A
META:                       N/A
FAILURES:                   N/A
CHECKPOINT TIME:            N/A
CHECKPOINT COMPLETED:       N/A
CHECKPOINT COMPLETION TIME: N/A
What is the right way to retry failed items?
Can I get a list of them somehow, so that I could touch them in the hope
of fixing this?
Why does it not retry the items automatically?
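On getting from a GFID to an actual file: as far as I know, each brick keeps a handle for every file under .glusterfs/<first two hex chars>/<next two>/<gfid>, and for regular files that handle is a hard link to the real file. A sketch under that assumption is below; gfid_backend_path and names_for_gfid are hypothetical helper names, and /data/media is the brick path from the status output.

```python
import os
import uuid


def gfid_backend_path(brick_root, gfid):
    """Build the assumed .glusterfs handle path for a GFID on a brick."""
    canonical = str(uuid.UUID(gfid))  # validates the GFID and lowercases it
    return os.path.join(
        brick_root, ".glusterfs", canonical[:2], canonical[2:4], canonical
    )


def names_for_gfid(brick_root, gfid):
    """Yield brick paths sharing an inode with the GFID handle.

    Only meaningful for regular files, and it walks the whole brick,
    so it will be slow with a few hundred thousand files.
    """
    handle = os.stat(gfid_backend_path(brick_root, gfid))
    for dirpath, dirnames, filenames in os.walk(brick_root):
        if ".glusterfs" in dirnames:
            dirnames.remove(".glusterfs")  # do not descend into the handles
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)
            if (info.st_dev, info.st_ino) == (handle.st_dev, handle.st_ino):
                yield path


# GFID taken from the ENTRY FAILED line; only the path is computed here.
print(gfid_backend_path("/data/media", "31d66429-c700-4a10-bb32-35e1b36a479f"))
```

names_for_gfid would have to be run on a master brick as a user who can read it. The paths it yields are brick paths, so any touch would need to go through the mounted volume rather than the brick directly.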
On Tue, Nov 24, 2015 at 6:11 AM, Venky Shankar <[email protected]> wrote:
> On Tue, Nov 24, 2015 at 1:23 AM, Audrius Butkevicius <[email protected]> wrote:
>> Hi,
>>
>> I've got a geo-replicated gluster volume, with a few hundred thousand
>> images, which get generated on demand.
>>
>> I started getting replication failures in the status detail view, but it's
>> not obvious to me where to find the actual errors or how to actually fix
>> them.
>
> Chris here[1] mentioned about a bug in rsync (thanks!). Could that be
> the issue here?
>
> Mind checking rsync version used?
>
> [1]: http://www.gluster.org/pipermail/gluster-users/2015-November/024423.html
>
>> The docs seem to be secretive about this as well. It seems if I tear the
>> geo-replication down, and do a force create from scratch, it goes back in
>> sync again, but as the files get generated, it starts getting failures again
>> at some point.
>>
>> Can someone provide me with information on how to check which files are
>> causing failures, and what are the actual failures? Or point me to the
>> relevant part in the docs?
>>
>> Version 3.7.5-ubuntu1~trusty1
>>
>> Related SO question:
>> http://stackoverflow.com/questions/33839056/gluster-geo-replication-debugging-failures
>>
>> Thanks,
>> Audrius.
_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users