I started rebalancing my volume after upgrading from GlusterFS 3.2.7 to 3.3.1.
After a few hours, I noticed a large number of failures in the rebalance
status:
Node         Rebalanced-files       size      scanned    failures     status
---------    ----------------    --------    ---------   ---------   --------
localhost                   0      0Bytes     4288805           0    stopped
ml55                    26275     206.2MB     4277101       14159    stopped
ml29                        0      0Bytes     4288844           0    stopped
ml31                        0      0Bytes     4288937           0    stopped
ml48                        0      0Bytes     4288927           0    stopped
ml45                    15041      50.8MB     4284304       41999    stopped
ml40                    40690     413.3MB     4269721        1012    stopped
ml41                        0      0Bytes     4288898           0    stopped
ml51                    28558     212.7MB     4277442       32195    stopped
ml46                        0      0Bytes     4288909           0    stopped
ml44                        0      0Bytes     4288824           0    stopped
ml52                        0      0Bytes     4288849           0    stopped
ml30                    14252     183.7MB     4270711       25336    stopped
ml53                    31431     354.9MB     4280450       31098    stopped
ml43                    13773       2.7GB     4285256       28574    stopped
ml47                    37618     241.3MB     4266889       24916    stopped
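For reference, the table above is the output of the standard rebalance CLI run
from one of the peers (hence the "localhost" row); roughly:

    # approximate invocation; standard gluster CLI, volume name as in the info below
    gluster volume rebalance bigdata start     # start migrating data across the bricks
    gluster volume rebalance bigdata status    # per-node migrated/scanned/failure counters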
Those failures prompted me to look at the rebalance log:
[2012-11-30 11:06:12.533580] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-12: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_269615212_b91ff3077e.t7 (00000000-0000-0000-0000-000000000000)
[2012-11-30 11:06:12.533657] E [dht-common.c:1911:dht_getxattr] 0-bigdata-dht: layout is NULL
[2012-11-30 11:06:12.533702] E [dht-rebalance.c:1150:gf_defrag_migrate_data] 0-bigdata-dht: Failed to get node-uuid for /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_269615212_b91ff3077e.t7
[2012-11-30 11:06:12.545497] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-13: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_217961761_965f9f192b.t7 (00000000-0000-0000-0000-000000000000)
[2012-11-30 11:06:12.546039] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-12: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_217961761_965f9f192b.t7 (00000000-0000-0000-0000-000000000000)
[2012-11-30 11:06:12.546159] E [dht-common.c:1911:dht_getxattr] 0-bigdata-dht: layout is NULL
[2012-11-30 11:06:12.546199] E [dht-rebalance.c:1150:gf_defrag_migrate_data] 0-bigdata-dht: Failed to get node-uuid for /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_217961761_965f9f192b.t7
[2012-11-30 11:06:12.617940] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-12: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_211665292_59a24211c3.t7 (00000000-0000-0000-0000-000000000000)
[2012-11-30 11:06:12.618024] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-13: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_211665292_59a24211c3.t7 (00000000-0000-0000-0000-000000000000)
[2012-11-30 11:06:12.618150] E [dht-common.c:1911:dht_getxattr] 0-bigdata-dht: layout is NULL
[2012-11-30 11:06:12.618189] E [dht-rebalance.c:1150:gf_defrag_migrate_data] 0-bigdata-dht: Failed to get node-uuid for /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_211665292_59a24211c3.t7
[2012-11-30 11:06:12.620798] I [dht-common.c:954:dht_lookup_everywhere_cbk] 0-bigdata-dht: deleting stale linkfile /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_282643649_15d4108d95.t7 on bigdata-replicate-6
[...] (at this point, I stopped rebalancing, and got the following in
the logs)
[2012-11-30 11:06:33.152153] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset/bar/f8old/baz/85
[2012-11-30 11:06:33.153628] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset/bar/f8old/baz
[2012-11-30 11:06:33.154641] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset/bar/f8old
[2012-11-30 11:06:33.155602] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset/bar
[2012-11-30 11:06:33.156552] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset
[2012-11-30 11:06:33.157538] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil
[2012-11-30 11:06:33.158526] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data
[2012-11-30 11:06:33.159459] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo
[2012-11-30 11:06:33.160496] I [dht-rebalance.c:1626:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped
[2012-11-30 11:06:33.160518] I [dht-rebalance.c:1629:gf_defrag_status_get] 0-glusterfs: Files migrated: 14252, size: 192620657, lookups: 4270711, failures: 25336
[2012-11-30 11:06:33.173344] W [glusterfsd.c:831:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3d676e811d] (-->/lib64/libpthread.so.0() [0x3d68207851] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xdd) [0x405d4d]))) 0-: received signum (15), shutting down
Errors like these kept appearing many times per second, so I cancelled the
rebalance in case it was doing any damage. Nothing unusual shows up in the
system logs on the brick servers that hold the files mentioned in the errors,
and the files themselves are still accessible through my mounted volume.
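In case the exact commands matter: the rebalance was cancelled through the
standard CLI, and if it helps I can also dump the trusted xattrs of one of the
affected files directly from a brick. A rough sketch (the /mnt/localb brick
root and the file are just examples taken from the volume info and the log
above; which brick actually holds the file would need checking):

    # stop the running rebalance on the volume
    gluster volume rebalance bigdata stop

    # as root on the brick server holding the file; should list trusted.gfid
    # and, for a DHT linkfile, trusted.glusterfs.dht.linkto
    getfattr -d -m . -e hex /mnt/localb/foo/data/onemil/dataset/bar/f8old/baz/85/m_85_269615212_b91ff3077e.t7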
Any idea what might be wrong?
Thanks,
Pierre
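For completeness, the diagnostics below were gathered with the usual commands,
roughly:

    gluster volume info bigdata          # volume layout and reconfigured options
    gluster volume status bigdata        # brick ports, online state, PIDs
    gluster peer status                  # peers as seen from this node
    glusterfs --version                  # installed version
    yum list installed 'glusterfs*'      # package list
    uname -srvmpio                       # kernel (uname -a without the hostname)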
volume info:
Volume Name: bigdata
Type: Distributed-Replicate
Volume ID: 56498956-7b4b-4ee3-9d2b-4c8cfce26051
Status: Started
Number of Bricks: 20 x 2 = 40
Transport-type: tcp
Bricks:
Brick1: ml43:/mnt/localb
Brick2: ml44:/mnt/localb
Brick3: ml43:/mnt/localc
Brick4: ml44:/mnt/localc
Brick5: ml45:/mnt/localb
Brick6: ml46:/mnt/localb
Brick7: ml45:/mnt/localc
Brick8: ml46:/mnt/localc
Brick9: ml47:/mnt/localb
Brick10: ml48:/mnt/localb
Brick11: ml47:/mnt/localc
Brick12: ml48:/mnt/localc
Brick13: ml45:/mnt/locald
Brick14: ml46:/mnt/locald
Brick15: ml47:/mnt/locald
Brick16: ml48:/mnt/locald
Brick17: ml51:/mnt/localb
Brick18: ml52:/mnt/localb
Brick19: ml51:/mnt/localc
Brick20: ml52:/mnt/localc
Brick21: ml51:/mnt/locald
Brick22: ml52:/mnt/locald
Brick23: ml53:/mnt/locald
Brick24: ml54:/mnt/locald
Brick25: ml53:/mnt/localc
Brick26: ml54:/mnt/localc
Brick27: ml53:/mnt/localb
Brick28: ml54:/mnt/localb
Brick29: ml55:/mnt/localb
Brick30: ml29:/mnt/localb
Brick31: ml55:/mnt/localc
Brick32: ml29:/mnt/localc
Brick33: ml30:/mnt/localc
Brick34: ml31:/mnt/localc
Brick35: ml30:/mnt/localb
Brick36: ml31:/mnt/localb
Brick37: ml40:/mnt/localb
Brick38: ml41:/mnt/localb
Brick39: ml40:/mnt/localc
Brick40: ml41:/mnt/localc
Options Reconfigured:
performance.quick-read: on
nfs.disable: on
nfs.register-with-portmap: OFF
volume status:
Status of volume: bigdata
Gluster process                                     Port    Online   Pid
------------------------------------------------------------------------------
Brick ml43:/mnt/localb                              24012   Y        2694
Brick ml44:/mnt/localb                              24012   Y        20374
Brick ml43:/mnt/localc                              24013   Y        2699
Brick ml44:/mnt/localc                              24013   Y        20379
Brick ml45:/mnt/localb                              24012   Y        3147
Brick ml46:/mnt/localb                              24012   Y        25789
Brick ml45:/mnt/localc                              24013   Y        3152
Brick ml46:/mnt/localc                              24013   Y        25794
Brick ml47:/mnt/localb                              24012   Y        3181
Brick ml48:/mnt/localb                              24012   Y        4852
Brick ml47:/mnt/localc                              24013   Y        3186
Brick ml48:/mnt/localc                              24013   Y        4857
Brick ml45:/mnt/locald                              24014   Y        3157
Brick ml46:/mnt/locald                              24014   Y        25799
Brick ml47:/mnt/locald                              24014   Y        3191
Brick ml48:/mnt/locald                              24014   Y        4862
Brick ml51:/mnt/localb                              24009   Y        30251
Brick ml52:/mnt/localb                              24012   Y        28541
Brick ml51:/mnt/localc                              24010   Y        30256
Brick ml52:/mnt/localc                              24013   Y        28546
Brick ml51:/mnt/locald                              24011   Y        30261
Brick ml52:/mnt/locald                              24014   Y        28551
Brick ml53:/mnt/locald                              24012   Y        9229
Brick ml54:/mnt/locald                              24012   Y        9341
Brick ml53:/mnt/localc                              24013   Y        9234
Brick ml54:/mnt/localc                              24013   Y        9346
Brick ml53:/mnt/localb                              24014   Y        9239
Brick ml54:/mnt/localb                              24014   Y        9351
Brick ml55:/mnt/localb                              24012   Y        30904
Brick ml29:/mnt/localb                              24012   Y        29233
Brick ml55:/mnt/localc                              24013   Y        30909
Brick ml29:/mnt/localc                              24013   Y        29238
Brick ml30:/mnt/localc                              24012   Y        6800
Brick ml31:/mnt/localc                              N/A     Y        22000
Brick ml30:/mnt/localb                              24013   Y        6805
Brick ml31:/mnt/localb                              N/A     Y        22005
Brick ml40:/mnt/localb                              24012   Y        26700
Brick ml41:/mnt/localb                              24012   Y        25762
Brick ml40:/mnt/localc                              24013   Y        26705
Brick ml41:/mnt/localc                              24013   Y        25767
Self-heal Daemon on localhost                       N/A     Y        20392
Self-heal Daemon on ml55                            N/A     Y        30922
Self-heal Daemon on ml54                            N/A     Y        9365
Self-heal Daemon on ml52                            N/A     Y        28565
Self-heal Daemon on ml29                            N/A     Y        29253
Self-heal Daemon on ml30                            N/A     Y        6818
Self-heal Daemon on ml43                            N/A     Y        2712
Self-heal Daemon on ml47                            N/A     Y        3205
Self-heal Daemon on ml46                            N/A     Y        25813
Self-heal Daemon on ml40                            N/A     Y        26717
Self-heal Daemon on ml31                            N/A     Y        22038
Self-heal Daemon on ml48                            N/A     Y        4876
Self-heal Daemon on ml45                            N/A     Y        3171
Self-heal Daemon on ml51                            N/A     Y        30274
Self-heal Daemon on ml41                            N/A     Y        25779
Self-heal Daemon on ml53                            N/A     Y        9253
peer status:
Number of Peers: 15
Hostname: ml52
Uuid: 4de42f67-4cca-4d28-8600-9018172563ba
State: Peer in Cluster (Connected)
Hostname: ml41
Uuid: b404851f-dfd5-4746-a3bd-81bb0d888009
State: Peer in Cluster (Connected)
Hostname: ml46
Uuid: af74d39b-09d6-47ba-9c3b-72d993dca4ce
State: Peer in Cluster (Connected)
Hostname: ml54
Uuid: c55580fa-2c9d-493d-b9d1-3bce016c8b29
State: Peer in Cluster (Connected)
Hostname: ml51
Uuid: 5491b6dc-0f96-43d9-95d9-a41018a8542c
State: Peer in Cluster (Connected)
Hostname: ml48
Uuid: efd79145-bfd9-4eea-b7a7-50be18d9ffe0
State: Peer in Cluster (Connected)
Hostname: ml43
Uuid: a9044e9a-39e1-4907-8921-43da870b7f31
State: Peer in Cluster (Connected)
Hostname: ml45
Uuid: 0eebbceb-8f62-4c55-8160-41348f90e191
State: Peer in Cluster (Connected)
Hostname: ml47
Uuid: e831092d-b196-46ec-947d-a5635e8fbd1e
State: Peer in Cluster (Connected)
Hostname: ml30
Uuid: e56b4c57-a058-4464-a1e6-c4676ebf00cc
State: Peer in Cluster (Connected)
Hostname: ml40
Uuid: ffcc06ae-100a-4fa2-888e-803a41ae946c
State: Peer in Cluster (Connected)
Hostname: ml55
Uuid: 366339ed-52e5-4722-a1b3-e3bb1c49ea4f
State: Peer in Cluster (Connected)
Hostname: ml31
Uuid: 699019f6-2f4a-45cb-bfa4-f209745f8a6d
State: Peer in Cluster (Connected)
Hostname: ml29
Uuid: 58aa8a16-5d2b-4c06-8f06-2fd0f7fc5a37
State: Peer in Cluster (Connected)
Hostname: ml53
Uuid: 1dc6ee08-c606-4755-8756-b553f66efa88
State: Peer in Cluster (Connected)
gluster version:
glusterfs 3.3.1 built on Oct 11 2012 21:49:37
rpms:
glusterfs.x86_64 3.3.1-1.el6 @glusterfs-epel
glusterfs-debuginfo.x86_64 3.3.1-1.el6 @glusterfs-epel
glusterfs-fuse.x86_64 3.3.1-1.el6 @glusterfs-epel
glusterfs-rdma.x86_64 3.3.1-1.el6 @glusterfs-epel
glusterfs-server.x86_64 3.3.1-1.el6 @glusterfs-epel
kernel:
Linux 2.6.32-131.17.1.el6.x86_64 #1 SMP Wed Oct 5 17:19:54 CDT 2011 x86_64 x86_64 x86_64 GNU/Linux
OS: Scientific Linux 6.1 (a RHEL 6 rebuild, similar to CentOS)