I started rebalancing my volume after upgrading from GlusterFS 3.2.7 to 3.3.1.
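For reference, I kicked off and monitored the rebalance with the standard CLI commands (the volume is named "bigdata"):

  gluster volume rebalance bigdata start
  gluster volume rebalance bigdata status

After a few hours, I noticed a large number of failures in the rebalance status: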

Node       Rebalanced-files   size      scanned   failures   status
---------  ----------------   -------   -------   --------   -------
localhost  0                  0Bytes    4288805   0          stopped
ml55       26275              206.2MB   4277101   14159      stopped
ml29       0                  0Bytes    4288844   0          stopped
ml31       0                  0Bytes    4288937   0          stopped
ml48       0                  0Bytes    4288927   0          stopped
ml45       15041              50.8MB    4284304   41999      stopped
ml40       40690              413.3MB   4269721   1012       stopped
ml41       0                  0Bytes    4288898   0          stopped
ml51       28558              212.7MB   4277442   32195      stopped
ml46       0                  0Bytes    4288909   0          stopped
ml44       0                  0Bytes    4288824   0          stopped
ml52       0                  0Bytes    4288849   0          stopped
ml30       14252              183.7MB   4270711   25336      stopped
ml53       31431              354.9MB   4280450   31098      stopped
ml43       13773              2.7GB     4285256   28574      stopped
ml47       37618              241.3MB   4266889   24916      stopped

This prompted me to look at the rebalance log:

[2012-11-30 11:06:12.533580] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-12: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_269615212_b91ff3077e.t7 (00000000-0000-0000-0000-000000000000)
[2012-11-30 11:06:12.533657] E [dht-common.c:1911:dht_getxattr] 0-bigdata-dht: layout is NULL
[2012-11-30 11:06:12.533702] E [dht-rebalance.c:1150:gf_defrag_migrate_data] 0-bigdata-dht: Failed to get node-uuid for /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_269615212_b91ff3077e.t7
[2012-11-30 11:06:12.545497] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-13: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_217961761_965f9f192b.t7 (00000000-0000-0000-0000-000000000000)
[2012-11-30 11:06:12.546039] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-12: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_217961761_965f9f192b.t7 (00000000-0000-0000-0000-000000000000)
[2012-11-30 11:06:12.546159] E [dht-common.c:1911:dht_getxattr] 0-bigdata-dht: layout is NULL
[2012-11-30 11:06:12.546199] E [dht-rebalance.c:1150:gf_defrag_migrate_data] 0-bigdata-dht: Failed to get node-uuid for /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_217961761_965f9f192b.t7
[2012-11-30 11:06:12.617940] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-12: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_211665292_59a24211c3.t7 (00000000-0000-0000-0000-000000000000)
[2012-11-30 11:06:12.618024] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-bigdata-client-13: remote operation failed: File exists. Path: /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_211665292_59a24211c3.t7 (00000000-0000-0000-0000-000000000000)
[2012-11-30 11:06:12.618150] E [dht-common.c:1911:dht_getxattr] 0-bigdata-dht: layout is NULL
[2012-11-30 11:06:12.618189] E [dht-rebalance.c:1150:gf_defrag_migrate_data] 0-bigdata-dht: Failed to get node-uuid for /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_211665292_59a24211c3.t7
[2012-11-30 11:06:12.620798] I [dht-common.c:954:dht_lookup_everywhere_cbk] 0-bigdata-dht: deleting stale linkfile /foo/data/onemil/dataset/bar/f8old/baz/85/m_85_282643649_15d4108d95.t7 on bigdata-replicate-6

[...] (at this point I stopped the rebalance and got the following in the logs:)
[2012-11-30 11:06:33.152153] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset/bar/f8old/baz/85
[2012-11-30 11:06:33.153628] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset/bar/f8old/baz
[2012-11-30 11:06:33.154641] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset/bar/f8old
[2012-11-30 11:06:33.155602] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset/bar
[2012-11-30 11:06:33.156552] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil/dataset
[2012-11-30 11:06:33.157538] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data/onemil
[2012-11-30 11:06:33.158526] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo/data
[2012-11-30 11:06:33.159459] E [dht-rebalance.c:1374:gf_defrag_fix_layout] 0-bigdata-dht: Fix layout failed for /foo
[2012-11-30 11:06:33.160496] I [dht-rebalance.c:1626:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped
[2012-11-30 11:06:33.160518] I [dht-rebalance.c:1629:gf_defrag_status_get] 0-glusterfs: Files migrated: 14252, size: 192620657, lookups: 4270711, failures: 25336
[2012-11-30 11:06:33.173344] W [glusterfsd.c:831:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3d676e811d] (-->/lib64/libpthread.so.0() [0x3d68207851] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xdd) [0x405d4d]))) 0-: received signum (15), shutting down


Errors like these kept appearing many times per second, so I cancelled the rebalance in case it was doing any damage (commands below). There doesn't seem to be anything unusual in the system logs of the bricks that hold the files named in the errors, and the files are still accessible through my mounted volume.
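For completeness, here is how I stopped the rebalance, plus the kind of check I can run on one of the affected files if it would help with debugging ($MOUNT and $BRICK below are placeholders for my client mount point and a brick root; the getfattr check assumes the attr package is installed):

  # stop the in-progress rebalance (this produced the "shutting down" messages above)
  gluster volume rebalance bigdata stop

  # verify that a file named in the errors is still readable through the FUSE mount
  md5sum $MOUNT/foo/data/onemil/dataset/bar/f8old/baz/85/m_85_269615212_b91ff3077e.t7

  # on a brick holding that file, dump the DHT xattrs; linkfiles such as the
  # "stale linkfile" in the log carry the trusted.glusterfs.dht.linkto xattr
  getfattr -m . -d -e hex $BRICK/foo/data/onemil/dataset/bar/f8old/baz/85/m_85_269615212_b91ff3077e.t7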

Any idea what might be wrong?

Thanks,

Pierre



volume info:
Volume Name: bigdata
Type: Distributed-Replicate
Volume ID: 56498956-7b4b-4ee3-9d2b-4c8cfce26051
Status: Started
Number of Bricks: 20 x 2 = 40
Transport-type: tcp
Bricks:
Brick1: ml43:/mnt/localb
Brick2: ml44:/mnt/localb
Brick3: ml43:/mnt/localc
Brick4: ml44:/mnt/localc
Brick5: ml45:/mnt/localb
Brick6: ml46:/mnt/localb
Brick7: ml45:/mnt/localc
Brick8: ml46:/mnt/localc
Brick9: ml47:/mnt/localb
Brick10: ml48:/mnt/localb
Brick11: ml47:/mnt/localc
Brick12: ml48:/mnt/localc
Brick13: ml45:/mnt/locald
Brick14: ml46:/mnt/locald
Brick15: ml47:/mnt/locald
Brick16: ml48:/mnt/locald
Brick17: ml51:/mnt/localb
Brick18: ml52:/mnt/localb
Brick19: ml51:/mnt/localc
Brick20: ml52:/mnt/localc
Brick21: ml51:/mnt/locald
Brick22: ml52:/mnt/locald
Brick23: ml53:/mnt/locald
Brick24: ml54:/mnt/locald
Brick25: ml53:/mnt/localc
Brick26: ml54:/mnt/localc
Brick27: ml53:/mnt/localb
Brick28: ml54:/mnt/localb
Brick29: ml55:/mnt/localb
Brick30: ml29:/mnt/localb
Brick31: ml55:/mnt/localc
Brick32: ml29:/mnt/localc
Brick33: ml30:/mnt/localc
Brick34: ml31:/mnt/localc
Brick35: ml30:/mnt/localb
Brick36: ml31:/mnt/localb
Brick37: ml40:/mnt/localb
Brick38: ml41:/mnt/localb
Brick39: ml40:/mnt/localc
Brick40: ml41:/mnt/localc
Options Reconfigured:
performance.quick-read: on
nfs.disable: on
nfs.register-with-portmap: OFF


volume status:
Status of volume: bigdata
Gluster process                                         Port Online  Pid
------------------------------------------------------------------------------
Brick ml43:/mnt/localb                                  24012 Y       2694
Brick ml44:/mnt/localb                                  24012 Y       20374
Brick ml43:/mnt/localc                                  24013 Y       2699
Brick ml44:/mnt/localc                                  24013 Y       20379
Brick ml45:/mnt/localb                                  24012 Y       3147
Brick ml46:/mnt/localb                                  24012 Y       25789
Brick ml45:/mnt/localc                                  24013 Y       3152
Brick ml46:/mnt/localc                                  24013 Y       25794
Brick ml47:/mnt/localb                                  24012 Y       3181
Brick ml48:/mnt/localb                                  24012 Y       4852
Brick ml47:/mnt/localc                                  24013 Y       3186
Brick ml48:/mnt/localc                                  24013 Y       4857
Brick ml45:/mnt/locald                                  24014 Y       3157
Brick ml46:/mnt/locald                                  24014 Y       25799
Brick ml47:/mnt/locald                                  24014 Y       3191
Brick ml48:/mnt/locald                                  24014 Y       4862
Brick ml51:/mnt/localb                                  24009 Y       30251
Brick ml52:/mnt/localb                                  24012 Y       28541
Brick ml51:/mnt/localc                                  24010 Y       30256
Brick ml52:/mnt/localc                                  24013 Y       28546
Brick ml51:/mnt/locald                                  24011 Y       30261
Brick ml52:/mnt/locald                                  24014 Y       28551
Brick ml53:/mnt/locald                                  24012 Y       9229
Brick ml54:/mnt/locald                                  24012 Y       9341
Brick ml53:/mnt/localc                                  24013 Y       9234
Brick ml54:/mnt/localc                                  24013 Y       9346
Brick ml53:/mnt/localb                                  24014 Y       9239
Brick ml54:/mnt/localb                                  24014 Y       9351
Brick ml55:/mnt/localb                                  24012 Y       30904
Brick ml29:/mnt/localb                                  24012 Y       29233
Brick ml55:/mnt/localc                                  24013 Y       30909
Brick ml29:/mnt/localc                                  24013 Y       29238
Brick ml30:/mnt/localc                                  24012 Y       6800
Brick ml31:/mnt/localc                                  N/A   Y       22000
Brick ml30:/mnt/localb                                  24013 Y       6805
Brick ml31:/mnt/localb                                  N/A   Y       22005
Brick ml40:/mnt/localb                                  24012 Y       26700
Brick ml41:/mnt/localb                                  24012 Y       25762
Brick ml40:/mnt/localc                                  24013 Y       26705
Brick ml41:/mnt/localc                                  24013 Y       25767
Self-heal Daemon on localhost                           N/A Y       20392
Self-heal Daemon on ml55                                N/A Y       30922
Self-heal Daemon on ml54                                N/A Y       9365
Self-heal Daemon on ml52                                N/A Y       28565
Self-heal Daemon on ml29                                N/A Y       29253
Self-heal Daemon on ml30                                N/A Y       6818
Self-heal Daemon on ml43                                N/A Y       2712
Self-heal Daemon on ml47                                N/A Y       3205
Self-heal Daemon on ml46                                N/A Y       25813
Self-heal Daemon on ml40                                N/A Y       26717
Self-heal Daemon on ml31                                N/A Y       22038
Self-heal Daemon on ml48                                N/A Y       4876
Self-heal Daemon on ml45                                N/A Y       3171
Self-heal Daemon on ml51                                N/A Y       30274
Self-heal Daemon on ml41                                N/A Y       25779
Self-heal Daemon on ml53                                N/A Y       9253

peer status:
Number of Peers: 15

Hostname: ml52
Uuid: 4de42f67-4cca-4d28-8600-9018172563ba
State: Peer in Cluster (Connected)

Hostname: ml41
Uuid: b404851f-dfd5-4746-a3bd-81bb0d888009
State: Peer in Cluster (Connected)

Hostname: ml46
Uuid: af74d39b-09d6-47ba-9c3b-72d993dca4ce
State: Peer in Cluster (Connected)

Hostname: ml54
Uuid: c55580fa-2c9d-493d-b9d1-3bce016c8b29
State: Peer in Cluster (Connected)

Hostname: ml51
Uuid: 5491b6dc-0f96-43d9-95d9-a41018a8542c
State: Peer in Cluster (Connected)

Hostname: ml48
Uuid: efd79145-bfd9-4eea-b7a7-50be18d9ffe0
State: Peer in Cluster (Connected)

Hostname: ml43
Uuid: a9044e9a-39e1-4907-8921-43da870b7f31
State: Peer in Cluster (Connected)

Hostname: ml45
Uuid: 0eebbceb-8f62-4c55-8160-41348f90e191
State: Peer in Cluster (Connected)

Hostname: ml47
Uuid: e831092d-b196-46ec-947d-a5635e8fbd1e
State: Peer in Cluster (Connected)

Hostname: ml30
Uuid: e56b4c57-a058-4464-a1e6-c4676ebf00cc
State: Peer in Cluster (Connected)

Hostname: ml40
Uuid: ffcc06ae-100a-4fa2-888e-803a41ae946c
State: Peer in Cluster (Connected)

Hostname: ml55
Uuid: 366339ed-52e5-4722-a1b3-e3bb1c49ea4f
State: Peer in Cluster (Connected)

Hostname: ml31
Uuid: 699019f6-2f4a-45cb-bfa4-f209745f8a6d
State: Peer in Cluster (Connected)

Hostname: ml29
Uuid: 58aa8a16-5d2b-4c06-8f06-2fd0f7fc5a37
State: Peer in Cluster (Connected)

Hostname: ml53
Uuid: 1dc6ee08-c606-4755-8756-b553f66efa88
State: Peer in Cluster (Connected)

gluster version:
glusterfs 3.3.1 built on Oct 11 2012 21:49:37

rpms:
glusterfs.x86_64                3.3.1-1.el6   @glusterfs-epel
glusterfs-debuginfo.x86_64      3.3.1-1.el6   @glusterfs-epel
glusterfs-fuse.x86_64           3.3.1-1.el6   @glusterfs-epel
glusterfs-rdma.x86_64           3.3.1-1.el6   @glusterfs-epel
glusterfs-server.x86_64         3.3.1-1.el6   @glusterfs-epel

kernel:
Linux 2.6.32-131.17.1.el6.x86_64 #1 SMP Wed Oct 5 17:19:54 CDT 2011 x86_64 x86_64 x86_64 GNU/Linux

OS: Scientific Linux 6.1 (a rebuild of Red Hat Enterprise Linux, like CentOS)