I know it is poor form to reply to yourself, but I would appreciate it if anyone has any insight on this problem.
Mike Preston
Infrastructure Team | SYNETY
www.synety.com<http://www.synety.com>
direct: 0116 424 4016
mobile: 07950 892038
main: 0116 424 4000

From: Mike Preston [mailto:[email protected]]
Sent: 24 September 2013 09:52
To: [email protected]
Subject: Re: [Openstack] Replication error

root@storage-proxy-01:~/swift# swift-ring-builder object.builder validate
root@storage-proxy-01:~/swift# echo $?
0

I ran md5sum on the ring files on both the proxy (where we generate them) and the nodes and confirmed that they are identical.

root@storage-proxy-01:~/swift# swift-ring-builder object.builder
object.builder, build version 72
65536 partitions, 3 replicas, 4 zones, 32 devices, 999.99 balance
The minimum number of hours before a partition can be reassigned is 3
Devices:   id  zone  ip address   port  name   weight  partitions  balance  meta
            0     1  10.20.15.51  6000  sdb1  3000.00        7123     1.44
            1     1  10.20.15.51  6000  sdc1  3000.00        7123     1.44
            2     1  10.20.15.51  6000  sdd1  3000.00        7122     1.43
            3     1  10.20.15.51  6000  sde1  3000.00        7123     1.44
            4     1  10.20.15.51  6000  sdf1  3000.00        7122     1.43
            5     1  10.20.15.51  6000  sdg1  3000.00        7123     1.44
            6     3  10.20.15.51  6000  sdh1     0.00        1273   999.99
            7     3  10.20.15.51  6000  sdi1     0.00        1476   999.99
            8     2  10.20.15.52  6000  sdb1  3000.00        7122     1.43
            9     2  10.20.15.52  6000  sdc1  3000.00        7122     1.43
           10     2  10.20.15.52  6000  sdd1  3000.00        7122     1.43
           11     2  10.20.15.52  6000  sde1  3000.00        7122     1.43
           12     2  10.20.15.52  6000  sdf1  3000.00        7122     1.43
           13     2  10.20.15.52  6000  sdg1  3000.00        7122     1.43
           14     3  10.20.15.52  6000  sdh1     0.00        1378   999.99
           15     3  10.20.15.52  6000  sdi1     0.00         997   999.99
           16     3  10.20.15.53  6000  sas0  3000.00        6130   -12.70
           17     3  10.20.15.53  6000  sas1  3000.00        6130   -12.70
           18     3  10.20.15.53  6000  sas2  3000.00        6129   -12.71
           19     3  10.20.15.53  6000  sas3  3000.00        6130   -12.70
           20     3  10.20.15.53  6000  sas4  3000.00        6130   -12.70
           21     3  10.20.15.53  6000  sas5  3000.00        6130   -12.70
           22     3  10.20.15.53  6000  sas6  3000.00        6129   -12.71
           23     3  10.20.15.53  6000  sas7  3000.00        6129   -12.71
           24     4  10.20.15.54  6000  sas0  3000.00        7122     1.43
           25     4  10.20.15.54  6000  sas1  3000.00        7122     1.43
           26     4  10.20.15.54  6000  sas2  3000.00        7123     1.44
           27     4  10.20.15.54  6000  sas3  3000.00        7123     1.44
           28     4  10.20.15.54  6000  sas4  3000.00        7122     1.43
           29     4  10.20.15.54  6000  sas5  3000.00        7122     1.43
           30     4  10.20.15.54  6000  sas6  3000.00        7123     1.44
           31     4  10.20.15.54  6000  sas7  3000.00        7122     1.43

(We are currently migrating data between boxes due to cluster hardware replacement, which is why zone 3 is weighted as such on the first 2 nodes.)

Filelist attached (for the objects/ directory on the devices)... but I see nothing out of place.
I'll run a full fsck on the drives tonight to try to rule that out.

Thanks for your help.

Mike Preston
Infrastructure Team | SYNETY
www.synety.com<http://www.synety.com>
direct: 0116 424 4016
mobile: 07950 892038
main: 0116 424 4000

From: Clay Gerrard [mailto:[email protected]]
Sent: 23 September 2013 20:34
To: Mike Preston
Cc: [email protected]<mailto:[email protected]>
Subject: Re: [Openstack] Replication error

Run `swift-ring-builder /etc/swift/object.builder validate` - it should have no errors and exit 0.

Can you provide a paste of the output from `swift-ring-builder /etc/swift/object.builder` as well - it should list some general info about the ring (number of replicas, and list of devices).

Rebalance the ring and make sure it's been distributed to all nodes.

The particular line you're seeing pop up in the traceback seems to be looking for all of the nodes for a particular partition it found in the objects' dir. I'm not seeing any local sanitization [1] around those top level directory names, so maybe it's just some garbage that was created there outside of swift, or some file system corruption?

Can you provide the output from `ls /srv/node/objects` (or wherever you have devices configured)

-Clay

1. https://bugs.launchpad.net/swift/+bug/1229372

On Mon, Sep 23, 2013 at 2:34 AM, Mike Preston <[email protected]<mailto:[email protected]>> wrote:

Hi,

We are seeing a replication error on swift.
The error is only seen on a single node; the other nodes appear to be working fine.
Installed version is Debian wheezy with swift 1.4.8-2+deb7u1.

Sep 23 10:33:03 storage-node-01 object-replicator Starting object replication pass.
Sep 23 10:33:03 storage-node-01 object-replicator Exception in top-level replication loop:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/swift/obj/replicator.py", line 564, in replicate
    jobs = self.collect_jobs()
  File "/usr/lib/python2.7/dist-packages/swift/obj/replicator.py", line 536, in collect_jobs
    self.object_ring.get_part_nodes(int(partition))
  File "/usr/lib/python2.7/dist-packages/swift/common/ring/ring.py", line 103, in get_part_nodes
    return [self.devs[r[part]] for r in self._replica2part2dev_id]
IndexError: array index out of range
Sep 23 10:33:03 storage-node-01 object-replicator Nothing replicated for 0.728466033936 seconds.
Sep 23 10:33:03 storage-node-01 object-replicator Object replication complete. (0.01 minutes)

Can anyone shed any light on this, or suggest next steps for debugging or fixing it?

Mike Preston
Infrastructure Team | SYNETY
www.synety.com<http://www.synety.com>
direct: 0116 424 4016
mobile: 07950 892038
main: 0116 424 4000

_______________________________________________
Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
Post to : [email protected]<mailto:[email protected]>
Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
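A note on the traceback in the original report: `IndexError: array index out of range` at `r[part]` is the error Python's `array` type raises when an index is past the end. One plausible cause, in line with Clay's suggestion, is an entry under `objects/` whose name parses as an integer at or above the ring's partition count (65536 in the ring-builder output). The sketch below is a minimal standalone check for that, not part of swift. `find_suspect_partitions` is a hypothetical helper name, and the `/srv/node/*/objects` glob is an assumed device layout that would need adjusting to match the node.

```python
import glob
import os


def find_suspect_partitions(objects_dir, part_count):
    """Return entries in objects_dir that are not valid partition numbers.

    A valid partition directory is assumed to be named with an integer
    in the range [0, part_count).
    """
    suspects = []
    for name in sorted(os.listdir(objects_dir)):
        try:
            # The replicator in the traceback also calls int() on the name.
            part = int(name)
        except ValueError:
            # Not a number at all, e.g. stray files or directories.
            suspects.append(name)
            continue
        if not 0 <= part < part_count:
            # Numeric, but outside the ring's partition range.
            suspects.append(name)
    return suspects


if __name__ == "__main__":
    # Assumption: devices are mounted under /srv/node/<device>/objects.
    # 65536 is the partition count reported by swift-ring-builder.
    for objects_dir in glob.glob("/srv/node/*/objects"):
        for name in find_suspect_partitions(objects_dir, 65536):
            print(os.path.join(objects_dir, name))
```

Run on the affected storage node, this prints the full path of every entry that would not map to a partition in the ring. An empty result would point away from stray directory names and towards a mismatched ring file or file system corruption instead.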
