And a few more data points: it appears the reason for the flaky glusterfs is that not all of the servers are running glusterfsds (see below). Is there a way to force all the servers to start their glusterfsds, as they're supposed to?
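(One thing I'm tempted to try, tho I haven't yet: I believe gluster 3.3 accepts a 'force' flag on volume start, which is supposed to respawn any brick daemons that aren't running without disturbing the ones that are. Treat the exact behavior as an assumption on my part:)

```shell
# Hedged suggestion: ask glusterd to (re)start any missing brick
# daemons for the 'gli' volume; needs a live gluster cluster to run.
gluster volume start gli force
```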
The mystery rebalance did complete, and seems to have fixed some but not all problem files, ie:

> drwx------ 2 spoorkas spoorkas 8211 Jun  2 00:22 QPSK_2Tx_2Rx_BH_Method2/
> ?--------- ? ?        ?            ?            ? QPSK_2Tx_2Rx_ML_Method1

And the started/not-started status has gotten weirder, if possible. The gluster volume is still being exported to clients, despite gluster insisting that the volume is not started (servers are pbs[1234]).

Result of $ gluster volume status:
pbs1: Volume gli is not started
pbs2: Volume gli is not started
pbs3: Volume gli is not started
pbs4: Volume gli is not started

$ gluster volume info:
pbs1: Status: Stopped
pbs2: Status: Started  <- aha!
pbs3: Status: Started  <- aha!
pbs4: Status: Started

This correlates with the glusterfsd status, in which only pbs[23] are running glusterfsd:

pbs2: root 1799 0.1 0.0 184296 16464 ? Ssl 13:07 0:06 /usr/sbin/glusterfsd
      -s localhost --volfile-id gli.pbs2ib.bducgl
      -p /var/lib/glusterd/vols/gli/run/pbs2ib-bducgl.pid
      -S /tmp/c70b2f910e2fe1bb485b1d76ef63e3db.socket
      --brick-name /bducgl -l /var/log/glusterfs/bricks/bducgl.log
      --xlator-option *-posix.glusterd-uuid=26de63bd-c5b7-48ba-b81d-5d77a533d077
      --brick-port 24025 24026
      --xlator-option gli-server.transport.rdma.listen-port=24026
      --xlator-option gli-server.listen-port=24025

pbs3: root 1751 0.1 0.0 184168 16468 ? Ssl 13:07 0:06 /usr/sbin/glusterfsd
      -s localhost --volfile-id gli.pbs3ib.bducgl
      -p /var/lib/glusterd/vols/gli/run/pbs3ib-bducgl.pid
      -S /tmp/7096377992feb7f5a7805cafd82c3100.socket
      --brick-name /bducgl -l /var/log/glusterfs/bricks/bducgl.log
      --xlator-option *-posix.glusterd-uuid=c79c4084-d6b9-4af9-b975-40dd6aa99b42
      --brick-port 24018 24020
      --xlator-option gli-server.transport.rdma.listen-port=24020
      --xlator-option gli-server.listen-port=24018

pbs[14] are only running the glusterd process, not any glusterfsds. In previous startups, pbs4 WAS running a glusterfsd, but pbs1 has not run one since the powerdown, AFAIK.
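(For the record, this is roughly how I'm checking which servers have a brick daemon up; it assumes passwordless root ssh from the admin host to the pbs servers, so it won't run anywhere else as-is:)

```shell
#!/bin/bash
# Report which gluster servers are running a brick daemon (glusterfsd).
# Assumes passwordless root ssh to pbs1..pbs4 (hypothetical admin host setup).
for h in pbs1 pbs2 pbs3 pbs4; do
    if ssh "$h" pgrep -x glusterfsd >/dev/null 2>&1; then
        echo "$h: glusterfsd running"
    else
        echo "$h: NO glusterfsd"
    fi
done
```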
hjm

On Saturday, October 06, 2012 10:19:14 PM harry mangalam wrote:
> ...and should have added:
>
> the rebalance log (the volume claimed to be rebalancing before I shut it
> down but was idle or wedged at that time) is active as well, with about 1
> warning of a "1 subvolumes down -- not fixing" for every 3 informational
> messages:
>
> [2012-10-06 22:05:35.396650] I [dht-rebalance.c:1058:gf_defrag_migrate_data]
> 0-gli-dht: migrate data called on
> /nlduong/nduong2-t-illiac/workspace/m5_sim/trunk/src/arch/.svn/tmp/wcprops
>
> [2012-10-06 22:05:35.451925] I [dht-layout.c:593:dht_layout_normalize]
> 0-gli-dht: found anomalies in
> /nlduong/nduong2-t-illiac/workspace/m5_sim/trunk/src/arch/.svn/wcprops.
> holes=1 overlaps=0
>
> [2012-10-06 22:05:35.451957] W [dht-selfheal.c:875:dht_selfheal_directory]
> 0-gli-dht: 1 subvolumes down -- not fixing
>
> previously...
>
> gluster 3.3, running on ubuntu 10.04, was running OK; had to shut down for
> a power outage.
>
> When I tried to shut it down, it insisted that it was rebalancing, but
> seemed wedged - no activity in the logs.
>
> Was able to shut it down tho.
>
> After power was restored, tried to restart the volume but altho the 4 peers
> claimed to be visible and could ping each other etc:
> ==============================================
> Sat Oct 06 21:38:07 [0.81 0.71 0.58] root@pbs2:/var/log/glusterfs/bricks
> 567 $ gluster peer status
> Number of Peers: 3
>
> Hostname: pbs3ib
> Uuid: c79c4084-d6b9-4af9-b975-40dd6aa99b42
> State: Peer in Cluster (Connected)
>
> Hostname: 10.255.77.2
> Uuid: 3fcd023c-9cc9-4d1c-84c4-babfb4492e38
> State: Peer in Cluster (Connected)
>
> Hostname: pbs4ib
> Uuid: 2a593581-bf45-446c-8f7c-212c53297803
> State: Peer in Cluster (Connected)
> ==============================================
>
> and the volume info seemed to be OK:
> ==============================================
> Sat Oct 06 21:36:11 [0.75 0.67 0.56] root@pbs2:/var/log/glusterfs/bricks
> 565 $ gluster volume info gli
>
> Volume Name: gli
> Type: Distribute
> Volume ID: 76cc5e88-0ac4-42ac-a4a3-31bf2ba611d4
> Status: Started
> Number of Bricks: 4
> Transport-type: tcp,rdma
> Bricks:
> Brick1: pbs1ib:/bducgl
> Brick2: pbs2ib:/bducgl
> Brick3: pbs3ib:/bducgl
> Brick4: pbs4ib:/bducgl
> Options Reconfigured:
> performance.write-behind-window-size: 1024MB
> performance.flush-behind: on
> performance.cache-size: 268435456
> nfs.disable: on
> performance.io-thread-count: 64
> performance.quick-read: on
> performance.io-cache: on
> ==============================================
>
> some utilities claim that it was not started, even tho some clients /are
> using the volume/ (tho there are some file oddities)
> (from a client):
>
> -rw-r--r-- 1 hmangala hmangala 32935 Jun 23  2010 INSTALL.txt
> ?--------- ? ?        ?            ?           ? R-2.15.0
> drwxr-xr-x 2 hmangala hmangala    18 Sep 10 14:20 bonnie/
> drwxr-xr-x 2 root     root        18 Sep 10 13:41 bonnie2/
>
> drwx------ 2 spoorkas spoorkas  8211 Jun  2 00:22 QPSK_2Tx_2Rx_BH_Method2/
> ?--------- ? ?        ?            ?            ? QPSK_2Tx_2Rx_ML_Method1
> drwx------ 2 spoorkas spoorkas  8237 Jun  3 11:22 QPSK_2Tx_2Rx_ML_Method2/
> drwx------ 2 spoorkas spoorkas 12288 Jun  4 01:24 QPSK_2Tx_3Rx_BH/
> drwx------ 2 spoorkas spoorkas  4232 Jun  2 00:26 QPSK_2Tx_3Rx_BH_Method1/
> drwx------ 2 spoorkas spoorkas  8274 Jun  2 00:34 QPSK_2Tx_3Rx_BH_Method2/
> ?--------- ? ?        ?            ?            ? QPSK_2Tx_3Rx_ML_Method1
> ?--------- ? ?        ?            ?            ? QPSK_2Tx_3Rx_ML_Method2
> -rw-r--r-- 1 spoorkas spoorkas     0 Apr 17 14:16 simple.sh.e1802207
>
> (These files appear to be intact on the individual bricks tho.)
>
> ==============================================
> Sat Oct 06 21:38:18 [0.76 0.71 0.58] root@pbs2:/var/log/glusterfs/bricks
> 568 $ gluster volume status
> Volume gli is not started
> ==============================================
>
> and since that is the case, other utilities also claim this:
>
> ==============================================
> Sat Oct 06 21:41:25 [1.04 0.84 0.65] root@pbs2:/var/log/glusterfs/bricks
> 571 $ gluster volume status gli detail
> Volume gli is not started
> ==============================================
>
> And since they think it's not started, I can't stop it.
>
> How is this resolvable?

--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
--
Passive-Aggressive Supporter of the The Canada Party:
<http://www.americabutbetter.com/>

_______________________________________________
Gluster-users mailing list
[email protected]
http://supercolony.gluster.org/mailman/listinfo/gluster-users
