And a few more data points: it appears the reason for the flaky gluster fs is that not all of the servers are running glusterfsd (see below). Is there a way to force all the servers to start their glusterfsd processes, as they're supposed to?
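(Aside: if I'm reading the CLI docs right, a forced volume start is supposed to make glusterd respawn any missing brick daemons without a full stop/start cycle. The sketch below just prints the commands as a dry run rather than executing them, since I haven't verified the side effects on a volume in this half-started state.)

```shell
#!/bin/sh
# Dry-run sketch: 'gluster volume start <vol> force' is documented to make
# glusterd (re)spawn any brick daemons (glusterfsd) that should be running
# but are not, even when the volume already reports Started.
# The echo makes this a no-op; remove it to actually run on one server.
vol=gli
echo "gluster volume start $vol force"
# afterwards, verify brick PIDs/ports came up:
echo "gluster volume status $vol"
```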
The mystery rebalance did complete, and seems to have fixed some but not all problem files, ie:

> drwx------ 2 spoorkas spoorkas 8211 Jun 2 00:22 QPSK_2Tx_2Rx_BH_Method2/
> ?--------- ? ? ? ? ? QPSK_2Tx_2Rx_ML_Method1

And the started/not-started status has gotten weirder, if possible. The gluster volume is still being exported to clients, despite gluster insisting that the volume is not started (servers are pbs[1234]):

result of $ gluster volume status
pbs1: Volume gli is not started
pbs2: Volume gli is not started
pbs3: Volume gli is not started
pbs4: Volume gli is not started

$ gluster volume info:
pbs1: Status: Stopped
pbs2: Status: Started  <- aha!
pbs3: Status: Started  <- aha!
pbs4: Status: Started

This correlates with the glusterfsd status, in which only pbs[23] are running glusterfsd:

pbs2: root 1799 0.1 0.0 184296 16464 ? Ssl 13:07 0:06 /usr/sbin/glusterfsd
  -s localhost --volfile-id gli.pbs2ib.bducgl
  -p /var/lib/glusterd/vols/gli/run/pbs2ib-bducgl.pid
  -S /tmp/c70b2f910e2fe1bb485b1d76ef63e3db.socket
  --brick-name /bducgl -l /var/log/glusterfs/bricks/bducgl.log
  --xlator-option *-posix.glusterd-uuid=26de63bd-c5b7-48ba-b81d-5d77a533d077
  --brick-port 24025 24026
  --xlator-option gli-server.transport.rdma.listen-port=24026
  --xlator-option gli-server.listen-port=24025

pbs3: root 1751 0.1 0.0 184168 16468 ? Ssl 13:07 0:06 /usr/sbin/glusterfsd
  -s localhost --volfile-id gli.pbs3ib.bducgl
  -p /var/lib/glusterd/vols/gli/run/pbs3ib-bducgl.pid
  -S /tmp/7096377992feb7f5a7805cafd82c3100.socket
  --brick-name /bducgl -l /var/log/glusterfs/bricks/bducgl.log
  --xlator-option *-posix.glusterd-uuid=c79c4084-d6b9-4af9-b975-40dd6aa99b42
  --brick-port 24018 24020
  --xlator-option gli-server.transport.rdma.listen-port=24020
  --xlator-option gli-server.listen-port=24018

pbs[14] are only running the glusterd process, not any glusterfsd's. In previous startups, pbs4 WAS running a glusterfsd, but pbs1 has not run one since the powerdown AFAIK.
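(For the record, this is the sort of loop I'm using to check brick-daemon status across the servers; it assumes passwordless root ssh to pbs[1234], and the echo keeps it a dry run. `pgrep -f` matches against the full command line, so it finds the glusterfsd brick daemons rather than just the glusterd management daemon.)

```shell
#!/bin/sh
# Dry-run sketch: report which servers actually have a glusterfsd brick
# process. Assumes passwordless ssh as root to each server; drop the echo
# to actually poll them.
for h in pbs1 pbs2 pbs3 pbs4; do
    # 'pgrep -fl glusterfsd' matches the full command line, so it lists
    # the brick daemon(s) and not the glusterd management process.
    echo "ssh $h pgrep -fl glusterfsd"
done
```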
The glusterfsd man page suggests how to run glusterfsd manually, but my attempts to do so have not been successful, either by running the previously running glusterfsd command, ie:

/usr/sbin/glusterfsd -s localhost --volfile-id gli.pbs4ib.bducgl \
  -p /var/lib/glusterd/vols/gli/run/pbs4ib-bducgl.pid \
  -S /tmp/c949ff0eb195fea64311730525afed68.socket \
  --brick-name /bducgl -l /var/log/glusterfs/bricks/bducgl.log \
  --xlator-option *-posix.glusterd-uuid=2a593581-bf45-446c-8f7c-212c53297803 \
  --brick-port 24009 24010 \
  --xlator-option gli-server.transport.rdma.listen-port=24010 \
  --xlator-option gli-server.listen-port=24009

or by running a minimal version thereof:

/usr/sbin/glusterfsd -s localhost \
  -l /var/log/glusterfs/bricks/bducgl.log \
  --volfile-id gli.pbs4ib.bducgl \
  -p /var/lib/glusterd/vols/gli/run/pbs4ib-bducgl.pid \
  --brick-name /bducgl

Is there a trick to convince the glusterfsd to stay up?

hjm

On Saturday, October 06, 2012 10:19:14 PM harry mangalam wrote:
> ...and should have added:
>
> the rebalance log (the volume claimed to be rebalancing before I shut it
> down but was idle or wedged at that time) is active as well, with about 1
> warning of a "1 subvolumes down -- not fixing" for every 3 informational
> messages:
>
> [2012-10-06 22:05:35.396650] I [dht-rebalance.c:1058:gf_defrag_migrate_data]
> 0-gli-dht: migrate data called on /nlduong/nduong2-t-
> illiac/workspace/m5_sim/trunk/src/arch/.svn/tmp/wcprops
>
> [2012-10-06 22:05:35.451925] I [dht-layout.c:593:dht_layout_normalize]
> 0-gli-dht: found anomalies in /nlduong/nduong2-t-
> illiac/workspace/m5_sim/trunk/src/arch/.svn/wcprops. holes=1 overlaps=0
>
> [2012-10-06 22:05:35.451957] W [dht-selfheal.c:875:dht_selfheal_directory]
> 0-gli-dht: 1 subvolumes down -- not fixing
>
> previously...
>
> gluster 3.3, running on ubuntu 10.04, was running OK, had to shut down
> for a power outage.
>
> When I tried to shut it down, it insisted that it was rebalancing, but
> seemed wedged - no activity in the logs.
>
> Was able to shut it down tho.
>
> After power was restored, tried to restart the volume, but altho the 4
> peers claimed to be visible and could ping each other etc:
> ==============================================
> Sat Oct 06 21:38:07 [0.81 0.71 0.58] root@pbs2:/var/log/glusterfs/bricks
> 567 $ gluster peer status
> Number of Peers: 3
>
> Hostname: pbs3ib
> Uuid: c79c4084-d6b9-4af9-b975-40dd6aa99b42
> State: Peer in Cluster (Connected)
>
> Hostname: 10.255.77.2
> Uuid: 3fcd023c-9cc9-4d1c-84c4-babfb4492e38
> State: Peer in Cluster (Connected)
>
> Hostname: pbs4ib
> Uuid: 2a593581-bf45-446c-8f7c-212c53297803
> State: Peer in Cluster (Connected)
> ==============================================
>
> and the volume info seemed to be OK:
> ==============================================
> Sat Oct 06 21:36:11 [0.75 0.67 0.56] root@pbs2:/var/log/glusterfs/bricks
> 565 $ gluster volume info gli
>
> Volume Name: gli
> Type: Distribute
> Volume ID: 76cc5e88-0ac4-42ac-a4a3-31bf2ba611d4
> Status: Started
> Number of Bricks: 4
> Transport-type: tcp,rdma
> Bricks:
> Brick1: pbs1ib:/bducgl
> Brick2: pbs2ib:/bducgl
> Brick3: pbs3ib:/bducgl
> Brick4: pbs4ib:/bducgl
> Options Reconfigured:
> performance.write-behind-window-size: 1024MB
> performance.flush-behind: on
> performance.cache-size: 268435456
> nfs.disable: on
> performance.io-thread-count: 64
> performance.quick-read: on
> performance.io-cache: on
>
> ==============================================
> some utilities claim that it was not started, even tho some clients /are
> using the volume/ (tho there are some file oddities)
> (from a client):
>
> -rw-r--r-- 1 hmangala hmangala 32935 Jun 23 2010 INSTALL.txt
> ?--------- ? ? ? ? ? R-2.15.0
> drwxr-xr-x 2 hmangala hmangala 18 Sep 10 14:20 bonnie/
> drwxr-xr-x 2 root root 18 Sep 10 13:41 bonnie2/
>
> drwx------ 2 spoorkas spoorkas 8211 Jun 2 00:22 QPSK_2Tx_2Rx_BH_Method2/
> ?--------- ? ? ? ? ? QPSK_2Tx_2Rx_ML_Method1
> drwx------ 2 spoorkas spoorkas 8237 Jun 3 11:22 QPSK_2Tx_2Rx_ML_Method2/
> drwx------ 2 spoorkas spoorkas 12288 Jun 4 01:24 QPSK_2Tx_3Rx_BH/
> drwx------ 2 spoorkas spoorkas 4232 Jun 2 00:26 QPSK_2Tx_3Rx_BH_Method1/
> drwx------ 2 spoorkas spoorkas 8274 Jun 2 00:34 QPSK_2Tx_3Rx_BH_Method2/
> ?--------- ? ? ? ? ? QPSK_2Tx_3Rx_ML_Method1
> ?--------- ? ? ? ? ? QPSK_2Tx_3Rx_ML_Method2
> -rw-r--r-- 1 spoorkas spoorkas 0 Apr 17 14:16 simple.sh.e1802207
>
> (These files appear to be intact on the individual bricks tho.)
>
> ==============================================
> Sat Oct 06 21:38:18 [0.76 0.71 0.58] root@pbs2:/var/log/glusterfs/bricks
> 568 $ gluster volume status
> Volume gli is not started
> ==============================================
>
> and since that is the case, other utilities also claim this:
>
> ==============================================
> Sat Oct 06 21:41:25 [1.04 0.84 0.65] root@pbs2:/var/log/glusterfs/bricks
> 571 $ gluster volume status gli detail
> Volume gli is not started
> ==============================================
>
> And since they think it's not started, I can't stop it.
>
> How is this resolvable?

--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
--
Passive-Aggressive Supporter of the The Canada Party:
<http://www.americabutbetter.com/>

_______________________________________________
Gluster-users mailing list
[email protected]
http://supercolony.gluster.org/mailman/listinfo/gluster-users
