And a few more data points: it appears the reason for the flaky gluster fs is 
that not all of the servers are running glusterfsd's (see below).  Is there a 
way to force all the servers to start their glusterfsd's as they're supposed 
to?
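
(For reference, the two candidates I know of, tho I'm not sure either is the 
sanctioned way, are restarting the management daemon, which should respawn 
the bricks it knows about, and force-starting the volume:)

# restart glusterd; it should respawn the brick glusterfsd's
/etc/init.d/glusterd restart
# or force-start the volume, which I believe restarts any missing bricks
gluster volume start gli force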

The mystery rebalance did complete, and seems to have fixed some, but not 
all, of the problem files, e.g.:

> drwx------ 2 spoorkas spoorkas  8211 Jun  2 00:22 QPSK_2Tx_2Rx_BH_Method2/
> ?--------- ? ?        ?            ?            ? QPSK_2Tx_2Rx_ML_Method1
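
(As noted below, these entries look intact on the individual bricks; if it 
helps diagnosis, I can dump the dht layout xattrs for one of them on a brick 
server. The <path-to> below is a placeholder, since the full path depends on 
where the dir lives under /bducgl:)

# on a brick server; <path-to> stands in for the real parent dirs
getfattr -m . -d -e hex /bducgl/<path-to>/QPSK_2Tx_2Rx_ML_Method1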

And the started/not-started status has gotten weirder, if possible...

The gluster volume is still being exported to clients, despite gluster 
insisting that the volume is not started (servers are pbs[1234]).

Result of:
$ gluster volume status
pbs1:Volume gli is not started
pbs2:Volume gli is not started
pbs3:Volume gli is not started
pbs4:Volume gli is not started

$ gluster volume info  (Status lines only):
pbs1:Status: Stopped
pbs2:Status: Started  <- aha!
pbs3:Status: Started  <- aha!
pbs4:Status: Started
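
(The pbsN: prefixes above come from running each command on all 4 servers; 
a minimal version of that loop, assuming passwordless ssh between the 
servers, looks like:)

for h in pbs1 pbs2 pbs3 pbs4; do
  # prefix each server's Status line with its hostname
  ssh $h "gluster volume info gli" | grep '^Status' | sed "s/^/$h:/"
done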

This correlates with the per-server process listings, in which only pbs[23] 
are running a glusterfsd:

pbs2:root      1799  0.1  0.0 184296 16464 ?        Ssl  13:07   0:06 
/usr/sbin/glusterfsd -s localhost --volfile-id gli.pbs2ib.bducgl 
-p /var/lib/glusterd/vols/gli/run/pbs2ib-bducgl.pid 
-S /tmp/c70b2f910e2fe1bb485b1d76ef63e3db.socket 
--brick-name /bducgl -l /var/log/glusterfs/bricks/bducgl.log 
--xlator-option *-posix.glusterd-uuid=26de63bd-c5b7-48ba-b81d-5d77a533d077 
--brick-port 24025 24026 
--xlator-option gli-server.transport.rdma.listen-port=24026 
--xlator-option gli-server.listen-port=24025

pbs3:root      1751  0.1  0.0 184168 16468 ?        Ssl  13:07   0:06 
/usr/sbin/glusterfsd -s localhost --volfile-id gli.pbs3ib.bducgl 
-p /var/lib/glusterd/vols/gli/run/pbs3ib-bducgl.pid 
-S /tmp/7096377992feb7f5a7805cafd82c3100.socket 
--brick-name /bducgl -l /var/log/glusterfs/bricks/bducgl.log 
--xlator-option *-posix.glusterd-uuid=c79c4084-d6b9-4af9-b975-40dd6aa99b42 
--brick-port 24018 24020 
--xlator-option gli-server.transport.rdma.listen-port=24020 
--xlator-option gli-server.listen-port=24018

pbs[14] are running only the glusterd management process, not any 
glusterfsd's.
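
(Quick way to verify, same loop idea:)

for h in pbs1 pbs2 pbs3 pbs4; do
  echo "== $h"
  # -f matches against the full command line, -l prints it with the pid
  ssh $h "pgrep -fl glusterfsd"
done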

In previous startups, pbs4 WAS running a glusterfsd, but pbs1 has not run one 
since the powerdown AFAIK.


The glusterfsd man page suggests how to run glusterfsd manually, but my 
attempts to do so have not been successful, either by re-running the command 
line of a previously working glusterfsd (note the quoted '*-posix' option so 
the shell doesn't try to glob it), i.e.:

/usr/sbin/glusterfsd -s localhost --volfile-id gli.pbs4ib.bducgl \
  -p /var/lib/glusterd/vols/gli/run/pbs4ib-bducgl.pid \
  -S /tmp/c949ff0eb195fea64311730525afed68.socket \
  --brick-name /bducgl -l /var/log/glusterfs/bricks/bducgl.log \
  --xlator-option '*-posix.glusterd-uuid=2a593581-bf45-446c-8f7c-212c53297803' \
  --brick-port 24009 24010 \
  --xlator-option gli-server.transport.rdma.listen-port=24010 \
  --xlator-option gli-server.listen-port=24009


or by running a minimal version thereof:

/usr/sbin/glusterfsd -s localhost \
-l /var/log/glusterfs/bricks/bducgl.log \
--volfile-id gli.pbs4ib.bducgl \
-p /var/lib/glusterd/vols/gli/run/pbs4ib-bducgl.pid \
--brick-name /bducgl 
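
Neither attempt leaves a glusterfsd running. Presumably the reason lands in 
the brick log, so that's where I've been looking:

# right after an attempt, check the tail of the brick log for the exit reason
tail -n 50 /var/log/glusterfs/bricks/bducgl.log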

Is there a trick to convince the glusterfsd to stay up?
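
(Next I'll try running it in the foreground with debug logging to catch the 
exit reason on stderr - assuming glusterfsd honors the same -N/--debug 
options as the glusterfs client:)

# foreground, debug-level logging to stderr; same volfile-id as above
/usr/sbin/glusterfsd -N --debug -s localhost \
  --volfile-id gli.pbs4ib.bducgl --brick-name /bducgl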


hjm


On Saturday, October 06, 2012 10:19:14 PM harry mangalam wrote:
> ...and should have added:
> 
> the rebalance log (the volume claimed to be rebalancing before I shut it
> down, but was idle or wedged at that time) is active as well, with about
> one "1 subvolumes down -- not fixing" warning for every 3 informational
> messages:
> 
> [2012-10-06 22:05:35.396650] I [dht-rebalance.c:1058:gf_defrag_migrate_data]
> 0-gli-dht: migrate data called on
> /nlduong/nduong2-t-illiac/workspace/m5_sim/trunk/src/arch/.svn/tmp/wcprops
> 
> [2012-10-06 22:05:35.451925] I [dht-layout.c:593:dht_layout_normalize]
> 0-gli-dht: found anomalies in
> /nlduong/nduong2-t-illiac/workspace/m5_sim/trunk/src/arch/.svn/wcprops.
> holes=1 overlaps=0
> 
> [2012-10-06 22:05:35.451957] W [dht-selfheal.c:875:dht_selfheal_directory]
> 0-gli-dht: 1 subvolumes down -- not fixing
> 
> 
> previously...
> 
> Gluster 3.3, running on Ubuntu 10.04, was running OK; I had to shut it down
> for a power outage.
> 
> When I tried to shut it down, it insisted that it was rebalancing, but
> seemed wedged - no activity in the logs.
> 
> Was able to shut it down tho.
> 
> After power was restored, I tried to restart the volume, but altho the 4
> peers claimed to be visible and could ping each other etc:
> ==============================================
> Sat Oct 06 21:38:07 [0.81 0.71 0.58]  root@pbs2:/var/log/glusterfs/bricks
> 567 $ gluster peer status
> Number of Peers: 3
> 
> Hostname: pbs3ib
> Uuid: c79c4084-d6b9-4af9-b975-40dd6aa99b42
> State: Peer in Cluster (Connected)
> 
> Hostname: 10.255.77.2
> Uuid: 3fcd023c-9cc9-4d1c-84c4-babfb4492e38
> State: Peer in Cluster (Connected)
> 
> Hostname: pbs4ib
> Uuid: 2a593581-bf45-446c-8f7c-212c53297803
> State: Peer in Cluster (Connected)
> ==============================================
> 
> and the volume info seemed to be OK:
> ==============================================
> Sat Oct 06 21:36:11 [0.75 0.67 0.56]  root@pbs2:/var/log/glusterfs/bricks
> 565 $ gluster volume info gli
> 
> Volume Name: gli
> Type: Distribute
> Volume ID: 76cc5e88-0ac4-42ac-a4a3-31bf2ba611d4
> Status: Started
> Number of Bricks: 4
> Transport-type: tcp,rdma
> Bricks:
> Brick1: pbs1ib:/bducgl
> Brick2: pbs2ib:/bducgl
> Brick3: pbs3ib:/bducgl
> Brick4: pbs4ib:/bducgl
> Options Reconfigured:
> performance.write-behind-window-size: 1024MB
> performance.flush-behind: on
> performance.cache-size: 268435456
> nfs.disable: on
> performance.io-thread-count: 64
> performance.quick-read: on
> performance.io-cache: on
> 
> ==============================================
> some utilities claim that it was not started, even tho some clients /are
> using the volume/ (tho there are some file oddities)
> (from a client):
> 
> -rw-r--r-- 1 hmangala hmangala       32935 Jun 23  2010 INSTALL.txt
> ?--------- ? ?        ?                  ?            ? R-2.15.0
> drwxr-xr-x 2 hmangala hmangala          18 Sep 10 14:20 bonnie/
> drwxr-xr-x 2 root     root              18 Sep 10 13:41 bonnie2/
> 
> drwx------ 2 spoorkas spoorkas  8211 Jun  2 00:22 QPSK_2Tx_2Rx_BH_Method2/
> ?--------- ? ?        ?            ?            ? QPSK_2Tx_2Rx_ML_Method1
> drwx------ 2 spoorkas spoorkas  8237 Jun  3 11:22 QPSK_2Tx_2Rx_ML_Method2/
> drwx------ 2 spoorkas spoorkas 12288 Jun  4 01:24 QPSK_2Tx_3Rx_BH/
> drwx------ 2 spoorkas spoorkas  4232 Jun  2 00:26 QPSK_2Tx_3Rx_BH_Method1/
> drwx------ 2 spoorkas spoorkas  8274 Jun  2 00:34 QPSK_2Tx_3Rx_BH_Method2/
> ?--------- ? ?        ?            ?            ? QPSK_2Tx_3Rx_ML_Method1
> ?--------- ? ?        ?            ?            ? QPSK_2Tx_3Rx_ML_Method2
> -rw-r--r-- 1 spoorkas spoorkas     0 Apr 17 14:16 simple.sh.e1802207
> 
> (These files appear to be intact on the individual bricks tho.)
> 
> ==============================================
> Sat Oct 06 21:38:18 [0.76 0.71 0.58]  root@pbs2:/var/log/glusterfs/bricks
> 568 $ gluster volume status
> Volume gli is not started
> ==============================================
> 
> and since that is the case, other utilities also claim this:
> 
> ==============================================
> Sat Oct 06 21:41:25 [1.04 0.84 0.65]  root@pbs2:/var/log/glusterfs/bricks
> 571 $ gluster volume status gli detail
> Volume gli is not started
> ==============================================
> 
> And since they think it's not started, I can't stop it.
> 
> How is this resolvable?
-- 
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
--
Passive-Aggressive Supporter of The Canada Party:
  <http://www.americabutbetter.com/>