Alright, it seems we're fine now!
We basically took two actions, and the network issue seems gone.

1. These servers are VMs on a cloud provider, so I don't really have access to details here. The assigned sysadmin reported that one of my Gluster VMs was on a crowded host, which could potentially have been affecting both load (CPU/memory) and network performance. He moved this one VM to a new (and less crowded) host. The other VM that is part of this Gluster setup was kept where it was.

2. I set up a new internet-isolated subnet between these VMs, allowing me to get the firewall out of the way.

It seems #1 was the actual fix, and #2 was a nice-to-have we picked up along the way.

Before:

root@web3:~# date; time ls -ltrh /var/www/site-images/templates/assets/prod/temporary/13/user_1339200.png
Mon Jan 26 07:00:27 PST 2015
-rwx---r-- 1 mhmadmin mhmadmin 61K Jan 22 14:37 /var/www/site-images/templates/assets/prod/temporary/13/user_1339200.png

real    0m33.651s
user    0m0.001s
sys     0m0.004s

After:

root@web3:~# date; time ls -ltrh /var/www/site-images/templates/assets/prod/temporary/13/user_1410560.png
Tue Feb 10 15:28:18 PST 2015
-rwx---r-- 1 mhmadmin mhmadmin 17K Feb 10 12:41 /var/www/site-images/templates/assets/prod/temporary/13/user_1410560.png

real    0m0.031s
user    0m0.001s
sys     0m0.006s

The case seems closed. If you guys have any questions that I know the answer to, or can help with, please let me know.

Thanks Anirban, Joe, and the selected audience :)

--
*Tiago Santos*

On Wed, Jan 28, 2015 at 2:45 PM, Tiago Santos <[email protected]> wrote:

> Since I stopped writing to the clients (so I could cleanly work on the
> split-brain), I got no more entries in /var/log/gluster.log (this is the
> client log, right?)
>
> While working with the diff command in order to fix the split-brain, I saw
> several entries like these:
>
> diff: r2/webhost/sites/clipart/assets/apache/images/13/templates/558482:
> Transport endpoint is not connected
> diff: r2/webhost/sites/clipart/assets/apache/images/13/templates/558483:
> Transport endpoint is not connected
> diff: r2/webhost/sites/clipart/assets/apache/images/13/templates/558484:
> Transport endpoint is not connected
>
> They happen a lot, then stop. Then they happen again, and so on.
>
> At the same time the errors are showing, a ping from the system where I'm
> working on the split-brain to the system that is failing to connect (r2)
> shows this:
>
> 64 bytes from r2-server (r2-ip): icmp_seq=662 ttl=64 time=1.21 ms
> 64 bytes from r2-server (r2-ip): icmp_seq=663 ttl=64 time=0.990 ms
> 64 bytes from r2-server (r2-ip): icmp_seq=664 ttl=64 time=1.01 ms
>
> I know this is a very trivial network check that may not be showing me
> what I want to see, and I'm working on a more elaborate one. But I'm
> completely open to suggestions on how to do that properly, in order to
> verify whether this is the issue as far as Gluster is concerned.
>
> So far, thank you so much, guys!
>
>
> On Mon, Jan 26, 2015 at 8:36 PM, Joe Julian <[email protected]> wrote:
>
>> Check your client logs. Perhaps the client isn't actually connecting to
>> both servers.
>>
>> On 01/26/2015 02:12 PM, Tiago Santos wrote:
>>
>> That's what I meant. Sorry for the confusion.
>>
>> I'm writing on Client1 (same server as Brick1). Client2 (which mounted
>> Brick2, on server2) has nothing writing to it (so far).
>>
>> What I wonder is how I ended up with a split-brain if I'm only writing
>> on one client.
>>
>> On Mon, Jan 26, 2015 at 8:04 PM, Joe Julian <[email protected]> wrote:
>>
>>> Nothing but GlusterFS should be writing to bricks. Mount a client and
>>> write there.
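For anyone reading later, here is a minimal sketch of what "mount a client and write there" can look like for this setup. The volume name site-images is inferred from the trusted.afr keys shown further down the thread, and the rsync source path is purely illustrative; the point is that the brick directories under /export/ are never written to directly.

    # The thread suggests the volume is already FUSE-mounted at /var/www/site-images;
    # on a box where it is not, a client mount looks like this:
    mount -t glusterfs web3:/site-images /var/www/site-images
    # All writes (the rsync jobs included) go through that mount,
    # never through /export/images1-1/brick or /export/images2-1/brick:
    rsync -a /srv/production-images/ /var/www/site-images/   # source path illustrative only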
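And on the earlier question about a network check less trivial than ping: ICMP can look healthy while the TCP ports Gluster actually uses are blocked or degraded, so it helps to probe those ports directly. A sketch, again assuming the volume is named site-images; glusterd listens on TCP 24007 by default, and each brick listens on its own port, which gluster volume status reports.

    gluster volume status site-images        # note the Port column for each brick
    nc -zv r2-server 24007                   # glusterd management port
    nc -zv r2-server BRICK_PORT              # repeat for each brick port from the status output
    ping -M do -s 1472 r2-server             # quick MTU/fragmentation sanity check on Linux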
>>> On 01/26/2015 01:38 PM, Tiago Santos wrote:
>>>
>>> Right.
>>>
>>> I have Brick1 being written to constantly, but I have nothing writing to
>>> Brick2. It just gets "healed" data from Brick1.
>>>
>>> This setup is still not in production, and there are no applications
>>> using that data. I have rsyncs constantly updating Brick1 (bringing data
>>> from the production servers), and then Gluster updates Brick2.
>>>
>>> Which makes me wonder how I could be creating multiple replicas during a
>>> split-brain.
>>>
>>> Could it be that, during a split-brain event, I am updating versions of
>>> the same file on Brick1 (only), and Gluster sees them as different
>>> versions and things get confused?
>>>
>>> Anyway, while we talk I'm going to run Joe's precious procedure for
>>> split-brain recovery.
>>>
>>> On Mon, Jan 26, 2015 at 7:23 PM, Joe Julian <[email protected]> wrote:
>>>
>>>> Mismatched GFIDs would happen if a file is created on multiple replicas
>>>> during a split-brain event. The GFID is assigned at file creation.
>>>>
>>>> On 01/26/2015 01:04 PM, A Ghoshal wrote:
>>>>
>>>>> Yep, so it is indeed a split-brain caused by a mismatch of the
>>>>> trusted.gfid attribute.
>>>>>
>>>>> Sadly, I don't know precisely what causes it. Communication loss might
>>>>> be one of the triggers. I am guessing the files with the problem are
>>>>> dynamic, correct? In our setup (also replica 2), communication is never
>>>>> a problem, but we do see this when one of the servers takes a reboot.
>>>>> Maybe some obscure and hard-to-understand race between background
>>>>> self-heal and the self-heal daemon...
>>>>>
>>>>> In any case, the normal split-brain recovery procedure should work for
>>>>> you if you want to get your files back in working order. It's easy to
>>>>> find on Google. I use the instructions on Joe Julian's blog page myself.
>>>>>
>>>>> -----Tiago Santos <[email protected]> wrote: -----
>>>>>
>>>>> =======================
>>>>> To: A Ghoshal <[email protected]>
>>>>> From: Tiago Santos <[email protected]>
>>>>> Date: 01/27/2015 02:11AM
>>>>> Cc: gluster-users <[email protected]>
>>>>> Subject: Re: [Gluster-users] Pretty much any operation related to
>>>>> Gluster mounted fs hangs for a while
>>>>> =======================
>>>>> Oh, right!
>>>>>
>>>>> Here are the outputs:
>>>>>
>>>>> root@web3:/export/images1-1/brick# time getfattr -m . -d -e hex
>>>>> templates/assets/prod/temporary/13/user_1339200.png
>>>>> # file: templates/assets/prod/temporary/13/user_1339200.png
>>>>> trusted.afr.site-images-client-0=0x000000000000000400000000
>>>>> trusted.afr.site-images-client-1=0x000000020000000900000000
>>>>> trusted.gfid=0x10e5894c474a4cb1898b71e872cdf527
>>>>>
>>>>> real 0m0.024s
>>>>> user 0m0.001s
>>>>> sys 0m0.001s
>>>>>
>>>>> root@web4:/export/images2-1/brick# time getfattr -m . -d -e hex
>>>>> templates/assets/prod/temporary/13/user_1339200.png
>>>>> # file: templates/assets/prod/temporary/13/user_1339200.png
>>>>> trusted.afr.site-images-client-0=0x000000000000000000000000
>>>>> trusted.afr.site-images-client-1=0x000000000000000000000000
>>>>> trusted.gfid=0xd02f14fcb6724ceba4a330eb606910f3
>>>>>
>>>>> real 0m0.003s
>>>>> user 0m0.000s
>>>>> sys 0m0.006s
>>>>>
>>>>> Not sure exactly what that means. I'm googling, and would appreciate it
>>>>> if you guys could shed some light.
>>>>>
>>>>> Thanks!
>>>>> --
>>>>> Tiago
>>>>>
>>>>> On Mon, Jan 26, 2015 at 6:16 PM, A Ghoshal <[email protected]> wrote:
>>>>>
>>>>>> Actually, you ran getfattr on the volume - which is why the requisite
>>>>>> extended attributes never showed up...
>>>>>>
>>>>>> Your bricks are mounted elsewhere:
>>>>>> /export/images1-1/brick and /export/images2-1/brick.
>>>>>>
>>>>>> By the way, what version of Linux do you use? And are the files you
>>>>>> observe the input/output errors on soft links?
>>>>>>
>>>>>> -----Tiago Santos <[email protected]> wrote: -----
>>>>>>
>>>>>> =======================
>>>>>> To: A Ghoshal <[email protected]>
>>>>>> From: Tiago Santos <[email protected]>
>>>>>> Date: 01/27/2015 12:20AM
>>>>>> Cc: gluster-users <[email protected]>
>>>>>> Subject: Re: [Gluster-users] Pretty much any operation related to
>>>>>> Gluster mounted fs hangs for a while
>>>>>> =======================
>>>>>> Thanks for your input, Anirban.
>>>>>>
>>>>>> I ran the commands on both servers, with the following results:
>>>>>>
>>>>>> root@web3:/var/www/site-images# time getfattr -m . -d -e hex
>>>>>> templates/assets/prod/temporary/13/user_1339200.png
>>>>>>
>>>>>> real 0m34.524s
>>>>>> user 0m0.004s
>>>>>> sys 0m0.000s
>>>>>>
>>>>>> root@web4:/var/www/site-images# time getfattr -m . -d -e hex
>>>>>> templates/assets/prod/temporary/13/user_1339200.png
>>>>>> getfattr: templates/assets/prod/temporary/13/user_1339200.png:
>>>>>> Input/output error
>>>>>>
>>>>>> real 0m11.315s
>>>>>> user 0m0.001s
>>>>>> sys 0m0.003s
>>>>>>
>>>>>> root@web4:/var/www/site-images# ls
>>>>>> templates/assets/prod/temporary/13/user_1339200.png
>>>>>> ls: cannot access templates/assets/prod/temporary/13/user_1339200.png:
>>>>>> Input/output error
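Since the thread ends on the mismatched trusted.gfid values, here is a rough sketch of the GFID split-brain recovery that Joe Julian's blog (referenced above) describes: remove the copy you decide to discard from its brick, along with its hard link under .glusterfs, then let self-heal recreate it from the good copy. The brick path and GFID below are just the example from this thread, and deciding which copy is the bad one is a judgment call; this is not a recommendation to discard web4's copy specifically.

    # 1. List the files Gluster itself considers split-brain
    gluster volume heal site-images info split-brain

    # 2. On the brick whose copy will be discarded (example paths only):
    cd /export/images2-1/brick
    rm templates/assets/prod/temporary/13/user_1339200.png
    # The .glusterfs hard link lives at .glusterfs/<first 2 hex>/<next 2 hex>/<full gfid>,
    # so trusted.gfid=0xd02f14fc... maps to:
    rm .glusterfs/d0/2f/d02f14fc-b672-4ceb-a4a3-30eb606910f3

    # 3. Trigger self-heal by touching the file through a client mount
    stat /var/www/site-images/templates/assets/prod/temporary/13/user_1339200.png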
_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
