OK, thanks. If I have further follow-ups I'll attach them to that bug report.
.. Lana ([email protected]) On Mon, Dec 6, 2010 at 9:46 PM, Raghavendra G <[email protected]> wrote: > Hi Lana, > > We are able to reproduce the issue locally and are working on a fix to it. > Progress of the bug can be tracked at: > http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2197. > > Thanks for your inputs. > > regards, > ----- Original Message ----- > From: "Lana Deere" <[email protected]> > To: "Raghavendra G" <[email protected]> > Cc: [email protected] > Sent: Tuesday, December 7, 2010 12:20:08 AM > Subject: Re: [Gluster-users] 3.1.1 crashing under moderate load > > One other observation is that it seems to be genuinely related to the > number of nodes involved. > > If I run, say, 50 instances of my script using 50 separate nodes, then > they almost always generate some failures. > > If I run the same number of instances, or even a much greater number, > but using only 10 separate nodes, then they seem always to work OK. > > Maybe this is due to some kind of caching behaviour? > > .. Lana ([email protected]) > > > > > > > On Mon, Dec 6, 2010 at 11:05 AM, Lana Deere <[email protected]> wrote: >> The gluster configuration is distribute, there are 4 server nodes. >> >> There are 53 physical client nodes in my setup, each with 8 cores; we >> want to sometimes run more than 400 client processes simultaneously. >> In practice we aren't yet trying that many. >> >> When I run the commands which break, I am running them on separate >> clients simultaneously. >> for host in <hosts>; do ssh $host script& done # Note the & >> When I run on 25 clients simultaneously so far I have not seen it >> fail. But if I run on 40 or 50 simultaneously it often fails. 
>>
>> Sometimes I have run more than one command on each client
>> simultaneously by listing all the hosts multiple times in the
>> for-loop:
>>     for host in <hosts> <hosts> <hosts>; do ssh $host script& done
>> In the case of 3 at a time, I have noticed that when a host works, all
>> three on that client will work; but when it fails, all three fail in
>> exactly the same fashion.
>>
>> I've attached a tarfile containing two sets of logs.  In both cases I
>> had rotated all the log files and rebooted everything before running
>> my test.  In the first set of logs, I went directly to approx. 50
>> simultaneous sessions, and pretty much all of them just hung.  (When
>> the find hangs, even a kill -9 will not unhang it.)  So I rotated the
>> logs again and rebooted everything, but this time I gradually worked
>> my way up to higher loads.  This time the failures were mostly cases
>> with the wrong checksum but no error message, though some of them did
>> give me errors like
>>     find: lib/kbd/unimaps/cp865.uni: Invalid argument
>>
>> Thanks.  I may try downgrading to 3.1.0 just to see whether I have
>> the same problem there.
>>
>> .. Lana   ([email protected])
>>
>>
>> On Mon, Dec 6, 2010 at 12:30 AM, Raghavendra G <[email protected]> wrote:
>>> Hi Lana,
>>>
>>> I need some clarifications about the test setup:
>>>
>>> * Are you seeing the problem only when there are more than 25
>>> clients?  If so, are these clients on different physical nodes, or
>>> does more than one client share the same node?  In other words, on
>>> how many physical nodes are the clients mounted in your test setup?
>>> Also, are you running the command on each of these clients
>>> simultaneously?
>>>
>>> * Or is it that there are more than 25 concurrent invocations of the
>>> script?  If this is the case, how many clients are present in your
>>> test setup, and on how many physical nodes are these clients mounted?
>>>
>>> regards,
>>> ----- Original Message -----
>>> From: "Lana Deere" <[email protected]>
>>> To: [email protected]
>>> Sent: Saturday, December 4, 2010 12:13:30 AM
>>> Subject: [Gluster-users] 3.1.1 crashing under moderate load
>>>
>>> I'm running GlusterFS 3.1.1, CentOS 5.5 servers, CentOS 5.4 clients,
>>> RDMA transport, native/fuse access.
>>>
>>> I have a directory which is shared on the gluster.  In fact, it is a
>>> clone of /lib from one of the clients, shared so all can see it.
>>>
>>> I have a script which does
>>>     find lib -type f -print0 | xargs -0 sum | md5sum
>>>
>>> If I run this on my clients one at a time, they all yield the same
>>> md5sum:
>>>     for host in <<hosts>>; do ssh $host script; done
>>>
>>> If I run this on my clients concurrently, up to roughly 25 at a time,
>>> they still yield the same md5sum:
>>>     for host in <<hosts>>; do ssh $host script& done
>>>
>>> Beyond that the gluster share often, but not always, fails.  The
>>> errors vary:
>>> - sometimes I get "sum: xxx.so not found"
>>> - sometimes I get the wrong checksum without any error message
>>> - sometimes the job simply hangs until I kill it
>>>
>>> Some of the server logs show messages like these from the time of the
>>> failures (other servers show nothing from around that time):
>>>
>>> [2010-12-03 10:03:06.34328] E [rdma.c:4442:rdma_event_handler]
>>> rpc-transport/rdma: rdma.RaidData-server: pollin received on tcp
>>> socket (peer: 10.54.255.240:1022) after handshake is complete
>>> [2010-12-03 10:03:06.34363] E [rpcsvc.c:1548:rpcsvc_submit_generic]
>>> rpc-service: failed to submit message (XID: 0x55e82, Program:
>>> GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
>>> (rdma.RaidData-server)
>>> [2010-12-03 10:03:06.34377] E [server.c:137:server_submit_reply] :
>>> Reply submission failed
>>> [2010-12-03 10:03:06.34464] E [rpcsvc.c:1548:rpcsvc_submit_generic]
>>> rpc-service: failed to submit message (XID: 0x55e83, Program:
>>> GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
>>> (rdma.RaidData-server)
>>> [2010-12-03 10:03:06.34520] E [server.c:137:server_submit_reply] :
>>> Reply submission failed
>>>
>>>
>>> On a client which had a failure I see messages like:
>>>
>>> [2010-12-03 10:03:06.21290] E [rdma.c:4442:rdma_event_handler]
>>> rpc-transport/rdma: RaidData-client-1: pollin received on tcp socket
>>> (peer: 10.54.50.101:24009) after handshake is complete
>>> [2010-12-03 10:03:06.21776] E [rpc-clnt.c:338:saved_frames_unwind]
>>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769]
>>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e)
>>> [0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
>>> [0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1)
>>> op(READ(12)) called at 2010-12-03 10:03:06.20492
>>> [2010-12-03 10:03:06.21821] E [rpc-clnt.c:338:saved_frames_unwind]
>>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769]
>>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e)
>>> [0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
>>> [0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1)
>>> op(READ(12)) called at 2010-12-03 10:03:06.20529
>>> [2010-12-03 10:03:06.26827] I
>>> [client-handshake.c:993:select_server_supported_programs]
>>> RaidData-client-1: Using Program GlusterFS-3.1.0, Num (1298437),
>>> Version (310)
>>> [2010-12-03 10:03:06.27029] I
>>> [client-handshake.c:829:client_setvolume_cbk] RaidData-client-1:
>>> Connected to 10.54.50.101:24009, attached to remote volume '/data'.
>>> [2010-12-03 10:03:06.27067] I
>>> [client-handshake.c:698:client_post_handshake] RaidData-client-1: 2
>>> fds open - Delaying child_up until they are re-opened
>>>
>>>
>>> Anyone else seen anything like this and/or have suggestions about
>>> options I can set to work around this?
>>>
>>> .. Lana   ([email protected])
_______________________________________________
Gluster-users mailing list
[email protected]
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
