OK, thanks. If I have further follow-ups I'll attach them to that bug report.
.. Lana ([email protected]) On Mon, Dec 6, 2010 at 9:46 PM, Raghavendra G <[email protected]> wrote: > Hi Lana, > > We are able to reproduce the issue locally and are working on a fix to it. > Progress of the bug can be tracked at: > http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=2197. > > Thanks for your inputs. > > regards, > ----- Original Message ----- > From: "Lana Deere" <[email protected]> > To: "Raghavendra G" <[email protected]> > Cc: [email protected] > Sent: Tuesday, December 7, 2010 12:20:08 AM > Subject: Re: [Gluster-users] 3.1.1 crashing under moderate load > > One other observation is that it seems to be genuinely related to the > number of nodes involved. > > If I run, say, 50 instances of my script using 50 separate nodes, then > they almost always generate some failures. > > If I run the same number of instances, or even a much greater number, > but using only 10 separate nodes, then they seem always to work OK. > > Maybe this is due to some kind of caching behaviour? > > .. Lana ([email protected]) > > > > > > > On Mon, Dec 6, 2010 at 11:05 AM, Lana Deere <[email protected]> wrote: >> The gluster configuration is distribute, there are 4 server nodes. >> >> There are 53 physical client nodes in my setup, each with 8 cores; we >> want to sometimes run more than 400 client processes simultaneously. >> In practice we aren't yet trying that many. >> >> When I run the commands which break, I am running them on separate >> clients simultaneously. >> for host in <hosts>; do ssh $host script& done # Note the & >> When I run on 25 clients simultaneously so far I have not seen it >> fail. But if I run on 40 or 50 simultaneously it often fails. 
>>
>> Sometimes I have run more than one command on each client
>> simultaneously by listing all the hosts multiple times in the
>> for-loop:
>>     for host in <hosts> <hosts> <hosts>; do ssh $host script& done
>> In the case of 3 at a time, I have noticed that when a host works, all
>> three on that client will work; but when it fails, all three fail in
>> exactly the same fashion.
>>
>> I've attached a tarfile containing two sets of logs.  In both cases I
>> had rotated all the log files and rebooted everything before running
>> my test.  In the first set of logs, I went directly to approx. 50
>> simultaneous sessions, and pretty much all of them just hung.  (When
>> the find hangs, even a kill -9 will not unhang it.)  So I rotated the
>> logs again and rebooted everything, but this time I gradually worked
>> my way up to higher loads.  This time the failures were mostly cases
>> with the wrong checksum but no error message, though some of them did
>> give me errors like
>>     find: lib/kbd/unimaps/cp865.uni: Invalid argument
>>
>> Thanks.  I may try downgrading to 3.1.0 just to see whether I have
>> the same problem there.
>>
>> .. Lana   ([email protected])
>>
>>
>> On Mon, Dec 6, 2010 at 12:30 AM, Raghavendra G <[email protected]> wrote:
>>> Hi Lana,
>>>
>>> I need some clarifications about the test setup:
>>>
>>> * Are you seeing the problem only when there are more than 25
>>> clients?  If so, are these clients on different physical nodes, or
>>> does more than one client share the same node?  In other words, on
>>> how many physical nodes are the clients mounted in your test setup?
>>> Also, are you running the command on each of these clients
>>> simultaneously?
>>>
>>> * Or is it that there are more than 25 concurrent invocations of the
>>> script?  If this is the case, how many clients are present in your
>>> test setup, and on how many physical nodes are these clients mounted?
>>>
>>> regards,
>>> ----- Original Message -----
>>> From: "Lana Deere" <[email protected]>
>>> To: [email protected]
>>> Sent: Saturday, December 4, 2010 12:13:30 AM
>>> Subject: [Gluster-users] 3.1.1 crashing under moderate load
>>>
>>> I'm running GlusterFS 3.1.1, CentOS 5.5 servers, CentOS 5.4 clients,
>>> RDMA transport, native/fuse access.
>>>
>>> I have a directory which is shared on the gluster.  In fact, it is a
>>> clone of /lib from one of the clients, shared so all can see it.
>>>
>>> I have a script which does
>>>     find lib -type f -print0 | xargs -0 sum | md5sum
>>>
>>> If I run this on my clients one at a time, they all yield the same
>>> md5sum:
>>>     for host in <<hosts>>; do ssh $host script; done
>>>
>>> If I run this on my clients concurrently, up to roughly 25 at a time,
>>> they still yield the same md5sum:
>>>     for host in <<hosts>>; do ssh $host script& done
>>>
>>> Beyond that the gluster share often, but not always, fails.  The
>>> errors vary:
>>> - sometimes I get "sum: xxx.so not found"
>>> - sometimes I get the wrong checksum without any error message
>>> - sometimes the job simply hangs until I kill it
>>>
>>> Some of the server logs show messages like these from the time of the
>>> failures (other servers show nothing from around that time):
>>>
>>> [2010-12-03 10:03:06.34328] E [rdma.c:4442:rdma_event_handler]
>>> rpc-transport/rdma: rdma.RaidData-server: pollin received on tcp
>>> socket (peer: 10.54.255.240:1022) after handshake is complete
>>> [2010-12-03 10:03:06.34363] E [rpcsvc.c:1548:rpcsvc_submit_generic]
>>> rpc-service: failed to submit message (XID: 0x55e82, Program:
>>> GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
>>> (rdma.RaidData-server)
>>> [2010-12-03 10:03:06.34377] E [server.c:137:server_submit_reply] :
>>> Reply submission failed
>>> [2010-12-03 10:03:06.34464] E [rpcsvc.c:1548:rpcsvc_submit_generic]
>>> rpc-service: failed to submit message (XID: 0x55e83, Program:
>>> GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
>>> (rdma.RaidData-server)
>>> [2010-12-03 10:03:06.34520] E [server.c:137:server_submit_reply] :
>>> Reply submission failed
>>>
>>>
>>> On a client which had a failure I see messages like:
>>>
>>> [2010-12-03 10:03:06.21290] E [rdma.c:4442:rdma_event_handler]
>>> rpc-transport/rdma: RaidData-client-1: pollin received on tcp socket
>>> (peer: 10.54.50.101:24009) after handshake is complete
>>> [2010-12-03 10:03:06.21776] E [rpc-clnt.c:338:saved_frames_unwind]
>>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769]
>>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e)
>>> [0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
>>> [0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1)
>>> op(READ(12)) called at 2010-12-03 10:03:06.20492
>>> [2010-12-03 10:03:06.21821] E [rpc-clnt.c:338:saved_frames_unwind]
>>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769]
>>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e)
>>> [0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
>>> [0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1)
>>> op(READ(12)) called at 2010-12-03 10:03:06.20529
>>> [2010-12-03 10:03:06.26827] I
>>> [client-handshake.c:993:select_server_supported_programs]
>>> RaidData-client-1: Using Program GlusterFS-3.1.0, Num (1298437),
>>> Version (310)
>>> [2010-12-03 10:03:06.27029] I
>>> [client-handshake.c:829:client_setvolume_cbk] RaidData-client-1:
>>> Connected to 10.54.50.101:24009, attached to remote volume '/data'.
>>> [2010-12-03 10:03:06.27067] I
>>> [client-handshake.c:698:client_post_handshake] RaidData-client-1: 2
>>> fds open - Delaying child_up until they are re-opened
>>>
>>>
>>> Anyone else seen anything like this and/or have suggestions about
>>> options I can set to work around this?
>>>
>>> .. Lana   ([email protected])
_______________________________________________
Gluster-users mailing list
[email protected]
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
