Avati,

Unfortunately, this fix does not resolve the problem. I am running mainline-2.4-patch-175 on all servers and clients. The find -exec test still fails to complete every time I run it. Once it ended in a segfault on the client (the log was emailed to you). Other times it fails and recovers after about 1 minute. The same errors occur on the bricks when the process fails. I am still unable to complete a 2-hour mailbox reindex task on the Zimbra machine because of a disconnect/reconnect about an hour in. This is also a read-intensive task with no writing. I will report on the other spurious disconnects after running for 24 hours.

Harris

----- Original Message -----
From: "Anand Avati" <[EMAIL PROTECTED]>
To: "Harris Landgarten" <[EMAIL PROTECTED]>
Cc: "gluster-devel" <[email protected]>
Sent: Monday, June 4, 2007 8:44:54 AM (GMT-0500) America/New_York
Subject: Re: [Gluster-devel] no such file of directory errors running find -exec on large trees

Harris,
a fix has been committed for a bug in the timer code that used to cause spurious transport timeout events, resulting in false bailing. Please try with the latest tla checkout from glusterfs--mainline--2.4.

thanks,
avati

2007/5/26, Harris Landgarten <[EMAIL PROTECTED]>:
> I have been running some tests on a gluster tree with over 100,000 files
> using find. When I run find /mnt/gluster -type f it completes without error.
> When I combine the find with -exec so that the files are read (i.e. find
> /mnt/gluster -type f -exec head -n1 {} \; ), the process fails to complete
> 100% of the time with "No such file or directory" errors. The errors do not
> occur until after at least 30,000 files have been processed. After the
> process fails the clients and servers still appear to be running properly.
> The errors seem to coincide with the following in the logs:
>
> Client running process:
>
> [May 25 10:11:14] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
> [May 25 10:11:14] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
> [May 25 10:11:14] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
> [May 25 10:11:14] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
> [May 25 10:11:14] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
> [May 25 10:11:14] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
> ...
> [May 25 10:12:34] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:client2: connection to server disconnected
> [May 25 10:12:34] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
> [May 25 10:12:34] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
> [May 25 10:12:34] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:12:34] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:12:34] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:12:34] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:12:34] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:12:34] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:12:34] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:12:34] [ERROR/tcp.c:110/tcp_except()] transport/tcp:shutdown () - error: Transport endpoint is not connected
> [May 25 10:12:34] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=104)
> [May 25 10:12:34] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:client2: connection to server disconnected
> [May 25 10:12:34] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=9)
> ...
>
> brick1:
>
> [May 25 10:11:26] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=2)
> [May 25 10:11:26] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:11:26] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=2)
> [May 25 10:11:26] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:11:26] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=2)
> [May 25 10:11:26] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:11:26] [ERROR/common-utils.c:110/full_rwv()] libglusterfs:full_rwv: 50680 bytes r/w instead of 124559 (Broken pipe)
> [May 25 10:11:26] [ERROR/proto-srv.c:117/generic_reply()] protocol/server:transport_writev failed
> [May 25 10:11:26] [ERROR/tcp.c:110/tcp_except()] transport/tcp:shutdown () - error: Transport endpoint is not connected
> [May 25 10:11:26] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=107)
> [May 25 10:11:26] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:11:43] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=9)
> ...
>
> brick2:
>
> [May 25 10:11:25] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=2)
> [May 25 10:11:25] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:11:25] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=9)
> [May 25 10:11:25] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:11:43] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=2)
> [May 25 10:11:43] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:12:46] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=2)
> [May 25 10:12:46] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:12:46] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=2)
> [May 25 10:12:46] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:12:46] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=2)
> [May 25 10:12:46] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:12:46] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=9)
> [May 25 10:12:46] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:12:46] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=9)
> [May 25 10:12:46] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> [May 25 10:12:46] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=9)
> [May 25 10:12:46] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:server: connection to server disconnected
> ...
>
> Client2 running (logging = debug)
>
> [May 25 10:12:06] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:12:06] [DEBUG/tcp.c:123/cont_hand()] tcp:forcing poll/read/write to break on blocked socket (if any)
> [May 25 10:12:06] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=115)
> [May 25 10:12:06] [DEBUG/protocol.c:244/gf_block_unserialize_transport()] libglusterfs/protocol:gf_block_unserialize_transport: full_read of header failed
> [May 25 10:12:06] [DEBUG/client-protocol.c:2605/client_protocol_cleanup()] protocol/client:cleaning up state in transport object 0x8077418
> [May 25 10:12:06] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:client1: connection to server disconnected
> [May 25 10:12:06] [DEBUG/tcp-client.c:180/tcp_connect()] transport: tcp: :try_connect: socket fd = 4
> [May 25 10:12:06] [DEBUG/tcp-client.c:202/tcp_connect()] transport: tcp: :try_connect: finalized on port `1023'
> [May 25 10:12:06] [DEBUG/tcp-client.c:226/tcp_connect()] tcp/client:try_connect: defaulting remote-port to 6996
> [May 25 10:12:06] [DEBUG/tcp-client.c:262/tcp_connect()] tcp/client:connect on 4 in progress (non-blocking)
> [May 25 10:12:06] [DEBUG/tcp-client.c:301/tcp_connect()] tcp/client:connection on 4 still in progress - try later
> [May 25 10:12:06] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
> [May 25 10:12:06] [DEBUG/client-protocol.c:2605/client_protocol_cleanup()] protocol/client:cleaning up state in transport object 0x8077418
> [May 25 10:12:26] [DEBUG/tcp-client.c:310/tcp_connect()] tcp/client:connection on 4 success, attempting to handshake
> [May 25 10:12:26] [DEBUG/tcp-client.c:54/do_handshake()] transport/tcp-client:dictionary length = 50
> [May 25 10:12:26] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:12:26] [DEBUG/tcp.c:123/cont_hand()] tcp:forcing poll/read/write to break on blocked socket (if any)
> [May 25 10:13:11] [CRITICAL/client-protocol.c:218/call_bail()] client/protocol:bailing transport
> [May 25 10:13:11] [DEBUG/tcp.c:123/cont_hand()] tcp:forcing poll/read/write to break on blocked socket (if any)
> [May 25 10:13:11] [ERROR/common-utils.c:55/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=115)
> [May 25 10:13:11] [DEBUG/protocol.c:244/gf_block_unserialize_transport()] libglusterfs/protocol:gf_block_unserialize_transport: full_read of header failed
> [May 25 10:13:11] [DEBUG/client-protocol.c:2605/client_protocol_cleanup()] protocol/client:cleaning up state in transport object 0x8077418
> [May 25 10:13:11] [CRITICAL/tcp.c:81/tcp_disconnect()] transport/tcp:client1: connection to server disconnected
> [May 25 10:13:11] [DEBUG/tcp-client.c:180/tcp_connect()] transport: tcp: :try_connect: socket fd = 4
> [May 25 10:13:11] [DEBUG/tcp-client.c:202/tcp_connect()] transport: tcp: :try_connect: finalized on port `1022'
> [May 25 10:13:11] [DEBUG/tcp-client.c:226/tcp_connect()] tcp/client:try_connect: defaulting remote-port to 6996
> [May 25 10:13:11] [DEBUG/tcp-client.c:262/tcp_connect()] tcp/client:connect on 4 in progress (non-blocking)
> [May 25 10:13:11] [DEBUG/tcp-client.c:301/tcp_connect()] tcp/client:connection on 4 still in progress - try later
> ...
>
> Both clients are running with read-ahead and write_behind on and
> stat_prefetch off. All machines are on current mainline-2.4 as of patch 174.
> Since both bricks are seeing the same error, does that mean the client is
> causing the error? The errors do not always start at the same file but always
> happen after a lot of files are processed.
>
> Are there any specific tests you would like run?
>
> Best,
>
> Harris
>
> _______________________________________________
> Gluster-devel mailing list
> [email protected]
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>

--
Anand V. Avati
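
For anyone triaging logs like the ones quoted above, here is a minimal Python sketch (not part of GlusterFS) that tallies entries by level and function and decodes the errno values. It assumes each log entry sits on a single line, optionally prefixed with mail quoting. The names `summarize` and `describe_errno` are made up for illustration. Errno numbers are decoded for whatever platform the script runs on, so it should be run on the same OS that produced the logs.

```python
import errno
import os
import re
from collections import Counter

# Assumed entry layout, taken from the quoted logs:
# [May 25 10:12:34] [ERROR/common-utils.c:55/full_rw()] message (errno=104)
ENTRY = re.compile(
    r"\[(?P<stamp>\w{3} +\d+ \d\d:\d\d:\d\d)\] "
    r"\[(?P<level>[A-Z]+)/(?P<file>[^:/]+):(?P<line>\d+)/(?P<func>\w+)\(\)\] "
    r"(?P<msg>.*)"
)
ERRNO = re.compile(r"\(errno=(\d+)\)")


def describe_errno(code):
    """Return 'NAME (text)' for an errno number on the current platform."""
    name = errno.errorcode.get(code, "unknown")
    return f"{name} ({os.strerror(code)})"


def summarize(lines):
    """Count entries per (level, function) and per errno value."""
    by_source = Counter()
    by_errno = Counter()
    for raw in lines:
        # Drop mail quoting ("> ") if the log was pasted from a reply.
        text = raw.lstrip("> ").strip()
        match = ENTRY.match(text)
        if match is None:
            continue  # not a log entry (prose, "...", blank line)
        by_source[(match["level"], match["func"])] += 1
        found = ERRNO.search(match["msg"])
        if found:
            by_errno[int(found.group(1))] += 1
    return by_source, by_errno


if __name__ == "__main__":
    sample = [
        "> [May 25 10:12:34] [CRITICAL/tcp.c:81/tcp_disconnect()] "
        "transport/tcp:client2: connection to server disconnected",
        "> [May 25 10:12:34] [ERROR/common-utils.c:55/full_rw()] "
        "libglusterfs:full_rw: 0 bytes r/w instead of 113 (errno=9)",
    ]
    sources, errnos = summarize(sample)
    for (level, func), count in sources.most_common():
        print(level, func, count)
    for code, count in errnos.most_common():
        print(code, describe_errno(code), count)
```

Feeding the client and brick logs through `summarize` separately should make it easier to see whether the bricks and the client report the same errno values at the same timestamps.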
