I wonder if this is happening with replication tests because something in the replication code specifically is failing to close connections and/or return them to the pool. It would probably be somewhere in the drain() code, because that's where I saw these tests stuck most often.
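
To make the failure mode concrete, here is a minimal sketch of the pattern I would look for. Everything in it is hypothetical: Transport, TransportPool, and callOnce are stand-ins I made up for illustration, not Accumulo's or Thrift's actual classes, and I have not confirmed that drain() does anything like this. The point is only that a transport borrowed before an RPC has to go back to the pool (or be closed) in a finally block, otherwise every failed attempt in a retry loop leaks one socket.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

class PoolSketch {
    /** Hypothetical stand-in for a pooled Thrift transport. */
    static final class Transport implements AutoCloseable {
        private boolean open = true;
        boolean isOpen() { return open; }
        @Override public void close() { open = false; }
    }

    /** Hypothetical bounded pool; NOT Accumulo's real transport pool. */
    static final class TransportPool {
        private final Deque<Transport> idle = new ArrayDeque<>();
        private final int maxOpen;
        private int checkedOut = 0;

        TransportPool(int maxOpen) { this.maxOpen = maxOpen; }

        synchronized Transport borrow() {
            // Mimics running out of sockets/fds when transports are never returned.
            if (checkedOut >= maxOpen) {
                throw new IllegalStateException("pool exhausted: " + checkedOut + " checked out");
            }
            checkedOut++;
            Transport t = idle.poll();
            return (t != null) ? t : new Transport();
        }

        synchronized void giveBack(Transport t, boolean healthy) {
            checkedOut--;
            if (healthy && t.isOpen()) {
                idle.push(t);   // reuse a transport that completed its call
            } else {
                t.close();      // never reuse a transport that saw an error
            }
        }

        synchronized int checkedOut() { return checkedOut; }
    }

    /** Hypothetical RPC body that may throw, e.g. a custom Thrift exception. */
    interface Rpc { void call(Transport t) throws Exception; }

    /** Runs one RPC attempt; the transport is always handed back, even on failure. */
    static boolean callOnce(TransportPool pool, Rpc rpc) {
        Transport t = pool.borrow();
        boolean ok = false;
        try {
            rpc.call(t);
            ok = true;
        } catch (Exception e) {
            // Swallowed for the sketch; a real caller would log and decide whether to retry.
        } finally {
            pool.giveBack(t, ok);
        }
        return ok;
    }

    public static void main(String[] args) {
        TransportPool pool = new TransportPool(4);
        // Simulate a retry loop in which every attempt fails.
        for (int i = 0; i < 100; i++) {
            callOnce(pool, t -> { throw new IOException("simulated failure"); });
        }
        System.out.println("checked out after 100 failed calls: " + pool.checkedOut());
    }
}
```

If the giveBack call were in the success path only, the loop in main would hit "pool exhausted" on the fifth attempt, which is roughly the shape of what Marc describes below.
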
On Tue, Aug 30, 2016 at 6:18 PM Marc P. <[email protected]> wrote:

> Yah I saw this a lot when I wasn't closing thrift connections...but also
> saw it when the client would close prematurely and not return the transport
> to the thrift transport pool.
>
> In one case I hadn't finished with the work in a thread but kept opening
> thrift connections since it would be 'time sliced' for io. In that case I
> opened too many sockets (fds)...maybe hitting max open files because a
> transport isn't being returned in the middle of a work unit?
>
> On Tue, Aug 30, 2016, 6:12 PM Christopher <[email protected]> wrote:
>
> > Thrift is not happy on some replication ITs I've run lately. I had one
> > test timeout after 40 minutes... and it never finished. The symptom is
> > lots of client side messages about failure to open transport, and the
> > server side messages were (and both were occurring a *lot*, indicating
> > indefinite retries):
> >
> > 2016-08-30 19:48:13,476 [rpc.CustomNonBlockingServer$CustomFrameBuffer] WARN : Got an IOException in internalRead!
> > java.io.IOException: Connection reset by peer
> >     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> >     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> >     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> >     at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> >     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
> >     at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:142)
> >     at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:539)
> >     at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:338)
> >     at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:203)
> >     at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.select(TNonblockingServer.java:203)
> >     at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.run(TNonblockingServer.java:154)
> >
> > I saw one comment on a mailing list somewhere that indicated this might
> > be caused by a client side handling of a custom Thrift Exception, not
> > properly closing the connection. It's possible we're doing something
> > badly before we retry. I think more investigation is needed before I file
> > a JIRA (not even sure what to file it against, right now... because I'm
> > not sure what component is even at fault).
> >
> > In the meantime, has anybody seen this? Does anybody have any insight
> > into this? This is all on a single node, running ITs. There really
> > shouldn't be any "network" problems which would cause a TCP reset from
> > external to the test and Accumulo itself, since it's all localhost.
