[ 
https://issues.apache.org/jira/browse/ACCUMULO-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913477#comment-13913477
 ] 

Mike Drob commented on ACCUMULO-2388:
-------------------------------------

Started looking more into the time-out issues, and discovered that I'm not 
setting {{--batchTimeout}} and it is defaulting to {{Long.MAX_VALUE}}. So the 
client is _always_ going to still be waiting and the tablet server will never 
be able to safely abort the operation. At least, I hope that I'm reading these 
properties correctly.

The longest hold time that I see in the logs is 170s, and my 
general.rpc.timeout is 120s, so the {{HoldTimeoutException}} makes sense, from 
a sanity check point of view. Doubling the wait time on the tserver would 
likely solve the issue here, yes, but I'm worried that's not a stable approach.

The question that remains is how best to prevent ingest clients from dying, 
now. There may be different short term and long term solutions that are 
appropriate.
Short term: Let ingest retry the failed mutations. If the tablet server rejects 
the mutation for whatever reason, and the client is aware of it, then this 
wouldn't constitute data loss.
Long term: Provide an API for clients to specify their wait time to the 
servers. Then, tablet servers can more intelligently decide when to abort scans 
and when not to. Would need to protect the cluster from several poorly 
configured clients asking for very long hold times. Possibly other availability 
issues would be present as well.

Thoughts?

> Continuous Ingest clients die
> -----------------------------
>
>                 Key: ACCUMULO-2388
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2388
>             Project: Accumulo
>          Issue Type: Bug
>          Components: test, tserver
>         Environment: 1.6.0-SNAPSHOT (sha-1: 0da9a56)
> cdh4.5.0
>            Reporter: Mike Drob
>              Labels: 16_qa_bug
>             Fix For: 1.6.0
>
>         Attachments: tracer.debug.log, tserver1.log
>
>
> I was running continuous ingest on a 7 node cluster (5 slaves) and after 
> enabling HDFS agitation, my clients died.
> {code:title=ingest.err}
> Thread "org.apache.accumulo.test.continuous.ContinuousIngest" died 
> java.lang.reflect.InvocationTargetException
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.accumulo.start.Main$1.run(Main.java:137)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.reflect.UndeclaredThrowableException
> at $Proxy9.addMutation(Unknown Source)
> at 
> org.apache.accumulo.test.continuous.ContinuousIngest.main(ContinuousIngest.java:212)
> ... 6 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at 
> org.apache.accumulo.trace.instrument.TraceProxy$2.invoke(TraceProxy.java:43)
> ... 8 more
> Caused by: org.apache.accumulo.core.client.MutationsRejectedException: # 
> constraint violations : 0 security codes: {} # server errors 1 # exceptions 0
> at 
> org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
> at 
> org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:258)
> at 
> org.apache.accumulo.core.client.impl.BatchWriterImpl.addMutation(BatchWriterImpl.java:43)
> ... 12 more
> {code}
> {code:title=ingest.out}
> UUID 1392844086463 f822a6a9-9592-4b3a-ab3b-1c172be20b96
> FLUSH 1392844135523 49047 6165 1000000 1000000
> FLUSH 1392844165594 30071 7787 2000000 1000000
> FLUSH 1392844195875 30281 7816 3000000 1000000
> FLUSH 1392844226787 30912 8086 4000000 1000000
> FLUSH 1392844257194 30407 7989 5000000 1000000
> FLUSH 1392844287518 30324 7743 6000000 1000000
> FLUSH 1392844325833 38315 10933 7000000 1000000
> FLUSH 1392844364708 38875 7916 8000000 1000000
> FLUSH 1392844395818 31110 8104 9000000 1000000
> 2014-02-19 13:16:57,444 [impl.TabletServerBatchWriter] ERROR: Server side 
> error on tserver1:10011: org.apache.thrift.TApplicationException: Internal 
> error processing applyUpdates
> 2014-02-19 13:16:57,446 [impl.TabletServerBatchWriter] ERROR: Failed to send 
> tablet server tserver1:10011 its batch : Error on server tserver1:10011
> org.apache.accumulo.core.client.impl.AccumuloServerException: Error on server 
> tserver1:10011
> at 
> org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:937)
> at 
> org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.access$1600(TabletServerBatchWriter.java:616)
> at 
> org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter$SendTask.send(TabletServerBatchWriter.java:801)
> at 
> org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter$SendTask.run(TabletServerBatchWriter.java:765)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at 
> org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
> at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.thrift.TApplicationException: Internal error processing 
> applyUpdates
> at 
> org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
> at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
> at 
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_closeUpdate(TabletClientService.java:431)
> at 
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.closeUpdate(TabletClientService.java:417)
> at 
> org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:899)
> ... 11 more
> {code}
> {code:title=tserver.log}
> 2014-02-19 13:16:56,156 [util.TServerUtils$THsHaServer] WARN : Got an 
> IOException in internalRead!
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcher.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
> at sun.nio.ch.IOUtil.read(IOUtil.java:171)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
> at 
> org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
> at 
> org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:515)
> at 
> org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:355)
> at 
> org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:202)
> at 
> org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.select(TNonblockingServer.java:198)
> at 
> org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.run(TNonblockingServer.java:154)
> {code}
> Note that this last message was not propagated to the monitor for some 
> reason, but that is likely a different issue. (I had been seeing other WARN 
> messages show up earlier.)



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Reply via email to