[jira] [Commented] (ACCUMULO-2227) Concurrent randomwalk fails when namenode dies after bulk import step

Bill Havanki (JIRA) Wed, 22 Jan 2014 07:27:24 -0800

    [ https://issues.apache.org/jira/browse/ACCUMULO-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878758#comment-13878758 ]


Bill Havanki commented on ACCUMULO-2227:
----------------------------------------

The failure here is ultimately due to running the test under Hadoop 2.0.0. 
Before Hadoop 2.1.0, client-to-namenode calls like {{delete}}, which are 
annotated as {{AtMostOnce}} in {{org.apache.hadoop.hdfs.protocol.ClientProtocol}}, 
were not retried; only operations marked {{Idempotent}} were. Starting with the 
implementation of HADOOP-9792 in Hadoop 2.1.0, {{AtMostOnce}}-annotated 
operations are also retried. So, I expect that upgrading my cluster to Hadoop 
2.1.0 or higher would resolve this issue.

The {{mkdirs}} call is annotated as {{Idempotent}} so it should not cause this 
problem, even under Hadoop 2.0.0.

I'm not sure that adding an _ad hoc_ retry to the test is the best way to 
resolve this, so any opinions are welcome.
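
For discussion, here is a minimal sketch of what such an ad hoc retry could 
look like. It is not taken from the Accumulo or Hadoop code base: the class 
name {{RetrySketch}}, the method {{retry}}, and its parameters are all 
hypothetical, and the helper deliberately has no Hadoop dependency so that it 
can be read and run on its own. It simply re-invokes an operation when it 
throws {{IOException}}, up to a fixed number of attempts.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration only; not part of Accumulo or Hadoop. */
public class RetrySketch {

    /**
     * Calls op until it returns without throwing IOException, or until
     * maxAttempts calls have been made. Sleeps sleepMillis between attempts.
     * Exceptions other than IOException are not retried and propagate at once.
     */
    public static <T> T retry(Callable<T> op, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                last = e; // remember the failure in case every attempt fails
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis); // give failover time to complete
                }
            }
        }
        throw last; // all attempts failed; surface the most recent IOException
    }
}
```

In the test, a call such as {{fs.delete(dir, true)}} could be wrapped in a 
{{Callable}} and passed to this helper. One caveat with this approach: if the 
first attempt actually reached the namenode before it died, a retried 
{{delete}} may find the path already gone, so the caller would need to treat 
"path no longer exists" as success rather than as a failure.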

> Concurrent randomwalk fails when namenode dies after bulk import step
> ---------------------------------------------------------------------
>
>                 Key: ACCUMULO-2227
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2227
>             Project: Accumulo
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 1.4.4
>            Reporter: Bill Havanki
>              Labels: ha, randomwalk, test
>
> Running Concurrent randomwalk under HDFS HA, if the active namenode is killed:
> {noformat}
> 20 12:27:51,119 [retry.RetryInvocationHandler] WARN : Exception while invoking class org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.delete. Not retrying because the invoked method is not idempotent, and unable to determine whether it was invoked
> java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "slave.domain.com/10.20.200.113"; destination host is: "namenode.domain.com":8020;
> ...
> at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1487)
> at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:355)
> at org.apache.accumulo.server.test.randomwalk.concurrent.BulkImport.visit(BulkImport.java:140)
> ...
> Caused by: java.io.IOException: Response is null.
> at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:952)
> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:847)
> {noformat}
> This arises from an HDFS path delete call that cleans up after the bulk 
> import. The test should be resilient here (and where the paths are created 
> earlier in the test) so that the test can continue once failover has 
> completed.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
