Unhandled failures starting jobs with S3 as backing store
---------------------------------------------------------

                 Key: HADOOP-4637
                 URL: https://issues.apache.org/jira/browse/HADOOP-4637
             Project: Hadoop Core
          Issue Type: Bug
          Components: fs/s3
    Affects Versions: 0.18.1
            Reporter: Robert


I run Hadoop 0.18.1 on Amazon EC2, with S3 as the backing store.

When starting jobs, I sometimes get the following failure, which causes the job 
to be abandoned:

org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.NullPointerException
        at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.retrieveBlock(Jets3tFileSystemStore.java:222)
        at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy4.retrieveBlock(Unknown Source)
        at org.apache.hadoop.fs.s3.S3InputStream.blockSeekTo(S3InputStream.java:160)
        at org.apache.hadoop.fs.s3.S3InputStream.read(S3InputStream.java:119)
        at java.io.DataInputStream.read(DataInputStream.java:83)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:214)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:150)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1212)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1193)
        at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:177)
        at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1783)
        at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
        at org.apache.hadoop.ipc.Client.call(Client.java:715)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.mapred.$Proxy5.submitJob(Unknown Source)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1026)

The stack trace suggests that copying the job file fails because the S3 
filesystem cannot find all of the expected block objects when it needs them: 
Jets3tFileSystemStore.retrieveBlock apparently returns, or dereferences, a 
null when a block object is missing.

Since S3 is only eventually consistent and does not always provide an 
up-to-date view of the stored data, this execution path should probably be 
hardened: at minimum, retry the failed operation, or wait for the expected 
block object if it hasn't shown up yet.
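As a sketch of what such hardening could look like (this is hypothetical code, not the actual Jets3tFileSystemStore implementation; the BlockFetcher interface and retrieveWithRetry method are invented for illustration), the retrieval could poll a bounded number of times and then fail with a descriptive IOException instead of an NPE:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: retry a block retrieval that may transiently
// return null while S3 catches up (eventual consistency).
public class RetryRetrieve {

    // Stand-in for the S3 store call; not a real Hadoop interface.
    interface BlockFetcher {
        InputStream fetch(long blockId) throws IOException;
    }

    static InputStream retrieveWithRetry(BlockFetcher store, long blockId,
                                         int maxAttempts, long waitMillis)
            throws IOException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            InputStream in = store.fetch(blockId);
            if (in != null) {
                return in;                    // block object has appeared
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(waitMillis); // wait for S3 to catch up
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IOException("Interrupted waiting for block " + blockId);
                }
            }
        }
        // Fail with a descriptive error instead of a NullPointerException.
        throw new IOException("Block " + blockId + " not found after "
                + maxAttempts + " attempts");
    }

    public static void main(String[] args) throws IOException {
        // Simulate a block object that only becomes visible on the 3rd attempt.
        final int[] calls = {0};
        InputStream in = retrieveWithRetry(
                id -> ++calls[0] < 3 ? null : new ByteArrayInputStream(new byte[]{1}),
                42L, 5, 10L);
        System.out.println("attempts=" + calls[0] + " got=" + (in != null));
    }
}
```

Hadoop already wraps the store in a RetryInvocationHandler (visible in the trace above), but that only retries thrown exceptions; a null return slips through, so the check-and-wait would need to happen where the block is actually resolved.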



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
