[
https://issues.apache.org/jira/browse/HADOOP-10037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968425#comment-13968425
]
David Rosenstrauch commented on HADOOP-10037:
---------------------------------------------
FYI, I recently upgraded our clusters (from CDH 4.3.0 / Hadoop to ) and it
looks like this issue might now be solved. I'm now seeing some tasks of our
Hadoop jobs fail, as they should, with the following wrong-number-of-bytes-read
exception, which then forces a retry of the task.
{code}
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 346403598; received: 15815108)
    at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:184)
    at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:204)
    at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:108)
    at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
    at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:237)
    at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:186)
    at org.apache.http.util.EntityUtils.consume(EntityUtils.java:87)
    at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.releaseConnection(HttpMethodReleaseInputStream.java:102)
    at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.close(HttpMethodReleaseInputStream.java:194)
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.seek(NativeS3FileSystem.java:152)
    at org.apache.hadoop.fs.BufferedFSInputStream.seek(BufferedFSInputStream.java:89)
    at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:63)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:102)
    at com.macrosense.mapreduce.io.PingRecordReader.initialize(PingRecordReader.java:80)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:478)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:671)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
{code}
It looks like this fix (in ContentLengthInputStream and/or EofSensorInputStream)
was added to Apache HTTP Components and/or jets3t some time in the past few
months.
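For context, the behavior the exception above demonstrates, failing loudly when the connection closes before the declared Content-Length has been delivered, can be sketched roughly like this. This is a hypothetical illustration, not the actual httpcore implementation; the class name and message format are mine.

{code}
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: a stream wrapper that tracks bytes read against the
// declared Content-Length and throws on a short read, instead of silently
// returning EOF (which is what led to the truncated-but-no-exception reads).
class LengthCheckedInputStream extends FilterInputStream {
    private final long expected; // declared Content-Length
    private long received;       // bytes actually delivered so far

    LengthCheckedInputStream(InputStream in, long expected) {
        super(in);
        this.expected = expected;
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b >= 0) {
            received++;
        } else if (received < expected) {
            // The connection closed before the full body arrived.
            throw new IOException("Premature end of Content-Length delimited "
                + "message body (expected: " + expected
                + "; received: " + received + ")");
        }
        return b;
    }
}
{code}

With a check like this in place, a truncated S3 response surfaces as an exception that fails the task and triggers a retry, rather than producing silently short data.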
> s3n read truncated, but doesn't throw exception
> ------------------------------------------------
>
> Key: HADOOP-10037
> URL: https://issues.apache.org/jira/browse/HADOOP-10037
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/s3
> Affects Versions: 2.0.0-alpha
> Environment: Ubuntu Linux 13.04 running on Amazon EC2 (cc2.8xlarge)
> Reporter: David Rosenstrauch
> Attachments: S3ReadFailedOnTruncation.html, S3ReadSucceeded.html
>
>
> For months now we've been experiencing frequent data truncation issues when
> reading from S3 using the s3n:// protocol. I was finally able to gather some
> debugging output on the issue in a job I ran last night, and so can finally
> file a bug report.
> The job I ran last night was on a 16-node cluster (all of them AWS EC2
> cc2.8xlarge machines, running Ubuntu 13.04 and Cloudera CDH4.3.0). The job
> was a Hadoop streaming job, which reads through a large number (i.e.,
> ~55,000) of files on S3, each of them approximately 300K bytes in size.
> All of the files contain 46 columns of data in each record, but I added an
> extra check in my mapper code to count and verify the number of columns in
> every record - throwing an error and crashing the map task if the column
> count is wrong.
> If you look in the attached task logs, you'll see 2 attempts on the same
> task. The first one fails due to data truncation (i.e., my job intentionally
> fails the map task because the current record fails the column count check).
> The task then gets retried on a different machine and runs to a successful
> completion.
> You can see further evidence of the truncation further down in the task logs,
> where it displays the count of the records read: the failed task says 32953
> records read, while the successful task says 63133.
> Any idea what the problem might be here and/or how to work around it? This
> issue is a very common occurrence on our clusters. E.g., in the job I ran
> last night before I had gone to bed I had already encountered 8 such
> failures, and the job was only 10% complete. (~25,000 out of ~250,000 tasks.)
> I realize that it's common for I/O errors to occur - possibly even frequently
> - in a large Hadoop job. But I would think that if an I/O failure (like a
> truncated read) did occur, that something in the underlying infrastructure
> code (i.e., either in NativeS3FileSystem or in jets3t) should detect the
> error and throw an IOException accordingly. It shouldn't be up to the
> calling code to detect such failures, IMO.
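The mapper-side column-count guard the report describes could look roughly like this. This is a standalone, hypothetical sketch: only the 46-column expectation comes from the report; the tab delimiter, class, and method names are assumptions.

{code}
import java.io.IOException;

// Hypothetical sketch of the guard described above: count the columns in
// each record and fail the task if the count is wrong, so a truncated
// record surfaces as a task failure rather than silent bad data.
class RecordValidator {
    static final int EXPECTED_COLUMNS = 46; // per the report

    // Returns the column count; throws if a record is truncated/malformed.
    static int checkColumns(String record) throws IOException {
        // Limit of -1 keeps trailing empty fields, so a record that ends in
        // empty columns is not miscounted.
        int columns = record.split("\t", -1).length;
        if (columns != EXPECTED_COLUMNS) {
            throw new IOException("Bad column count: expected "
                + EXPECTED_COLUMNS + ", got " + columns);
        }
        return columns;
    }
}
{code}

Throwing from the mapper fails the task attempt, which is what produced the retry-and-succeed pattern visible in the attached task logs.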
--
This message was sent by Atlassian JIRA
(v6.2#6252)