[
https://issues.apache.org/jira/browse/HADOOP-10037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368809#comment-14368809
]
Steve Loughran commented on HADOOP-10037:
-----------------------------------------
Searching for the string {{Premature end of Content-Length delimited message
body}} brings up [a Stack Overflow post|
http://stackoverflow.com/questions/9952815/s3-java-client-fails-a-lot-with-premature-end-of-content-length-delimited-mess]
which attributes the exception to the S3 connection client being garbage
collected.
Looking at the handler code, it was meant to recover from this by re-opening
the connection. But an optimisation in Hadoop 2.4 (also needed to fix another
problem) turned {{seek(getPos())}} into a no-op, so some other way of
explicitly re-opening the connection is going to be needed.
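To make the failure concrete, here is a minimal, compilable sketch (assumed
names; this is not the actual NativeS3FileSystem source) of how a
same-position short-circuit in {{seek()}} defeats a recovery path built on
{{seek(getPos())}}:
{code:java}
import java.io.IOException;

// Sketch only: the error handler calls seek(getPos()) expecting a
// reconnect, but the same-position short-circuit added in Hadoop 2.4
// returns before the stream is ever re-opened.
class S3SeekSketch {
  private long pos;

  public long getPos() {
    return pos;
  }

  public void seek(long targetPos) throws IOException {
    if (targetPos == pos) {
      return;             // the 2.4 optimisation: a same-position seek
    }                     // is now a no-op, so no reconnect happens
    reopenAt(targetPos);
  }

  // Stand-in for closing the dead HTTP connection and issuing a fresh
  // ranged GET against S3.
  private void reopenAt(long targetPos) throws IOException {
    pos = targetPos;
  }

  // The recovery path: before 2.4 this forced a fresh connection; after
  // the optimisation it does nothing and the dead stream gets reused.
  public void recover() throws IOException {
    seek(getPos());
  }
}
{code}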
For now, try using {{s3a://}} as the URL scheme for the data. It has
different issues in Hadoop 2.6, but by Hadoop 2.7 it should be ready to
replace s3n completely.
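For example, reading the same object through s3a (the bucket, key and
credential values below are placeholders; in Hadoop 2.6 the credentials live
under the {{fs.s3a.*}} configuration keys):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Reading through s3a:// instead of s3n://; the bucket and object
// names here are placeholders.
public class S3aReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
    conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");
    Path path = new Path("s3a://my-bucket/path/to/part-00000");
    FileSystem fs = path.getFileSystem(conf);
    try (FSDataInputStream in = fs.open(path)) {
      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = in.read(buf)) != -1) {
        // process n bytes of buf here
      }
    }
  }
}
{code}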
> s3n read truncated, but doesn't throw exception
> ------------------------------------------------
>
> Key: HADOOP-10037
> URL: https://issues.apache.org/jira/browse/HADOOP-10037
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/s3
> Affects Versions: 2.0.0-alpha
> Environment: Ubuntu Linux 13.04 running on Amazon EC2 (cc2.8xlarge)
> Reporter: David Rosenstrauch
> Fix For: 2.6.0
>
> Attachments: S3ReadFailedOnTruncation.html, S3ReadSucceeded.html
>
>
> For months now we've been experiencing frequent data truncation issues when
> reading from S3 using the s3n:// protocol. Last night I was finally able to
> gather some debugging output on the issue in a job I ran, so I can at last
> file a bug report.
> The job I ran last night was on a 16-node cluster (all of them AWS EC2
> cc2.8xlarge machines, running Ubuntu 13.04 and Cloudera CDH4.3.0). It was a
> Hadoop streaming job which reads through a large number (~55,000) of files
> on S3, each of them approximately 300 KB in size.
> All of the files contain 46 columns of data in each record, and I added an
> extra check in my mapper code to count and verify the number of columns in
> every record, throwing an error and failing the map task if the column
> count is wrong.
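> The actual job is a streaming job, but the check is equivalent to this
> Java-Mapper sketch (the tab delimiter and output key are assumptions for
> illustration):
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
>
> // Sketch of the validation described above: count the columns of each
> // record and fail the task on a mismatch so truncation is surfaced.
> public class ColumnCountMapper
>     extends Mapper<LongWritable, Text, Text, Text> {
>   private static final int EXPECTED_COLUMNS = 46;
>
>   @Override
>   protected void map(LongWritable key, Text value, Context context)
>       throws IOException, InterruptedException {
>     // -1 keeps trailing empty fields, so short records are detected
>     String[] cols = value.toString().split("\t", -1);
>     if (cols.length != EXPECTED_COLUMNS) {
>       throw new IOException("Expected " + EXPECTED_COLUMNS
>           + " columns but got " + cols.length + " at offset " + key);
>     }
>     context.write(new Text(cols[0]), value);
>   }
> }
> {code}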
> If you look in the attached task logs, you'll see 2 attempts on the same
> task. The first one fails due to truncated data (i.e., my job intentionally
> fails the map task because the current record fails the column count check).
> The task then gets retried on a different machine and runs to a successful
> completion.
> You can see more evidence of the truncation further down in the task logs,
> where the count of records read is displayed: the failed task says 32953
> records read, while the successful task says 63133.
> Any idea what the problem might be here and/or how to work around it? This
> issue is a very common occurrence on our clusters. E.g., in the job I ran
> last night I had already encountered 8 such failures before I went to bed,
> when the job was only 10% complete (~25,000 out of ~250,000 tasks).
> I realize that it's common for I/O errors to occur - possibly even
> frequently - in a large Hadoop job. But I would think that if an I/O
> failure (like a truncated read) did occur, something in the underlying
> infrastructure code (i.e., either in NativeS3FileSystem or in jets3t)
> should detect the error and throw an IOException accordingly. It shouldn't
> be up to the calling code to detect such failures, IMO.
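> As a sketch of the kind of guard I mean (assumed names; this is not the
> actual NativeS3FileSystem code), the object stream could be wrapped so
> that an early EOF is checked against the Content-Length S3 reported:
> {code:java}
> import java.io.FilterInputStream;
> import java.io.IOException;
> import java.io.InputStream;
>
> // Sketch only: count bytes as they are read and fail loudly if EOF
> // arrives before the Content-Length S3 promised has been delivered.
> class LengthCheckingInputStream extends FilterInputStream {
>   private final long expectedLength; // Content-Length from the response
>   private long bytesRead;
>
>   LengthCheckingInputStream(InputStream in, long expectedLength) {
>     super(in);
>     this.expectedLength = expectedLength;
>   }
>
>   @Override
>   public int read() throws IOException {
>     int b = in.read();
>     if (b >= 0) {
>       bytesRead++;
>     } else {
>       checkComplete();
>     }
>     return b;
>   }
>
>   @Override
>   public int read(byte[] buf, int off, int len) throws IOException {
>     int n = in.read(buf, off, len);
>     if (n > 0) {
>       bytesRead += n;
>     } else if (n < 0) {
>       checkComplete();
>     }
>     return n;
>   }
>
>   private void checkComplete() throws IOException {
>     if (bytesRead < expectedLength) {
>       throw new IOException("Premature EOF: got " + bytesRead
>           + " of " + expectedLength + " bytes");
>     }
>   }
> }
> {code}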
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)