[ 
https://issues.apache.org/jira/browse/HADOOP-10037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968425#comment-13968425
 ] 

David Rosenstrauch commented on HADOOP-10037:
---------------------------------------------

FYI, I recently upgraded our clusters (from CDH 4.3.0 / Hadoop to ) and it 
looks like this issue might now be solved.  I'm seeing some of the tasks of our 
Hadoop jobs failing, as they should, with the following wrong-number-of-bytes-read 
exception, which then forces a retry of the task.

{code}
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 346403598; received: 15815108)
        at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:184)
        at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:204)
        at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:108)
        at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
        at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:237)
        at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:186)
        at org.apache.http.util.EntityUtils.consume(EntityUtils.java:87)
        at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.releaseConnection(HttpMethodReleaseInputStream.java:102)
        at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.close(HttpMethodReleaseInputStream.java:194)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.seek(NativeS3FileSystem.java:152)
        at org.apache.hadoop.fs.BufferedFSInputStream.seek(BufferedFSInputStream.java:89)
        at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:63)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:102)
        at com.macrosense.mapreduce.io.PingRecordReader.initialize(PingRecordReader.java:80)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:478)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:671)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)
{code}

Looks like this fix (in ContentLengthInputStream and/or EofSensorInputStream) 
was added to Apache HTTP Components and/or jets3t some time in the past few 
months.
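The check the exception points at boils down to: track how many bytes have 
actually arrived, and fail loudly if the stream ends before the declared 
Content-Length is reached.  A minimal sketch of that behavior (illustrative 
only; the class name LengthCheckedInputStream is hypothetical and this is not 
the actual HttpClient ContentLengthInputStream code):

{code}
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Wraps a stream and throws if it hits EOF before the declared
// Content-Length worth of bytes has been received.
public class LengthCheckedInputStream extends FilterInputStream {
    private final long expected;  // declared Content-Length
    private long received;        // bytes actually read so far

    public LengthCheckedInputStream(InputStream in, long expected) {
        super(in);
        this.expected = expected;
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b == -1) {
            checkComplete();
        } else {
            received++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n == -1) {
            checkComplete();
        } else {
            received += n;
        }
        return n;
    }

    // Called on EOF: a short read now surfaces as an exception instead
    // of being silently treated as a normal end of stream.
    private void checkComplete() throws IOException {
        if (received < expected) {
            throw new IOException("Premature end of Content-Length delimited "
                    + "message body (expected: " + expected
                    + "; received: " + received + ")");
        }
    }
}
{code}

With a wrapper like this in the stack, a truncated S3 read propagates up as an 
IOException (forcing a task retry) rather than looking like a clean EOF to the 
record reader, which is exactly the difference in behavior described above.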

> s3n read truncated, but doesn't throw exception 
> ------------------------------------------------
>
>                 Key: HADOOP-10037
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10037
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.0.0-alpha
>         Environment: Ubuntu Linux 13.04 running on Amazon EC2 (cc2.8xlarge)
>            Reporter: David Rosenstrauch
>         Attachments: S3ReadFailedOnTruncation.html, S3ReadSucceeded.html
>
>
> For months now we've been finding that we've been experiencing frequent data 
> truncation issues when reading from S3 using the s3n:// protocol.  I finally 
> was able to gather some debugging output on the issue in a job I ran last 
> night, and so can finally file a bug report.
> The job I ran last night was on a 16-node cluster (all of them AWS EC2 
> cc2.8xlarge machines, running Ubuntu 13.04 and Cloudera CDH4.3.0).  The job 
> was a Hadoop streaming job, which reads through a large number (i.e., 
> ~55,000) of files on S3, each of them approximately 300K bytes in size.
> All of the files contain 46 columns of data in each record.  But I added in 
> an extra check in my mapper code to count and verify the number of columns in 
> every record - throwing an error and crashing the map task if the column 
> count is wrong.
> If you look in the attached task logs, you'll see 2 attempts on the same 
> task.  The first one fails due to truncated data (i.e., my job intentionally 
> fails the map task because the current record fails the column count check). 
>  The task then gets retried on a different machine and runs to a successful 
> completion.
> You can see further evidence of the truncation further down in the task logs, 
> where it displays the count of the records read:  the failed task says 32953 
> records read, while the successful task says 63133.
> Any idea what the problem might be here and/or how to work around it?  This 
> issue is a very common occurrence on our clusters.  E.g., in the job I ran 
> last night, I had already encountered 8 such failures before I went to bed, 
> and the job was only 10% complete.  (~25,000 out of ~250,000 tasks.)
> I realize that it's common for I/O errors to occur - possibly even frequently 
> - in a large Hadoop job.  But I would think that if an I/O failure (like a 
> truncated read) did occur, that something in the underlying infrastructure 
> code (i.e., either in NativeS3FileSystem or in jets3t) should detect the 
> error and throw an IOException accordingly.  It shouldn't be up to the 
> calling code to detect such failures, IMO.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
