[jira] [Commented] (HADOOP-10037) s3n read truncated, but doesn't throw exception

2015-03-19 Thread Takenori Sato (JIRA)

[ https://issues.apache.org/jira/browse/HADOOP-10037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370292#comment-14370292 ]

Takenori Sato commented on HADOOP-10037:


David, thanks for your clarification.

I heard from Steve that my issue was introduced by some optimizations done for 
2.4.

So let me close this as FIXED. I will create a new issue for mine.

> s3n read truncated, but doesn't throw exception 
> 
>
> Key: HADOOP-10037
> URL: https://issues.apache.org/jira/browse/HADOOP-10037
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs/s3
>Affects Versions: 2.0.0-alpha
> Environment: Ubuntu Linux 13.04 running on Amazon EC2 (cc2.8xlarge)
>Reporter: David Rosenstrauch
> Fix For: 2.6.0
>
> Attachments: S3ReadFailedOnTruncation.html, S3ReadSucceeded.html
>
>
> For months now we've been experiencing frequent data 
> truncation issues when reading from S3 using the s3n:// protocol.  I finally 
> was able to gather some debugging output on the issue in a job I ran last 
> night, and so can finally file a bug report.
> The job I ran last night was on a 16-node cluster (all of them AWS EC2 
> cc2.8xlarge machines, running Ubuntu 13.04 and Cloudera CDH4.3.0).  The job 
> was a Hadoop streaming job, which reads through a large number (i.e., 
> ~55,000) of files on S3, each of them approximately 300K bytes in size.
> All of the files contain 46 columns of data in each record.  But I added in 
> an extra check in my mapper code to count and verify the number of columns in 
> every record - throwing an error and crashing the map task if the column 
> count is wrong.
> If you look in the attached task logs, you'll see 2 attempts on the same 
> task.  The first one fails due to truncated data (i.e., my job intentionally 
> fails the map task due to the current record failing the column count check). 
>  The task then gets retried on a different machine and runs to a successful 
> completion.
> You can see further evidence of the truncation further down in the task logs, 
> where it displays the count of the records read:  the failed task says 32953 
> records read, while the successful task says 63133.
> Any idea what the problem might be here and/or how to work around it?  This 
> issue is a very common occurrence on our clusters.  E.g., in the job I ran 
> last night before I had gone to bed I had already encountered 8 such 
> failures, and the job was only 10% complete.  (~25,000 out of ~250,000 tasks.)
> I realize that it's common for I/O errors to occur - possibly even frequently 
> - in a large Hadoop job.  But I would think that if an I/O failure (like a 
> truncated read) did occur, something in the underlying infrastructure 
> code (i.e., either in NativeS3FileSystem or in jets3t) should detect the 
> error and throw an IOException accordingly.  It shouldn't be up to the 
> calling code to detect such failures, IMO.
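
For illustration, a minimal sketch of the kind of per-record column check described in the report above. The class name, tab delimiter, and use of the new mapreduce API are assumptions for the sketch, not the reporter's actual streaming code:

{code}
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ColumnCountCheckMapper extends Mapper<Object, Text, Text, Text> {

  // Assumed: 46 tab-separated columns per record, per the description above
  private static final int EXPECTED_COLUMNS = 46;

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // split with limit -1 keeps trailing empty fields, so a truncated record is not padded
    String[] columns = value.toString().split("\t", -1);
    if (columns.length != EXPECTED_COLUMNS) {
      // Fail the map task; the framework then retries the attempt on another node
      throw new IOException("Expected " + EXPECTED_COLUMNS + " columns, found "
          + columns.length + " in record: " + value);
    }
    context.write(new Text(columns[0]), value);
  }
}
{code}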



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-10037) s3n read truncated, but doesn't throw exception

2015-03-19 Thread David Rosenstrauch (JIRA)

[ https://issues.apache.org/jira/browse/HADOOP-10037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14369350#comment-14369350 ]

David Rosenstrauch commented on HADOOP-10037:
-

I think what Takenori Sato is describing is a separate issue from the one I 
originally reported.  The issue I reported was resolved long ago by the fix that 
introduced the "ConnectionClosedException: Premature end of Content-Length ..." 
exception.  Prior to that fix, no exception was thrown if the socket didn't 
successfully read all the data - it would just return the incomplete data.
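
For reference, the essence of that fix is a length check on the response stream: if the stream ends before the declared Content-Length has been delivered, an exception is raised instead of the short data being returned silently. A minimal illustrative sketch of the idea (not the actual HttpComponents/jets3t code):

{code}
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustration only: wraps a response stream and fails loudly when fewer bytes
// than the declared Content-Length arrive, instead of returning short data.
public class LengthCheckingInputStream extends FilterInputStream {

  private final long expected;   // declared Content-Length
  private long received;         // bytes actually delivered so far

  public LengthCheckingInputStream(InputStream in, long contentLength) {
    super(in);
    this.expected = contentLength;
  }

  @Override
  public int read() throws IOException {
    int c = super.read();
    if (c >= 0) {
      received++;
    } else {
      checkComplete();
    }
    return c;
  }

  @Override
  public int read(byte[] b, int off, int len) throws IOException {
    int n = super.read(b, off, len);
    if (n > 0) {
      received += n;
    } else if (n == -1) {
      checkComplete();
    }
    return n;
  }

  private void checkComplete() throws EOFException {
    if (received < expected) {
      throw new EOFException("Premature end of Content-Length delimited body: expected "
          + expected + " bytes, received " + received);
    }
  }
}
{code}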






[jira] [Commented] (HADOOP-10037) s3n read truncated, but doesn't throw exception

2015-03-19 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/HADOOP-10037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368809#comment-14368809 ]

Steve Loughran commented on HADOOP-10037:
-

Searching for the string {{Premature end of Content-Length delimited message 
body}} brings up [a Stack Overflow post|
http://stackoverflow.com/questions/9952815/s3-java-client-fails-a-lot-with-premature-end-of-content-length-delimited-mess]
 attributing the exception to garbage collection of the S3 connection client.

Looking at the handler code, it was meant to fix the operation by re-opening the 
connection. But an optimisation in Hadoop 2.4 (also needed to fix another 
problem) turned seek(getPos()) into a no-op. Some other way of explicitly 
re-opening the connection is going to be needed.

For now, try using s3a:// as the URL scheme for the data. It has different issues 
in Hadoop 2.6, but by Hadoop 2.7 it should be ready to replace s3n completely.
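
A rough sketch of what an explicit re-open might look like; the class, interface and method names here are purely illustrative and not the actual NativeS3FsInputStream internals:

{code}
import java.io.IOException;
import java.io.InputStream;

// Illustrative sketch only -- not the actual NativeS3FsInputStream code.
// The point: recovery has to issue a fresh ranged GET from the current offset,
// because seek(getPos()) no longer re-opens the connection.
class ReopeningS3Input {

  /** Hypothetical source of ranged GETs against an S3 object. */
  interface S3ObjectSource {
    InputStream open(String key, long byteRangeStart) throws IOException;
  }

  private final S3ObjectSource store;
  private final String key;
  private InputStream in;
  private long pos;

  ReopeningS3Input(S3ObjectSource store, String key) throws IOException {
    this.store = store;
    this.key = key;
    this.in = store.open(key, 0);
  }

  int read(byte[] buf, int off, int len) throws IOException {
    try {
      int n = in.read(buf, off, len);
      if (n > 0) {
        pos += n;
      }
      return n;
    } catch (IOException e) {
      // Explicitly re-open from the last good offset rather than calling seek(pos)
      in.close();
      in = store.open(key, pos);
      int n = in.read(buf, off, len);
      if (n > 0) {
        pos += n;
      }
      return n;
    }
  }
}
{code}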






[jira] [Commented] (HADOOP-10037) s3n read truncated, but doesn't throw exception

2014-04-14 Thread David Rosenstrauch (JIRA)

[ https://issues.apache.org/jira/browse/HADOOP-10037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968427#comment-13968427 ]

David Rosenstrauch commented on HADOOP-10037:
-

FYI, I recently upgraded our clusters (from CDH 4.3.0 to CDH 5.0.0) and it looks 
like this issue might now be solved. I'm seeing some of the tasks of our Hadoop 
jobs failing, as they should, with the following wrong-#-of-bytes-read 
exception, which then forces a retry of the task.

{code}
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 346403598; received: 15815108)
    at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:184)
    at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:204)
    at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:108)
    at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
    at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:237)
    at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:186)
    at org.apache.http.util.EntityUtils.consume(EntityUtils.java:87)
    at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.releaseConnection(HttpMethodReleaseInputStream.java:102)
    at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.close(HttpMethodReleaseInputStream.java:194)
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.seek(NativeS3FileSystem.java:152)
    at org.apache.hadoop.fs.BufferedFSInputStream.seek(BufferedFSInputStream.java:89)
    at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:63)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:102)
    at com.macrosense.mapreduce.io.PingRecordReader.initialize(PingRecordReader.java:80)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:478)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:671)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
{code}

Looks like this fix (in ContentLengthInputStream and/or EofSensorInputStream) 
was added to Apache HttpComponents and/or jets3t some time in the past few 
months.
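
For pipelines still running on a pre-fix stack, a caller-side guard is also possible (even though, as the original description says, it really shouldn't be the caller's job): compare the bytes actually read against the length the file system reports. A minimal sketch using the standard FileSystem API; the helper class is made up for illustration:

{code}
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Caller-side guard (illustrative): read the whole object and compare the byte
// count against the length the file system reports for it.
public class VerifiedS3Read {

  public static long readFully(Path path, Configuration conf) throws IOException {
    FileSystem fs = path.getFileSystem(conf);
    long expected = fs.getFileStatus(path).getLen();
    long total = 0;
    byte[] buf = new byte[64 * 1024];
    try (InputStream in = fs.open(path)) {
      int n;
      while ((n = in.read(buf)) != -1) {
        total += n;   // process buf[0..n) here
      }
    }
    if (total != expected) {
      throw new EOFException("Truncated read of " + path + ": expected "
          + expected + " bytes, got " + total);
    }
    return total;
  }
}
{code}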
