[jira] [Resolved] (HADOOP-10037) s3n read truncated, but doesn't throw exception

2015-03-19 Thread Takenori Sato (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takenori Sato resolved HADOOP-10037.

Resolution: Fixed

The issue that caused this to be reopened turned out to be a separate issue.

 s3n read truncated, but doesn't throw exception 
 

 Key: HADOOP-10037
 URL: https://issues.apache.org/jira/browse/HADOOP-10037
 Project: Hadoop Common
  Issue Type: Bug
  Components: fs/s3
Affects Versions: 2.0.0-alpha
 Environment: Ubuntu Linux 13.04 running on Amazon EC2 (cc2.8xlarge)
Reporter: David Rosenstrauch
 Fix For: 2.6.0

 Attachments: S3ReadFailedOnTruncation.html, S3ReadSucceeded.html


 For months now we've been experiencing frequent data truncation issues when 
 reading from S3 using the s3n:// protocol.  I was finally able to gather some 
 debugging output on the issue in a job I ran last night, and so can finally 
 file a bug report.
 The job I ran last night was on a 16-node cluster (all of them AWS EC2 
 cc2.8xlarge machines, running Ubuntu 13.04 and Cloudera CDH4.3.0).  The job 
 was a Hadoop streaming job, which reads through a large number (i.e., 
 ~55,000) of files on S3, each of them approximately 300K bytes in size.
 All of the files contain 46 columns of data in each record.  But I added in 
 an extra check in my mapper code to count and verify the number of columns in 
 every record - throwing an error and crashing the map task if the column 
 count is wrong.
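
 The column check described above could look roughly like the following 
 streaming mapper. This is only a sketch, not the reporter's actual code: the 
 tab delimiter and the names ColumnCountError, check_record and run_mapper are 
 assumptions made for illustration; only the count of 46 columns comes from the 
 report.

```python
EXPECTED_COLUMNS = 46  # stated in the report
DELIMITER = "\t"       # assumption: the report does not name the delimiter


class ColumnCountError(ValueError):
    """Raised when a record does not have the expected number of columns."""


def check_record(line, line_number):
    """Split one input line and verify its column count."""
    fields = line.rstrip("\n").split(DELIMITER)
    if len(fields) != EXPECTED_COLUMNS:
        raise ColumnCountError(
            "record %d has %d columns, expected %d"
            % (line_number, len(fields), EXPECTED_COLUMNS)
        )
    return fields


def run_mapper(lines, out):
    """Validate every record, pass it through, and return the record count."""
    count = 0
    for count, line in enumerate(lines, start=1):
        fields = check_record(line, count)
        out.write(DELIMITER.join(fields) + "\n")
    return count


# As a streaming mapper script this would be driven with:
#     import sys
#     run_mapper(sys.stdin, sys.stdout)
```

 An uncaught ColumnCountError makes the script exit with a non-zero status, 
 which a streaming job would normally treat as a failed task attempt.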
 If you look in the attached task logs, you'll see 2 attempts on the same 
 task.  The first one fails due to truncated data (i.e., my job intentionally 
 fails the map task because the current record fails the column count check). 
  The task then gets retried on a different machine and runs to a successful 
 completion.
 You can see more evidence of the truncation further down in the task logs, 
 where it displays the count of the records read:  the failed task says 32953 
 records read, while the successful task says 63133.
 Any idea what the problem might be here and/or how to work around it?  This 
 issue is a very common occurrence on our clusters.  E.g., in the job I ran 
 last night, before I had gone to bed I had already encountered 8 such 
 failures, and the job was only 10% complete.  (~25,000 out of ~250,000 tasks.)
 I realize that it's common for I/O errors to occur - possibly even frequently 
 - in a large Hadoop job.  But I would think that if an I/O failure (like a 
 truncated read) did occur, that something in the underlying infrastructure 
 code (i.e., either in NativeS3FileSystem or in jets3t) should detect the 
 error and throw an IOException accordingly.  It shouldn't be up to the 
 calling code to detect such failures, IMO.
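
 One way the kind of detection asked for above could work is to compare the 
 number of bytes actually delivered with the length the object is expected to 
 have, and raise an I/O error at end of stream if they differ. The sketch below 
 illustrates that idea only; it is not the NativeS3FileSystem or jets3t 
 implementation, and LengthCheckingStream, TruncatedReadError and the 
 expected_length parameter (for example, taken from the Content-Length of the 
 response) are hypothetical names and assumptions.

```python
import io


class TruncatedReadError(IOError):
    """Raised when a stream ends before the expected number of bytes arrived."""


class LengthCheckingStream(object):
    """Wrap a file-like object and fail loudly on a short read."""

    def __init__(self, raw, expected_length):
        self._raw = raw
        self._expected = expected_length
        self._seen = 0

    def read(self, size=-1):
        data = self._raw.read(size)
        self._seen += len(data)
        # End of stream is reached either by a read-everything call or by an
        # empty result for a positive-size request.
        read_all = size is None or size < 0
        at_eof = read_all or (size > 0 and not data)
        if at_eof and self._seen < self._expected:
            raise TruncatedReadError(
                "stream ended after %d of %d bytes" % (self._seen, self._expected)
            )
        return data
```

 With a wrapper like this, a truncated read surfaces as an exception in the 
 reading code instead of looking like a normal end of file.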



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HADOOP-10037) s3n read truncated, but doesn't throw exception

2015-01-16 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran resolved HADOOP-10037.
   Resolution: Cannot Reproduce
Fix Version/s: 2.6.0

closing as Cannot Reproduce, as it appears to have gone away for you.

# Hadoop 2.6 is using a much later version of jets3t
# Hadoop 2.6 also offers a (compatible) s3a filesystem which uses the AWS SDK 
instead.

If you do see this problem again, try using s3a to see if it occurs there.
