[ https://issues.apache.org/jira/browse/HADOOP-16900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050575#comment-17050575 ]

Andrew Olson edited comment on HADOOP-16900 at 3/3/20 9:39 PM:
---------------------------------------------------------------

[[email protected]] [~gabor.bota] I think we should fail fast here without 
writing anything. The fail-fast part apparently already happens, as seen in 
distcp task failures like this:
{noformat}
19/12/26 21:45:54 INFO mapreduce.Job: Task Id : attempt_1576854935249_175694_m_000003_0, Status : FAILED
Error: java.io.IOException: File copy failed: hdfs://cluster/path/to/file.avro --> s3a://bucket/path/to/file.avro
        at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:312)
        at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:270)
        at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:52)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: Couldn't run retriable-command: Copying hdfs://cluster/path/to/file.avro to s3a://bucket/path/to/file.avro
        at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
        at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:307)
        ... 10 more
Caused by: java.lang.IllegalArgumentException: partNumber must be between 1 and 10000 inclusive, but is 10001
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:115)
        at org.apache.hadoop.fs.s3a.S3AFileSystem$WriteOperationHelper.newUploadPartRequest(S3AFileSystem.java:3086)
        at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.uploadBlockAsync(S3ABlockOutputStream.java:492)
        at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$000(S3ABlockOutputStream.java:469)
        at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.uploadCurrentBlock(S3ABlockOutputStream.java:307)
        at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.write(S3ABlockOutputStream.java:289)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyBytes(RetriableFileCopyCommand.java:299)
        at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyToFile(RetriableFileCopyCommand.java:216)
        at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:146)
        at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:116)
        at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
        ... 11 more
{noformat}
However, distcp has an optimization that skips files which already exist in the 
destination (I believe it matches a checksum of some number of initial bytes?), 
so when the task attempt was retried it found no work to do for this large 
file, and the distcp job ultimately succeeded. I should probably open a 
separate issue to have distcp also compare file lengths before deciding that a 
path already present in the target location can be safely skipped; a minimal 
sketch of that check follows.
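
Purely for illustration, here's the kind of length check I have in mind. The 
class and method names are made up for this comment and this is not distcp's 
actual skip logic; it just shows the idea of refusing to skip a target file 
whose length doesn't match the source.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, not part of distcp: only treat an already-present target
// file as safe to skip when its length matches the source, so a truncated
// earlier copy forces a re-copy instead of being silently accepted.
public class LengthAwareSkipCheck {

  static boolean safeToSkip(Path source, Path target, Configuration conf) throws Exception {
    FileSystem sourceFs = source.getFileSystem(conf);
    FileSystem targetFs = target.getFileSystem(conf);
    if (!targetFs.exists(target)) {
      return false; // nothing at the target yet, so the file must be copied
    }
    long sourceLen = sourceFs.getFileStatus(source).getLen();
    long targetLen = targetFs.getFileStatus(target).getLen();
    return sourceLen == targetLen; // a length mismatch means the earlier copy was truncated
  }
}
{code}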

We ran into this because we had tuned {{fs.s3a.multipart.size}} down to 25M. 
That effectively imposed a 250 GB limit on our S3A file writes due to the 
10,000-part limit on the AWS side. Adjusting the multipart part size to a 
larger value (100M) got us past this since all our files are under 1 TB.
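
As a quick sanity check (a standalone arithmetic sketch, not code from Hadoop), 
the ceiling is just the part size times the 10,000-part limit, assuming the "M" 
suffix is parsed as MiB; the 25M case works out to exactly the 
262144000000-byte truncated size shown below.
{code:java}
// Standalone sketch: maximum S3 object size = multipart part size * 10,000 parts.
public class MultipartSizeCeiling {
  private static final long MAX_PARTS = 10_000L; // S3 multipart upload part-count limit

  public static void main(String[] args) {
    long partSize25M = 25L * 1024 * 1024;    // fs.s3a.multipart.size = 25M
    long partSize100M = 100L * 1024 * 1024;  // fs.s3a.multipart.size = 100M
    System.out.println(MAX_PARTS * partSize25M);  // 262144000000 bytes (~244 GiB ceiling)
    System.out.println(MAX_PARTS * partSize100M); // 1048576000000 bytes (~976 GiB ceiling)
  }
}
{code}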

Here's the evidence of the issue that we recorded, noting the unexpected file 
size difference of 289789841890 vs. 262144000000 bytes (the truncated size is 
exactly 10,000 parts × the 25M part size), with names changed to protect the 
innocent.
{noformat}
$ hdfs dfs -ls /path/to/file.avro
-rwxrwxr-x   3 user group 289789841890 2016-12-20 06:37 /path/to/file.avro

$ hdfs dfs -conf conf.xml -ls s3a://bucket/path/to/file.avro
-rw-rw-rw-   1 user 262144000000 2019-12-26 21:45 s3a://bucket/path/to/file.avro
{noformat}



> Very large files can be truncated when written through S3AFileSystem
> --------------------------------------------------------------------
>
>                 Key: HADOOP-16900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16900
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.2.1
>            Reporter: Andrew Olson
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: s3
>
> If a written file's size exceeds 10,000 * {{fs.s3a.multipart.size}}, a corrupt 
> truncation of the S3 object will occur, because the maximum number of parts in 
> a multipart upload is 10,000 as specified by the S3 API, and there is an 
> apparent bug where this failure is not fatal: the multipart upload is still 
> allowed to be marked as completed.


