[ 
https://issues.apache.org/jira/browse/BEAM-3681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372094#comment-16372094
 ] 

Ismaël Mejía edited comment on BEAM-3681 at 2/22/18 8:35 AM:
-------------------------------------------------------------

In addition to this the current copy method is slightly inefficient, because it 
uses the more robust MultiPart API that supports bigger than 5GB files, but 
pays the price of requiring at least 3 requests to copy every smaller file 
instead of 1 that is the case of the basic s3client.copyObject API.


was (Author: iemejia):
In addition to this the current copy method is slightly inefficient, because it 
uses the more robust MultiPart API that supports bigger than 5GB files, but 
pays the price of requiring 3 requests to copy every smaller file instead of 1 
that is the case of the basic s3client.copyObject API.

> S3Filesystem fails when copying empty files
> -------------------------------------------
>
>                 Key: BEAM-3681
>                 URL: https://issues.apache.org/jira/browse/BEAM-3681
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-aws
>    Affects Versions: 2.3.0
>            Reporter: Ismaël Mejía
>            Assignee: Ismaël Mejía
>            Priority: Major
>             Fix For: 2.4.0
>
>
> When executing a simple write on S3 with the direct runner. It breaks 
> sometimes when it ends up trying to write 'empty' shards to S3.
> {code:java}
> Pipeline pipeline = Pipeline.create(options);
> pipeline
>  .apply("CreateSomeData", Create.of("1", "2", "3"))
>  .apply("WriteToFS", TextIO.write().to(options.getOutput()));
> pipeline.run();{code}
> The related exception is:
> {code:java}
> Exception in thread "main" 
> org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.io.IOException: 
> com.amazonaws.services.s3.model.AmazonS3Exception: The XML you provided was 
> not well-formed or did not validate against our published schema (Service: 
> Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID: 
> 402E99C2F602AD09; S3 Extended Request ID: 
> SDdU8AqW2mfZuG1xcKUSNeHiR0IUKcRCpZ1Wjx7sAor1CdYf8f+0dDIcQpvr3GXgqwsyk5PGWVE=),
>  S3 Extended Request ID: 
> SDdU8AqW2mfZuG1xcKUSNeHiR0IUKcRCpZ1Wjx7sAor1CdYf8f+0dDIcQpvr3GXgqwsyk5PGWVE=
>     at 
> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:342)
>     at 
> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:312)
>     at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:206)
>     at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:62)
>     at org.apache.beam.sdk.Pipeline.run(Pipeline.java:311)
>     at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
>     at 
> org.apache.beam.samples.ingest.amazon.IngestToS3.main(IngestToS3.java:82)
> Caused by: java.io.IOException: 
> com.amazonaws.services.s3.model.AmazonS3Exception: The XML you provided was 
> not well-formed or did not validate against our published schema (Service: 
> Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID: 
> 402E99C2F602AD09; S3 Extended Request ID: 
> SDdU8AqW2mfZuG1xcKUSNeHiR0IUKcRCpZ1Wjx7sAor1CdYf8f+0dDIcQpvr3GXgqwsyk5PGWVE=),
>  S3 Extended Request ID: 
> SDdU8AqW2mfZuG1xcKUSNeHiR0IUKcRCpZ1Wjx7sAor1CdYf8f+0dDIcQpvr3GXgqwsyk5PGWVE=
>     at org.apache.beam.sdk.io.aws.s3.S3FileSystem.copy(S3FileSystem.java:563)
>     at 
> org.apache.beam.sdk.io.aws.s3.S3FileSystem.lambda$copy$4(S3FileSystem.java:495)
>     at 
> org.apache.beam.sdk.io.aws.s3.S3FileSystem.lambda$callTasks$8(S3FileSystem.java:642)
>     at 
> org.apache.beam.sdk.util.MoreFutures.lambda$supplyAsync$0(MoreFutures.java:100)
>     at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
> Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: The XML you 
> provided was not well-formed or did not validate against our published schema 
> (Service: Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID: 
> 402E99C2F602AD09; S3 Extended Request ID: 
> SDdU8AqW2mfZuG1xcKUSNeHiR0IUKcRCpZ1Wjx7sAor1CdYf8f+0dDIcQpvr3GXgqwsyk5PGWVE=),
>  S3 Extended Request ID: 
> SDdU8AqW2mfZuG1xcKUSNeHiR0IUKcRCpZ1Wjx7sAor1CdYf8f+0dDIcQpvr3GXgqwsyk5PGWVE=
>     at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1639)
>     at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
>     at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1056)
>     at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
>     at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
>     at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
>     at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
>     at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
>     at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
>     at 
> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4325)
>     at 
> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4272)
>     at 
> com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:3065)
>     at org.apache.beam.sdk.io.aws.s3.S3FileSystem.copy(S3FileSystem.java:561)
>     at 
> org.apache.beam.sdk.io.aws.s3.S3FileSystem.lambda$copy$4(S3FileSystem.java:495)
>     at 
> org.apache.beam.sdk.io.aws.s3.S3FileSystem.lambda$callTasks$8(S3FileSystem.java:642)
>     at 
> org.apache.beam.sdk.util.MoreFutures.lambda$supplyAsync$0(MoreFutures.java:100)
>     at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748){code}
>  After further investigation I found that the output of FileBasedSink can 
> produce empty files, but the copy method of S3FileSystem breaks when trying 
> to copy an empty file.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to