[ https://issues.apache.org/jira/browse/BEAM-3681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375031#comment-16375031 ]
Ismaël Mejía commented on BEAM-3681:
------------------------------------

It was quite tricky to find; I found it by pure luck while testing the write for the first time with exactly the same example I mentioned.

> S3Filesystem fails when copying empty files
> -------------------------------------------
>
>                 Key: BEAM-3681
>                 URL: https://issues.apache.org/jira/browse/BEAM-3681
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-aws
>    Affects Versions: 2.3.0
>            Reporter: Ismaël Mejía
>            Assignee: Ismaël Mejía
>            Priority: Major
>             Fix For: 2.4.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> A simple write to S3 with the direct runner sometimes breaks when it ends
> up trying to write 'empty' shards to S3.
> {code:java}
> Pipeline pipeline = Pipeline.create(options);
> pipeline
>     .apply("CreateSomeData", Create.of("1", "2", "3"))
>     .apply("WriteToFS", TextIO.write().to(options.getOutput()));
> pipeline.run();
> {code}
> The related exception is:
> {code:java}
> Exception in thread "main"
> org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.io.IOException:
> com.amazonaws.services.s3.model.AmazonS3Exception: The XML you provided was
> not well-formed or did not validate against our published schema (Service:
> Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID:
> 402E99C2F602AD09; S3 Extended Request ID:
> SDdU8AqW2mfZuG1xcKUSNeHiR0IUKcRCpZ1Wjx7sAor1CdYf8f+0dDIcQpvr3GXgqwsyk5PGWVE=),
> S3 Extended Request ID:
> SDdU8AqW2mfZuG1xcKUSNeHiR0IUKcRCpZ1Wjx7sAor1CdYf8f+0dDIcQpvr3GXgqwsyk5PGWVE=
>   at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:342)
>   at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:312)
>   at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:206)
>   at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:62)
>   at org.apache.beam.sdk.Pipeline.run(Pipeline.java:311)
>   at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
>   at org.apache.beam.samples.ingest.amazon.IngestToS3.main(IngestToS3.java:82)
> Caused by: java.io.IOException:
> com.amazonaws.services.s3.model.AmazonS3Exception: The XML you provided was
> not well-formed or did not validate against our published schema (Service:
> Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID:
> 402E99C2F602AD09; S3 Extended Request ID:
> SDdU8AqW2mfZuG1xcKUSNeHiR0IUKcRCpZ1Wjx7sAor1CdYf8f+0dDIcQpvr3GXgqwsyk5PGWVE=),
> S3 Extended Request ID:
> SDdU8AqW2mfZuG1xcKUSNeHiR0IUKcRCpZ1Wjx7sAor1CdYf8f+0dDIcQpvr3GXgqwsyk5PGWVE=
>   at org.apache.beam.sdk.io.aws.s3.S3FileSystem.copy(S3FileSystem.java:563)
>   at org.apache.beam.sdk.io.aws.s3.S3FileSystem.lambda$copy$4(S3FileSystem.java:495)
>   at org.apache.beam.sdk.io.aws.s3.S3FileSystem.lambda$callTasks$8(S3FileSystem.java:642)
>   at org.apache.beam.sdk.util.MoreFutures.lambda$supplyAsync$0(MoreFutures.java:100)
>   at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
> Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: The XML you
> provided was not well-formed or did not validate against our published schema
> (Service: Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID:
> 402E99C2F602AD09; S3 Extended Request ID:
> SDdU8AqW2mfZuG1xcKUSNeHiR0IUKcRCpZ1Wjx7sAor1CdYf8f+0dDIcQpvr3GXgqwsyk5PGWVE=),
> S3 Extended Request ID:
> SDdU8AqW2mfZuG1xcKUSNeHiR0IUKcRCpZ1Wjx7sAor1CdYf8f+0dDIcQpvr3GXgqwsyk5PGWVE=
>   at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1639)
>   at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
>   at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1056)
>   at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
>   at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
>   at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
>   at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
>   at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
>   at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
>   at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4325)
>   at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4272)
>   at com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:3065)
>   at org.apache.beam.sdk.io.aws.s3.S3FileSystem.copy(S3FileSystem.java:561)
>   at org.apache.beam.sdk.io.aws.s3.S3FileSystem.lambda$copy$4(S3FileSystem.java:495)
>   at org.apache.beam.sdk.io.aws.s3.S3FileSystem.lambda$callTasks$8(S3FileSystem.java:642)
>   at org.apache.beam.sdk.util.MoreFutures.lambda$supplyAsync$0(MoreFutures.java:100)
>   at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> After further investigation I found that FileBasedSink can produce empty
> files, but the copy method of S3FileSystem breaks when trying to copy an
> empty file.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
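
The trace above ends in completeMultipartUpload, which suggests (this is a reading of the trace, not something the report states) that a zero-byte source yields no parts at all, so the completion request carries an empty part list that S3 appears to reject as MalformedXML. The sketch below only illustrates that arithmetic in plain Java, without the AWS SDK. CopyPlanSketch, PartRange, partRanges and needsSingleRequestCopy are hypothetical names made up for this example; this is not the code in S3FileSystem and not necessarily how the issue was fixed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: how a multipart copy plan degenerates for an empty object. */
public class CopyPlanSketch {

  /** Inclusive byte range covered by one part of a multipart copy. */
  public static final class PartRange {
    public final long firstByte;
    public final long lastByte;

    PartRange(long firstByte, long lastByte) {
      this.firstByte = firstByte;
      this.lastByte = lastByte;
    }
  }

  /** Splits an object of objectSize bytes into ranges of at most partSize bytes. */
  public static List<PartRange> partRanges(long objectSize, long partSize) {
    if (partSize <= 0) {
      throw new IllegalArgumentException("partSize must be positive");
    }
    List<PartRange> ranges = new ArrayList<>();
    // For objectSize == 0 the loop body never runs, so the plan has no parts.
    for (long start = 0; start < objectSize; start += partSize) {
      long end = Math.min(start + partSize, objectSize) - 1;
      ranges.add(new PartRange(start, end));
    }
    return ranges;
  }

  /**
   * Assumed guard: when the plan has no parts, a caller would skip the
   * multipart path and copy the object in a single request instead.
   */
  public static boolean needsSingleRequestCopy(long objectSize, long partSize) {
    return partRanges(objectSize, partSize).isEmpty();
  }

  public static void main(String[] args) {
    long partSize = 5; // tiny part size, for illustration only
    System.out.println("parts for 12 bytes: " + partRanges(12, partSize).size());
    System.out.println("parts for 0 bytes:  " + partRanges(0, partSize).size());
    System.out.println("single-request copy for 0 bytes: "
        + needsSingleRequestCopy(0, partSize));
  }
}
```

A guard of this kind would sit in front of the multipart path, so an empty shard never reaches the completion call with nothing to complete.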