[ 
https://issues.apache.org/jira/browse/BEAM-3268?focusedWorklogId=92099&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92099
 ]

ASF GitHub Bot logged work on BEAM-3268:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 18/Apr/18 12:05
            Start Date: 18/Apr/18 12:05
    Worklog Time Spent: 10m 
      Work Description: lgajowy commented on a change in pull request #5159: 
[BEAM-3268] Reshuffle filenames before returning them from WriteFilesResult
URL: https://github.com/apache/beam/pull/5159#discussion_r182399377
 
 

 ##########
 File path: 
sdks/java/core/src/test/java/org/apache/beam/sdk/io/TextIOWriteTest.java
 ##########
 @@ -382,6 +383,10 @@ public Void apply(Iterable<String> values) {
           matches.add(match.resourceId().toString());
         }
         assertThat(values, containsInAnyOrder(Iterables.toArray(matches, String.class)));
+        // Verify that files exist.
+        for (String filename : values) {
+          FileSystems.match(filename, EmptyMatchTreatment.DISALLOW);
+        }
 
 Review comment:
   Wouldn't it be better to have it tested in `WriteFilesTest` instead of 
`TextIOWriteTest`? 
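
For illustration only (not part of the pull request): the loop added in the diff appears to rely on `FileSystems.match` with `EmptyMatchTreatment.DISALLOW` failing when a reported filename matches nothing. A minimal sketch of the same idea using only `java.nio`, so it runs without Beam on the classpath, might look like the following. The class name `OutputFileCheck` and the method `assertAllExist` are hypothetical, and local paths stand in for the resource IDs a real pipeline would report.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;

// Hypothetical helper: a plain java.nio analog of the check in the diff,
// not Beam's FileSystems API.
final class OutputFileCheck {

  /** Throws FileNotFoundException for the first reported filename that is not on disk. */
  static void assertAllExist(Iterable<String> filenames) throws FileNotFoundException {
    for (String filename : filenames) {
      if (!Files.exists(Paths.get(filename))) {
        throw new FileNotFoundException("Reported output file does not exist: " + filename);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // A file that really was written passes the check.
    Path written = Files.createTempFile("output", ".txt");
    assertAllExist(Collections.singletonList(written.toString()));

    // Once the file is gone, the same filename must be rejected.
    Files.delete(written);
    try {
      assertAllExist(Collections.singletonList(written.toString()));
      throw new AssertionError("expected FileNotFoundException");
    } catch (FileNotFoundException expected) {
      System.out.println("missing file detected");
    }
  }
}
```

A check of this kind fails as soon as a filename is reported before the corresponding file has been finalized, which is the symptom described in BEAM-3268.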

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 92099)
    Time Spent: 50m  (was: 40m)

> getPerDestinationOutputFilenames() is getting processed before write is 
> finished on dataflow runner
> ---------------------------------------------------------------------------------------------------
>
>                 Key: BEAM-3268
>                 URL: https://issues.apache.org/jira/browse/BEAM-3268
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-dataflow
>    Affects Versions: 2.3.0
>            Reporter: Kamil Szewczyk
>            Assignee: Eugene Kirpichov
>            Priority: Major
>         Attachments: comparison.png
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> While running filebased-io-test we found the Dataflow runner misbehaving. We run 
> the tests in a single pipeline, and without a Reshuffle between writing and 
> reading, the Dataflow jobs fail because the runner tries to access files 
> that have not been created yet. 
> The attached picture compares the execution of the write step in both cases: 
> on the left is the working example with Reshuffle added, and on the right the 
> one without it.
> !comparison.png|thumbnail!
> Steps to reproduce: substitute your-bucket-name with your valid bucket.
> {code:java}
> mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests 
> -DintegrationTestPipelineOptions='["--runner=dataflow", 
> "--filenamePrefix=gs://your-bucket-name/TEXTIO_IT"]' -Pdataflow-runner
> {code}
> Then look at the job in the cloud console; it should fail.
> Now add a Reshuffle to 
> sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
>  as in the example.
> {code:java}
> .getPerDestinationOutputFilenames().apply(Values.<String>create())
>         .apply(Reshuffle.<String>viaRandomKey());
>     PCollection<String> consolidatedHashcode = testFilenames
> {code}
> and run the same Maven command again; the job should now succeed in the 
> console.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
