lostluck commented on pull request #15482:
URL: https://github.com/apache/beam/pull/15482#issuecomment-928033815


   I've determined the cause of the failures, which were only visible because of 
this PR. It turns out the wordcount test is bad.
   
   Here's what's happening: 
   1. As written, the wordcount test [writes the test data to an in-memory 
filesystem](https://github.com/apache/beam/blob/master/sdks/go/test/integration/wordcount/wordcount_test.go#L83).
 This filesystem is only available in-process, so the data will only be 
visible when executing on the direct runner or in LOOPBACK mode. 
   
   2. The question then becomes: why doesn't the test fail if there are no 
files? The problem is in how [textio is 
implemented](https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/io/textio/textio.go#L76):
 it uses the filename as a glob to list matching files, and then emits them. 
It doesn't treat "no files" as a failure case. Whether that's actually a bug is a 
separate concern.
   3. Either way, no file names are sent downstream, so subsequent DoFns 
never execute and produce no counters at all.
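
   As a rough analogy for point 2 (using the standard library's `filepath.Glob` rather than textio itself, so this is an illustration and not textio's actual code path), a glob that matches nothing is typically reported as an empty result with no error. The `countMatches` helper and the pattern below are made up for the sketch:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// countMatches is a hypothetical helper: it reports how many paths
// match pattern, passing through any error from filepath.Glob.
func countMatches(pattern string) (int, error) {
	matches, err := filepath.Glob(pattern)
	if err != nil {
		// Glob only returns an error for a malformed pattern.
		return 0, err
	}
	return len(matches), nil
}

func main() {
	// Assumes no directory with this name exists in the working directory.
	n, err := countMatches("no-such-dir-for-this-sketch/*.txt")
	fmt.Printf("matches=%d err=%v\n", n, err)
}
```

   A caller that only checks `err` sees success here, which is the same shape of problem as a read step that quietly emits zero elements.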
   
   My tip-off was that clearly the pipelines were executing, and Flink was 
returning counters for all the PCollections, but not for any PTransforms (or at 
least, not for extract).
   
   The correct move for now is to create a new `WordCountFromPCol` function in 
the integration version of the wordcount package. It will do everything after 
the `textio.Read` in the `WordCount` function, but take a scope and a 
PCollection as input instead of the glob. The existing `WordCount` should call 
this new function rather than duplicating the logic.
   
   In the tests, instead of writing the data to an in-memory file, we create it 
with `beam.Create` (or `beam.CreateList`) and pass in the resulting PCollection. 
At that point the test should operate properly on all runners.

