lostluck commented on pull request #15482: URL: https://github.com/apache/beam/pull/15482#issuecomment-928033815
I've determined the cause of the failures, which were only visible due to this PR. It turns out the wordcount test is bad. Here's what's happening:

1. As written, the wordcount test [writes the test data to an in memory filesystem](https://github.com/apache/beam/blob/master/sdks/go/test/integration/wordcount/wordcount_test.go#L83). This file system is only available in-process, which means it will only be available when executing on the direct runner, or in LOOPBACK mode.
2. The question then becomes: why doesn't the test fail if there are no files? The problem is in how the [textio is implemented](https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/io/textio/textio.go#L76): it uses the filename as a glob to list matching files, and then emits them. It doesn't see "no files" as a failure case. Whether that's actually a bug is a separate concern.
3. Either way, no file names are sent downstream, meaning subsequent DoFns don't execute, leading to no counters at all.

My tip-off was that the pipelines were clearly executing, and Flink was returning counters for all the PCollections, but not for any PTransforms (or at least, not for extract).

The correct move for now is to create a new `WordCountFromPCol` function in the integration version of the wordcount package. It will do everything after the `textio.Read` in the WordCount function, but take in a scope and a PCollection as input instead of the glob. The existing WordCount should call this new function instead of having everything duplicated.

In the tests, instead of writing the data to an in memory file, we create it with `beam.Create` (or `beam.CreateList`) and pass in the resulting PCollection. At that point the test should operate properly for all runners.
