[ https://issues.apache.org/jira/browse/BEAM-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15977752#comment-15977752 ]
ASF GitHub Bot commented on BEAM-1867:
--------------------------------------

GitHub user kennknowles opened a pull request:

    https://github.com/apache/beam/pull/2618

    [BEAM-1867] Use step-derived PCollection names in Dataflow

    Be sure to do all of the following to help us incorporate your contribution
    quickly and easily:

     - [ ] Make sure the PR title is formatted like:
       `[BEAM-<Jira issue #>] Description of pull request`
     - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
       Travis-CI on your fork and ensure the whole test matrix passes).
     - [ ] Replace `<Jira issue #>` in the title with the actual Jira issue
       number, if there is one.
     - [ ] If this contribution is large, please file an Apache
       [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

    ---

    R: @bjchambers

    This mitigates an issue in Dataflow. I also removed some checked exceptions
    that are never caught and probably never should be.

    I have empirically checked that the element counts and byte sizes are
    restored by this change, and added unit tests to the translator.
    Integration tests TBD.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kennknowles/beam Dataflow-PCollection-names

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/2618.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2618

----
commit 4c0bdd6c002b83c67daedd5e01ee2ad0dd47c233
Author: Kenneth Knowles <k...@google.com>
Date:   2017-04-20T21:32:29Z

    Make crashing errors in Structs unchecked exceptions

commit c9ed8f9a69d2b3f17e782f4bd0da9bd4305f2320
Author: Kenneth Knowles <k...@google.com>
Date:   2017-04-20T22:32:51Z

    Derive Dataflow output names from steps, not PCollection names

    Long ago, PCollection names were assigned after transform replacements took
    place, because this happened interleaved with pipeline construction.

    Now, runner-independent graphs are constructed with named PCollections and
    when replacements occur, the names are preserved. This exposed a bug in
    Dataflow whereby the names of steps and the names of PCollections are
    tightly coupled.

    This change uses the mandatory derived names during translation, shielding
    users from the bug.

----


> Element counts missing on Cloud Dataflow when PCollection has anything other
> than hardcoded name pattern
> --------------------------------------------------------------------------------------------------------
>
>                 Key: BEAM-1867
>                 URL: https://issues.apache.org/jira/browse/BEAM-1867
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-dataflow
>            Reporter: Kenneth Knowles
>            Assignee: Kenneth Knowles
>            Priority: Blocker
>             Fix For: First stable release
>
>
> In 0.6.0 and 0.7.0-SNAPSHOT (and possibly all past versions, these are just
> those where it is confirmed) element count and byte metrics are not reported
> correctly when the output PCollection for a primitive transform is not
> {{transformname + ".out" + index}}.
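
To make the naming pattern quoted above concrete, here is a minimal Java sketch of deriving an output name from a step name instead of trusting the PCollection's own name. It is an illustration only, not the code in the pull request: the class `OutputNameSketch` and both of its methods are hypothetical names, and the sketch assumes the index is always appended, following the pattern literally as written in the issue.

```java
import java.util.Objects;

/** Hypothetical helper for illustration; not part of Beam or the Dataflow runner. */
final class OutputNameSketch {

  /** Builds the pattern quoted in the issue: transformname + ".out" + index. */
  static String derivedOutputName(String stepName, int outputIndex) {
    Objects.requireNonNull(stepName, "stepName");
    if (outputIndex < 0) {
      throw new IllegalArgumentException("outputIndex must be non-negative");
    }
    // Assumption: the index is always appended, as the pattern is written.
    return stepName + ".out" + outputIndex;
  }

  /** True when a PCollection's own name already agrees with the step-derived name. */
  static boolean matchesDerivedName(String pcollectionName, String stepName, int outputIndex) {
    return derivedOutputName(stepName, outputIndex).equals(pcollectionName);
  }

  public static void main(String[] args) {
    String step = "ParDo(MyDoFn)";
    // A name a user might have assigned with PCollection.setName.
    String userName = "MeaningfulName";
    // The user-chosen name does not follow the pattern ...
    System.out.println(matchesDerivedName(userName, step, 0));
    // ... so a translator would emit the step-derived name instead.
    System.out.println(derivedOutputName(step, 0));
  }
}
```

Deriving the name from the step at translation time means a user-assigned or replacement-preserved PCollection name cannot break the coupling that the commit message describes.
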
> In 0.7.0-SNAPSHOT, the DataflowRunner uses pipeline surgery to replace the
> composite {{ParDoSingle}} (that contains a {{ParDoMulti}}) with a
> Dataflow-specific non-composite {{ParDoSingle}}. So metrics are reported for
> names like {{"ParDoSingle(MyDoFn).out"}} when they should be reported for
> {{"ParDoSingle/ParDoMulti(MyDoFn).out"}}. So all single-output ParDo
> transforms lack these metrics on their outputs.
> In 0.6.0 the same problem occurs if the user ever uses
> {{PCollection.setName}} to give their collection a meaningful name.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)