[
https://issues.apache.org/jira/browse/BEAM-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15977752#comment-15977752
]
ASF GitHub Bot commented on BEAM-1867:
--------------------------------------
GitHub user kennknowles opened a pull request:
https://github.com/apache/beam/pull/2618
[BEAM-1867] Use step-derived PCollection names in Dataflow
---
R: @bjchambers
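This mitigates the Dataflow issue described in BEAM-1867, where element
counts and byte sizes go missing when a PCollection name does not match the
step-derived pattern. I also removed some checked exceptions that are never
caught and probably never should be.
I have empirically checked that this change restores the element counts and
byte sizes, and I added unit tests to the translator. Integration tests are
still to be done.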
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kennknowles/beam Dataflow-PCollection-names
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/beam/pull/2618.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2618
----
commit 4c0bdd6c002b83c67daedd5e01ee2ad0dd47c233
Author: Kenneth Knowles <[email protected]>
Date: 2017-04-20T21:32:29Z
Make crashing errors in Structs unchecked exceptions
commit c9ed8f9a69d2b3f17e782f4bd0da9bd4305f2320
Author: Kenneth Knowles <[email protected]>
Date: 2017-04-20T22:32:51Z
Derive Dataflow output names from steps, not PCollection names
Long ago, PCollection names were assigned after transform replacements took
place, because replacement was interleaved with pipeline construction. Now,
runner-independent graphs are constructed with named PCollections, and when
replacements occur, the names are preserved. This exposed a bug in Dataflow
whereby the names of steps and the names of PCollections are tightly
coupled.
This change uses the mandatory step-derived names during translation,
shielding users from the bug.
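As a rough illustration of the step-derived naming described above (not the
actual translator code), the sketch below builds output names from the step
name alone. It follows the `transformname + ".out" + index` pattern quoted in
BEAM-1867. The class and method names are hypothetical, and appending the
index for every output is an assumption: the example in the issue shows a
plain `.out` suffix for a single-output ParDo.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the DataflowRunner implementation. */
public class StepDerivedNames {

  /**
   * Derives one output name per output index from the step name.
   * Assumption: every output gets an index suffix, per the pattern in BEAM-1867.
   */
  public static List<String> outputNames(String stepName, int outputCount) {
    List<String> names = new ArrayList<>();
    for (int index = 0; index < outputCount; index++) {
      names.add(stepName + ".out" + index);
    }
    return names;
  }

  public static void main(String[] args) {
    // The user-visible PCollection name is deliberately not an input here,
    // so PCollection.setName cannot change what gets sent to the service.
    for (String name : outputNames("ParDoSingle(MyDoFn)", 2)) {
      System.out.println(name);
    }
  }
}
```

Because the derived name depends only on the step, renaming a PCollection or
preserving its name across a transform replacement leaves the translated
output name unchanged.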
----
> Element counts missing on Cloud Dataflow when PCollection has anything other
> than hardcoded name pattern
> --------------------------------------------------------------------------------------------------------
>
> Key: BEAM-1867
> URL: https://issues.apache.org/jira/browse/BEAM-1867
> Project: Beam
> Issue Type: Bug
> Components: runner-dataflow
> Reporter: Kenneth Knowles
> Assignee: Kenneth Knowles
> Priority: Blocker
> Fix For: First stable release
>
>
> In 0.6.0 and 0.7.0-SNAPSHOT (the versions where this is confirmed; possibly
> all past versions too), element count and byte metrics are not reported
> correctly when the output PCollection for a primitive transform is not named
> {{transformname + ".out" + index}}.
> In 0.7.0-SNAPSHOT, the DataflowRunner uses pipeline surgery to replace the
> composite {{ParDoSingle}} (which contains a {{ParDoMulti}}) with a
> Dataflow-specific non-composite {{ParDoSingle}}. Metrics are then reported for
> names like {{"ParDoSingle(MyDoFn).out"}} when they should be reported for
> {{"ParDoSingle/ParDoMulti(MyDoFn).out"}}. As a result, all single-output ParDo
> transforms lack these metrics on their outputs.
> In 0.6.0 the same problem occurs whenever the user calls
> {{PCollection.setName}} to give their collection a meaningful name.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)