[ 
https://issues.apache.org/jira/browse/BEAM-7709?focusedWorklogId=274396&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-274396
 ]

ASF GitHub Bot logged work on BEAM-7709:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 09/Jul/19 21:18
            Start Date: 09/Jul/19 21:18
    Worklog Time Spent: 10m 
      Work Description: lostluck commented on issue #9015: [BEAM-7709] Re-use 
node for explicit flattens
URL: https://github.com/apache/beam/pull/9015#issuecomment-509813361
 
 
   That's right! The "shortcut" on line 320 avoids repeating work in case 
something else is using that PCollection. Good ol' 
[memoization](https://en.wikipedia.org/wiki/Memoization).
   Before it would, get a link from a PCollection, and just create a flatten 
*per* PCollection. As Luke mentions, this is totally fine if the runner "split" 
different upstream DoFns into different bundle executions, which naturally 
leads to multiple instances of the downstream DoFn. That's fine*. While it's 
unnecessary to have the flatten unit in that instance, it's not wrong.
   However, if the two pcollections are in the *same* bundle execution, and 
there's only a single downstream consumer for the flatten, having multiple 
flatten nodes like the code was doing before is wrong, since the SDK side 
flatten has one job essentially: to ensure that the downstream StartBundles and 
FinishBundles are only ever called once per bundle. Naturally, this doesn't 
work with multiple instances of the same flatten node.
   So the solution is to ensure all the inputs to a given Flatten Unit are 
mapped to a the same instance of that flatten unit.
   
   *except as, Luke mentioned, for stateful processing, where we'll need to be 
sure if we're creating multiple instances of the same DoFn, that they're passed 
the same "state holder".
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 274396)
    Time Spent: 1.5h  (was: 1h 20m)

> Flattening multiple outputs of a ParDoN fails
> ---------------------------------------------
>
>                 Key: BEAM-7709
>                 URL: https://issues.apache.org/jira/browse/BEAM-7709
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-go
>    Affects Versions: Not applicable
>            Reporter: Robert Burke
>            Assignee: Robert Burke
>            Priority: Major
>             Fix For: Not applicable
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> If a user does a beam.ParDoN for pardo > 2  and then passes one or more of 
> the outputs to a flatten, then if the flatten occurs SDK side, it currently 
> creates multiple flatten nodes, which then triggers the downstream pardo (the 
> DoFn that consumes the Flatten's output) to be initialized multiple times for 
> a single bundle.
> The fix is to pre-emptively populate the input links with the first created 
> flatten, so subsequent tracings of the plan use the same flatten node the 
> same way the Go direct runner does[1]. That would happen in the exec 
> translate code.
> [[1] 
> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/direct/direct.go#L299|https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/direct/direct.go#L299]
> [[2] 
> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/translate.go#L493|https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/translate.go#L493]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to