[jira] [Work logged] (BEAM-10670) Make non-portable Splittable DoFn the only option when executing Java "Read" transforms

ASF GitHub Bot (Jira) Mon, 05 Oct 2020 13:46:13 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-10670?focusedWorklogId=495574&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495574
 ]


ASF GitHub Bot logged work on BEAM-10670:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 05/Oct/20 20:45
            Start Date: 05/Oct/20 20:45
    Worklog Time Spent: 10m 
      Work Description: kennknowles commented on pull request #13006:
URL: https://github.com/apache/beam/pull/13006#issuecomment-703878371


   > @kennknowles
   > 
   > > I did a first pass. I think big picture there are two things:
   > > 
   > > 1. Without unfolding all the code not in the code review, I am not 
already familiar with the expected level of SDF support of all the runners. 
Some of them seem to previously reject non-SDF reads but now could end up 
having non-SDF reads.
   > 
   > [BEAM-10670](https://issues.apache.org/jira/browse/BEAM-10670) is about 
migrating all runners to use SDF first and only SDF. All runners that supported 
BoundedSource/UnboundedSource support SDF to the level necessary to execute the 
BoundedSource/UnboundedSource as SDF wrappers. The only ones that don't are 
Spark (underway in #12603) and Dataflow runner v1 which would have a degraded 
experience in batch and would break pipeline update in streaming. This was a 
focus of mine over the past ~6 weeks.
   
   Do they have equivalent performance? My architectural idea is that we can 
have an expansion that looks like:
   
   ```
   Read(BoundedSource) {
     Impulse -> SDF
   }
   ```
   
   preserving all the bounded/unbounded source data but having a portable 
expansion. "Portable" meaning that all non-primitives (like Read) are expanded 
to primitives (SDF). But there is no reason a runner has to support SDF at all, 
if they can recognize the composite and implement it directly.
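
A minimal sketch of that idea, as a toy model rather than Beam code. Every name here (`Transform`, `Runner`, `read_bounded`, the `toy:` URNs) is hypothetical and not a Beam API. It only illustrates a composite that keeps its source payload while carrying an Impulse -> SDF expansion, and a runner that either translates the composite directly or falls back to the expansion:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Toy model only: all names and URNs are hypothetical, not Beam APIs.


@dataclass
class Transform:
    urn: str                                   # identifies the kind of transform
    payload: Optional[list] = None             # e.g. the source data kept on Read
    expansion: List["Transform"] = field(default_factory=list)


def read_bounded(elements: list) -> Transform:
    """Composite Read that keeps its payload and expands to Impulse -> SDF."""
    return Transform(
        urn="toy:read_bounded",
        payload=elements,
        expansion=[
            Transform(urn="toy:impulse"),
            Transform(urn="toy:sdf", payload=elements),
        ],
    )


class Runner:
    """Translates a transform natively if it recognizes the URN, else expands it."""

    def __init__(self, native: Dict[str, Callable]):
        self.native = native
        self.trace: List[str] = []             # which URNs were executed natively

    def run(self, t: Transform, upstream=None):
        if t.urn in self.native:
            self.trace.append(t.urn)
            return self.native[t.urn](t, upstream)
        if not t.expansion:
            raise ValueError(f"unsupported primitive: {t.urn}")
        out = upstream
        for sub in t.expansion:                # run sub-transforms in sequence
            out = self.run(sub, out)
        return out


# A runner that only understands the primitives (Impulse and SDF).
sdf_runner = Runner({
    "toy:impulse": lambda t, up: [b""],
    "toy:sdf": lambda t, up: [e for _ in up for e in t.payload],
})

# A runner with no SDF support that recognizes the Read composite directly.
legacy_runner = Runner({
    "toy:read_bounded": lambda t, up: list(t.payload),
})

read = read_bounded([1, 2, 3])
sdf_result = sdf_runner.run(read)
legacy_result = legacy_runner.run(read)
assert sdf_result == legacy_result == [1, 2, 3]
```

In this sketch both runners produce the same output for the same pipeline; only the translation path differs, which is what `trace` records.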
   
   So my basic idea would be to have all such runners continue to use their 
existing execution paths, unless the runner authors/maintainers believe 
whatever SDF support they have is good enough. FWIW I don't think this is my 
invention. I thought this was the plan to seamlessly support Read and SDF in 
the same portable pipeline, with runners picking and choosing what they can 
execute. CC @xinyuiscool @iemejia @mxm
   
   > > 1. I do not think runners core construction should be manipulated by 
global flags, but each runner should decide what to do with the flag, if 
anything. If a runner has a well-used non-portable variant that is not going 
away any time soon, then it can still use the portable pipeline and just reject 
any with non-Java UDFs. If a runner only supports SDF in some modes but only 
supports Read in others, or wants to translate Read specially for efficiency 
(or any other reason it wants to - it can do whatever it wants) then it can do 
that.
   > 
   > I can do that but it seems like it will lead to possible inconsistency 
across runners around how this choice is made. Currently CHANGES.md says to get 
the old behavior use this flag and it would be unfortunate to say "consult the 
runner documentation as to how to get the old behavior".
   
   There should not be behavior differences. The expansion above should have 
identical PCollection contents whether the runner implements Read or SDF. And 
the runner should make sure that pipelines maintain their performance by using 
the best translation they can. Users should not need to be aware of this, 
except perhaps via debugging interfaces that expose the expansion.
    
   > After a few releases, the intent is that this logic is removed from 
runners-core and everyone except for Dataflow v1 runner always use the SDF 
wrapper and we move the necessary logic within runners-core to the Dataflow 
module deleting the remainder (the PrimitiveBoundedRead override).
   
   My position on runners-core and runners-core-construction is that these are 
just libraries with collections of useful functionality for implementing a 
runner. Neither one should ever be user-visible. Deleting dead code from either 
library is nice-to-have cleanup. If any runner wants to keep the code it can be 
copied into the runner.
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 495574)
    Time Spent: 26h 50m  (was: 26h 40m)

> Make non-portable Splittable DoFn the only option when executing Java "Read" 
> transforms
> ---------------------------------------------------------------------------------------
>
>                 Key: BEAM-10670
>                 URL: https://issues.apache.org/jira/browse/BEAM-10670
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-core
>            Reporter: Luke Cwik
>            Assignee: Luke Cwik
>            Priority: P2
>          Time Spent: 26h 50m
>  Remaining Estimate: 0h
>
> All runners seem to be capable of migrating to splittable DoFn for 
> non-portable execution except for Dataflow runner v1 which will internalize 
> the current primitive read implementation that is shared across runner 
> implementations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
