kennknowles commented on pull request #13006:
URL: https://github.com/apache/beam/pull/13006#issuecomment-703916817


   > #12519 (comment)
   
   
   
   > > > > > @kennknowles
   > > > > > > I did a first pass. I think big picture there are two things:
   > > > > > > 
   > > > > > > 1. Without unfolding all the code that is not part of this code review, I am not already familiar with the expected level of SDF support across all the runners. Some of them seem to have previously rejected non-SDF reads but could now end up receiving non-SDF reads.
   > > > > > 
   > > > > > 
   > > > > > [BEAM-10670](https://issues.apache.org/jira/browse/BEAM-10670) is about migrating all runners to use SDF first and only SDF. All runners that supported BoundedSource/UnboundedSource support SDF to the level necessary to execute the BoundedSource/UnboundedSource as SDF wrappers. The only ones that don't are Spark (underway in #12603) and Dataflow runner v1, which would have a degraded experience in batch and would break pipeline update in streaming. This was a focus of mine over the past ~6 weeks.
   > > > > 
   > > > > 
   > > > > Do they have equivalent performance? My architectural idea is that we can have an expansion that looks like:
   > > > 
   > > > 
   > > > Nexmark did show improvements ([#12519 
(comment)](https://github.com/apache/beam/pull/12519#issuecomment-680278642))
   > > > > ```
   > > > > Read(BoundedSource) {
   > > > >   Impulse -> SDF
   > > > > }
   > > > > ```
   > > > > 
   > > > > 
   > > > > preserving all the bounded/unbounded source data but having a portable expansion. "Portable" meaning that all non-primitives (like Read) are expanded to primitives (SDF). But there is no reason a runner has to support SDF at all, if it can recognize the composite and implement it directly.
   > > > > So my basic idea would be to have all such runners continue to use their existing execution paths, unless the runner authors/maintainers believe whatever SDF support they have is good enough. FWIW I don't think this is my invention. I thought this was the plan to seamlessly support Read and SDF in the same portable pipeline, with runners picking and choosing what they can execute. CC @xinyuiscool @iemejia @mxm
   > > > 
   > > > 
   > > > I brought up the migration to using SDF to power BoundedSource/UnboundedSource on [dev@](https://lists.apache.org/thread.html/r1ba6fe6ac2bd2b28aa7ef31f0d87ad716fc878f2515085fdbc275333%40%3Cdev.beam.apache.org%3E). If you would like to reconsider the migration path, it would make sense to continue this discussion on the original thread.
   > > > > > > 1. I do not think runners-core-construction should be manipulated by global flags; rather, each runner should decide what to do with the flag, if anything. If a runner has a well-used non-portable variant that is not going away any time soon, then it can still use the portable pipeline and just reject any pipeline with non-Java UDFs. If a runner only supports SDF in some modes but only Read in others, or wants to translate Read specially for efficiency (or for any other reason it wants to - it can do whatever it wants), then it can do that.
   > > > > > 
   > > > > > 
   > > > > > I can do that, but it seems like it will lead to inconsistency across runners in how this choice is made. Currently CHANGES.md says to use this flag to get the old behavior, and it would be unfortunate to instead say "consult the runner documentation as to how to get the old behavior".
   > > > > 
   > > > > 
   > > > > There should not be behavior differences. The expansion above should have identical PCollection contents whether the runner implements Read or SDF. And the runner should make sure that pipelines maintain their performance by using the best translation it can. Users should not need to be aware of this, except perhaps via debugging interfaces that expose the expansion.
   > > > 
   > > > 
   > > > The purpose of having the fallback is to give users an opt-out mechanism during the migration should they hit issues. I don't believe our runners have comprehensive enough testing of pipelines at scale to state that the migration will be perfect.
   > > > > > After a few releases, the intent is that this logic is removed from runners-core, everyone except the Dataflow v1 runner always uses the SDF wrapper, and we move the necessary logic within runners-core to the Dataflow module, deleting the remainder (the PrimitiveBoundedRead override).
   > > > > 
   > > > > 
   > > > > My position on runners-core and runners-core-construction is that these are just libraries with collections of useful functionality for implementing a runner. Neither one should ever be user-visible. Deleting dead code from either library is nice-to-have cleanup. If any runner wants to keep the code, it can be copied into the runner.
   > > > 
   > > > 
   > > > I disagree. As someone who went through all our runners changing a core piece, I think dead code would only make Apache Beam harder to maintain.
   > > 
   > > 
   > > "core" is just a name because I/we didn't have a clever name for it. 
Both libraries were created for this purpose and really shouldn't be seen as 
more than a utility library. They are not the definition of any part of the 
Beam model, just helpers.
   > 
   > I didn't mean "core" as in the modules with the name "core" in them but 
the "core" concept of how data is ingested into a pipeline by a runner.
   
   OK. I am only referring to the module name. The modules are just piles of functionality that a runner can use as it pleases. The thing I don't like is having a global user-controlled flag that tweaks runner behavior when the runner may not even be aware of it. Basically, moving a flag from the SDK to the runner is better, but still not great.
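   
   For concreteness, here is a rough sketch of what "recognize the composite and implement it directly" could look like for a runner walking the portable pipeline proto: translate the Read composite natively, and descend into the Impulse -> SDF expansion only for everything else. This is just an illustration, not actual Beam code; the `ReadAwareTranslator` class and the `translateNativeRead`/`translatePrimitive` hooks are hypothetical names.
   
   ```java
   import org.apache.beam.model.pipeline.v1.RunnerApi;
   
   class ReadAwareTranslator {
     // URN of the composite Read transform in the portable pipeline proto.
     private static final String READ_URN = "beam:transform:read:v1";
   
     void translate(RunnerApi.Pipeline pipeline) {
       for (String id : pipeline.getRootTransformIdsList()) {
         translateTransform(pipeline, id);
       }
     }
   
     private void translateTransform(RunnerApi.Pipeline pipeline, String id) {
       RunnerApi.PTransform transform =
           pipeline.getComponents().getTransformsOrThrow(id);
       if (READ_URN.equals(transform.getSpec().getUrn())) {
         // The runner understands Read directly: reuse its existing source
         // machinery and ignore the Impulse -> SDF expansion underneath.
         translateNativeRead(transform);
       } else if (transform.getSubtransformsList().isEmpty()) {
         // Leaf primitive (Impulse, the SDF pieces, GBK, plain ParDo, ...).
         translatePrimitive(transform);
       } else {
         // Any other composite: descend into its expansion.
         for (String subId : transform.getSubtransformsList()) {
           translateTransform(pipeline, subId);
         }
       }
     }
   
     // Hypothetical hooks; a real runner would build its execution plan here.
     private void translateNativeRead(RunnerApi.PTransform readComposite) {}
   
     private void translatePrimitive(RunnerApi.PTransform primitive) {}
   }
   ```
   
   With something like this, the choice stays entirely inside the runner's translation, so no global SDK-level flag needs to leak into runner behavior.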


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

