Re: Primitive Read not working with Flink portable runner

Jan Lukavský Sun, 01 Aug 2021 11:34:21 -0700

Hi,

I have figured out another way of fixing the problem without modifying ModelCoders. It consists of creating a JavaSDKCoderTranslatorRegistrar [1] and fixing LengthPrefixUnknownCoders [2]. Would this be a better approach?

Jan

[1] https://github.com/apache/beam/pull/15181/files#diff-e4df94a4e799e14a76ada42506aacb8cb7567c84349acacd6126c64ed03de062R27

[2] https://github.com/apache/beam/pull/15181/files#diff-64103a1eabf2872230e5df56cf02d535c4146f5a3f67c51c261433e4caa9a972R63
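For readers unfamiliar with the idea behind length-prefixing unknown coders, here is a minimal sketch. It is not the code from the pull request and not Beam's LengthPrefixUnknownCoders. CoderSpec, lengthPrefixUnknown and every "sketch:..." URN are hypothetical names invented for illustration. The assumption is that a runner walks a coder tree and wraps each coder it cannot decode in a length-prefix coder, so that it can pass the bytes through opaquely.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only; not Beam's LengthPrefixUnknownCoders. All names and URNs are hypothetical. */
final class LengthPrefixSketch {
  static final String LENGTH_PREFIX_URN = "sketch:coder:length_prefix";

  /** Minimal stand-in for a coder node in a portable pipeline: a URN plus component coders. */
  static final class CoderSpec {
    final String urn;
    final List<CoderSpec> components;

    CoderSpec(String urn, CoderSpec... components) {
      this.urn = urn;
      this.components = Arrays.asList(components);
    }

    @Override
    public String toString() {
      return components.isEmpty() ? urn : urn + components;
    }
  }

  /** Wraps every coder whose URN the runner does not know in a length-prefix coder. */
  static CoderSpec lengthPrefixUnknown(CoderSpec coder, Set<String> knownUrns) {
    if (coder.urn.equals(LENGTH_PREFIX_URN)) {
      return coder; // already opaque to the runner, leave as-is
    }
    if (!knownUrns.contains(coder.urn)) {
      // The runner cannot decode this coder, so it only needs to know the byte length.
      return new CoderSpec(LENGTH_PREFIX_URN, coder);
    }
    // Known coder: keep it and recurse into its components.
    CoderSpec[] rewritten = new CoderSpec[coder.components.size()];
    for (int i = 0; i < rewritten.length; i++) {
      rewritten[i] = lengthPrefixUnknown(coder.components.get(i), knownUrns);
    }
    return new CoderSpec(coder.urn, rewritten);
  }

  public static void main(String[] args) {
    Set<String> known = new HashSet<>(Arrays.asList("sketch:coder:kv", "sketch:coder:bytes"));
    CoderSpec kv =
        new CoderSpec(
            "sketch:coder:kv",
            new CoderSpec("sketch:coder:bytes"),
            new CoderSpec("sketch:coder:custom_java"));
    System.out.println(lengthPrefixUnknown(kv, known));
  }
}
```

In this sketch, which coders get wrapped depends only on the contents of the known set, which is why the thread keeps returning to the question of which coders count as "known" to a Java runner.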


On 7/29/21 7:54 PM, Jan Lukavský wrote:

On 7/29/21 6:45 PM, Robert Bradshaw wrote:
On Thu, Jul 29, 2021 at 3:04 AM Jan Lukavský <[email protected]> wrote:
Hi,
I'd like to move the discussion of this topic further. Because it seems that fixing the portable SDF is a larger piece of work, I think there are two options:
+1
a) extend the definition of model coders to include SDK coders of the language that implements the model (that would mean that the definition of model coder is not "language agnostic coders", but "coders that a given SDK can instantiate"), or
b) make the model coders extensible so that a runner can modify them - that would make it possible for each runner to have a slightly different definition of these model coders
I'm strongly in favor of a), but I can live with b) as well.
We should probably just rename "ModelCoders" to
"JavaCoders[Registrar]" and stick everything there. ModelCoders is not
understood or used by anything but Java. (That or we just discard the
whole ModelCoders thing and just let Coders define their own portable
representations, possibly with a registration system.)
Coders must be Serializable, so it seems to me that all Java Coders are quite easily serialized and a registration is not exactly needed for that. Renaming ModelCoders to Java(Portable)Coders looks good to me.
Thanks in advance for any comments on this.

  Jan
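The point that Java coders need no registration because they are Serializable can be illustrated with plain Java serialization. This is a minimal sketch, not Beam code. DelimitedStringCoder is a hypothetical stand-in for a custom SDK coder, and toBytes/fromBytes are illustrative helpers, not Beam APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Illustrative only: round-trips a Serializable coder-like object through bytes. */
final class SerializableCoderSketch {

  /** Hypothetical stand-in for a custom Java SDK coder. */
  static final class DelimitedStringCoder implements Serializable {
    private static final long serialVersionUID = 1L;
    final String delimiter;

    DelimitedStringCoder(String delimiter) {
      this.delimiter = delimiter;
    }
  }

  /** Serializes the coder with standard Java serialization. */
  static byte[] toBytes(Serializable coder) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(coder);
    } catch (IOException e) {
      throw new IllegalStateException("Unable to serialize coder", e);
    }
    return bytes.toByteArray();
  }

  /** Restores the coder; works only where the coder's class is on the classpath. */
  static Object fromBytes(byte[] payload) {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(payload))) {
      return in.readObject();
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("Unable to deserialize coder", e);
    }
  }

  public static void main(String[] args) {
    byte[] payload = toBytes(new DelimitedStringCoder("|"));
    DelimitedStringCoder restored = (DelimitedStringCoder) fromBytes(payload);
    System.out.println("restored delimiter: " + restored.delimiter);
  }
}
```

The round trip only works in a JVM that has the coder's class available, which matches the observation elsewhere in the thread that this route is open to Java runners but not to runners written in Python or C++.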

On 7/25/21 8:59 PM, Jan Lukavský wrote:
I didn't want to say that Flink should not support SDF. I only do not see any benefits of it for a native streaming source - like Kafka - without the ability to use dynamic splitting. The potential benefits of composability and extensibility do not apply here. Yes, it would be good to have as few source transforms as possible. And another yes, there probably isn't anything that would fundamentally prevent Flink from correctly supporting SDF. On the other hand, the current state is such that we cannot use KafkaIO in Flink. I think we should fix this by the shortest possible path, because the technically correct solution is currently unknown (at least to me; if anyone can give pointers about how to fix the SDF, I'd be grateful).
I still think that enabling a runner to support Read natively, when appropriate, has value by itself. And it requires SDK Coders to be 'known' to the runner, at least that was the result of my tests.
On 7/25/21 8:31 PM, Chamikara Jayalath wrote:



On Sun, Jul 25, 2021 at 11:09 AM Jan Lukavský <[email protected]> wrote:
In general, language-neutral APIs and protocols are a key feature of portable Beam.
Yes, sure, that is well understood. But language-neutral APIs require a language-neutral environment. That is why the portable Pipeline representation is built around protocol buffers and gRPC. That is truly language-neutral. Once we implement something around that - like in the case of ModelCoders.java - we use a specific language for that and the language-neutral part is already gone. The decision to include same-language SDK coders in such a language-specific object plays no role in the fact that it is already language-specific.
Not all runners are implemented using Java. For example, the portable DirectRunner (FnAPI runner) is implemented using Python and Dataflow is implemented using C++. Such runners will not be able to do this.
Yes, I'm aware of that and that is why I said "any Java native runner". It is true that non-Java runners *must* (as long as we don't include Read in the SDK harness) resort to expanding it to SDF. That is why use_deprecated_read is an invalid setting for such runners and should be handled accordingly.
Similarly, I think there were previous discussions related to using SDF as the source framework for portable runners.
Don't get me wrong, I'm not trying to revoke this decision. On the other hand, I still think that the decision whether to use the SDF implementation of Read should be left to the runner.
I understand that there are some bugs related to SDF and portable Flink currently. How much work do you think is needed here? Will it be better to focus our efforts on fixing the remaining issues for SDF and portable runners instead of supporting "use_deprecated_read" for that path?
I'm not sure. I don't know portability and the SDK harness well enough to be able to answer this. But we should really know why we do that. What exactly does SDF bring to the Flink runner (and let's leave Flink aside - what does it bring to runners that cannot make use of dynamic splitting, admittedly a very cool feature)? Yes, supporting Java Read makes it impossible to implement it in Python. But practically, I think that most Pipelines will use x-lang for that. It makes a lot of sense to offload IOs to a more performant environment.
A bit old, but please see the following for the benefits of SDF and the motivation for it.
https://beam.apache.org/blog/splittable-do-fn/
https://s.apache.org/splittable-do-fn

Thanks,
Cham
  Jan

On 7/25/21 6:54 PM, Chamikara Jayalath wrote:



On Sun, Jul 25, 2021 at 6:33 AM Jan Lukavský <[email protected]> wrote:
I'll start from the end.
I don't think we should be breaking language agnostic API layers (for example, definition of model coders) just to support "use_deprecated_read".
"Breaking" and "fixing" can only be a matter of the definition of the object at hand. I don't think that a Coder can be totally language agnostic - yes, the mapping between serialized form and deserialized form can be _defined_ in a language-agnostic way, but it must be _implemented_ in a specific language. If we choose the implementing language, what makes us treat SDK-specific coders defined by the SDK of the same language as "unknown"? It is only our decision, one that seems to have no practical benefits.
In general, language-neutral APIs and protocols are a key feature of portable Beam. See here: https://beam.apache.org/roadmap/portability/ (I did not look into all the old discussions and votes related to this but I'm sure they are there)
Moreover, including SDK-specific coders in the supported coders of the SDK's runner construction counterpart (that is, runner core-construction-java for the Java SDK) is a necessary prerequisite for unifying "classical" and "portable" runners, because the runner needs to understand *all* SDK coders so that it can _inline_ the complete Pipeline (if the Pipeline SDK has the same language as the runner), instead of running it through the SDK harness. This need is therefore not specific to supporting use_deprecated_read, but is a generic requirement, which merely has its first manifestation in the support of a transform not supported by the SDK harness.
I think "use_deprecated_read" should be considered a stop-gap measure for Flink (and Spark?) till we have proper support for SDF. In fact I don't think an arbitrary portable runner can support "use_deprecated_read" due to the following.
There seems to be nothing special about Flink regarding the support of primitive Read. I think any Java native runner can implement it pretty much the same way as Flink does. The question is if any other runner might want to do that. The problem with Flink is that
Not all runners are implemented using Java. For example, the portable DirectRunner (FnAPI runner) is implemented using Python and Dataflow is implemented using C++. Such runners will not be able to do this.
  1) portable SDF seems not to work [1]
2) even the classical Flink runner still has issues with SDF - there are reports of the watermark being stuck when reading data via SDF; this gets resolved using use_deprecated_read
3) Flink actually does not have any benefits from SDF, because it cannot make use of dynamic splitting, so this actually brings only implementation burden without any practical benefit
Similarly, I think there were previous discussions related to using SDF as the source framework for portable runners. I understand that there are some bugs related to SDF and portable Flink currently. How much work do you think is needed here? Will it be better to focus our efforts on fixing the remaining issues for SDF and portable runners instead of supporting "use_deprecated_read" for that path? Note that I'm fine with fixing any issues related to "use_deprecated_read" for classic (non-portable) Flink but I think you are trying to use x-lang hence probably need portable Flink.
Thanks,
Cham
I think that we should revisit the decision to deprecate Read - if we can implement it via SDF, what is the reason to forbid a runner from making use of a simpler implementation? The expansion of Read might be runner dependent, that is something we do all the time, or am I missing something?
  Jan

[1] https://issues.apache.org/jira/browse/BEAM-10940

On 7/25/21 1:38 AM, Chamikara Jayalath wrote:
I think we might be going down a bit of a rabbit hole with the support for "use_deprecated_read" for portable Flink :)
I think "use_deprecated_read" should be considered a stop-gap measure for Flink (and Spark?) till we have proper support for SDF. In fact I don't think an arbitrary portable runner can support "use_deprecated_read" due to the following.
(1) The SDK harness is not aware of BoundedSource/UnboundedSource. The only source framework the SDK harness is aware of is SDF.
(2) Invoking BoundedSource/UnboundedSource is not a part of the FnAPI.
(3) A non-Java Beam portable runner will probably not be able to directly invoke legacy Read transforms similar to the way Flink does today.
I don't think we should be breaking language agnostic API layers (for example, definition of model coders) just to support "use_deprecated_read".
Thanks,
Cham
On Sat, Jul 24, 2021 at 11:50 AM Jan Lukavský <[email protected]> wrote:
On 7/24/21 12:34 AM, Robert Bradshaw wrote:
On Thu, Jul 22, 2021 at 10:20 AM Jan Lukavský <[email protected]> wrote:
Hi,
this was a ride. But I managed to get that working. I'd like to discuss two points, though:
a) I had to push Java coders to ModelCoders for Java (which makes sense to me, but is that correct?). See [1]. It is needed so that the Read transform (executed directly in TaskManager) can correctly communicate with the Java SDK harness using custom coders (which is tested here [2]).
I think the intent was that ModelCoders represent the set of
language-agnostic coders in the model, though I have to admit I've always
been a bit fuzzy on when a coder must or must not be in that list.
I think that this definition works as long as the runner does not itself interfere with the Pipeline. Once the runner starts (by itself, not via
SdkHarnessClient) producing data, it starts to be part of the
environment, and therefore it should understand its own Coders. I'd
propose the definition of "model coders" to be Coders that the SDK is
able to understand, which then works naturally for the ModelCoders
located in "core-construction-java", that it should understand Java SDK
Coders.
b) I'd strongly prefer if we moved the handling of use_deprecated_read from outside of the Read PTransform directly into the expand method, see [3]. Though this is not needed for the Read on Flink to work, it seems cleaner.
WDYT?
The default value of use_deprecated_read should depend on the runner
(e.g. some runners don't work well with it, others require it). As
such should not be visible to the PTransform's expand.
I think we should know what is the expected outcome. If a runner does
not support primitive Read (and therefore use_deprecated_read), what
should we do, if we have such experiment set? Should the Pipeline fail, or should it be silently ignored? I think that we should fail, because the user expects something that cannot be fulfilled. Therefore, we have two
options - handling the experiment explicitly in runners that do not
support it, or handle it explicitly in all cases (both supported and
unsupported). The latter case is when we force runners to call an explicit conversion method (convertPrimitiveRead....). Every runner that does not support primitive Read must handle the experiment either way, because otherwise the experiment would be simply silently ignored, which is not
exactly user-friendly.
   Jan
[1] https://github.com/apache/beam/pull/15181/commits/394ddc3fdbaacc805d8f7ce02ad2698953f34375
[2] https://github.com/apache/beam/pull/15181/files#diff-b1ec58edff6c096481ff336f6fc96e7ba5bcb740dff56c72606ff4f8f0bf85f3R201
[3] https://github.com/apache/beam/pull/15181/commits/f1d3fd0217e5513995a72e92f68fe3d1d665c5bb
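The "fail rather than silently ignore" handling of the use_deprecated_read experiment argued for above could look roughly like the sketch below. It is not Beam code. resolveUsePrimitiveRead and the runnerSupportsPrimitiveRead flag are hypothetical, and only the experiment name comes from the discussion.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative only: fail fast when an experiment cannot be honored by the runner. */
final class DeprecatedReadCheckSketch {
  static final String EXPERIMENT = "use_deprecated_read";

  /**
   * Returns whether the primitive Read should be used. Throws instead of silently ignoring
   * the experiment when the (hypothetical) runner capability flag says it is unsupported.
   */
  static boolean resolveUsePrimitiveRead(
      List<String> experiments, boolean runnerSupportsPrimitiveRead) {
    boolean requested = experiments.contains(EXPERIMENT);
    if (requested && !runnerSupportsPrimitiveRead) {
      throw new IllegalArgumentException(
          "Experiment '" + EXPERIMENT + "' was set, but this runner cannot execute primitive Read");
    }
    return requested;
  }

  public static void main(String[] args) {
    // A Java-native runner that supports primitive Read honors the experiment.
    System.out.println(resolveUsePrimitiveRead(Arrays.asList(EXPERIMENT), true));
    // Without the experiment, Read is left to be expanded to SDF.
    System.out.println(resolveUsePrimitiveRead(Collections.<String>emptyList(), false));
  }
}
```

Where such a check lives (inside the Read expansion or in each runner) is exactly the open question in the exchange above, and the sketch takes no side on it.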
On 7/18/21 6:29 PM, Jan Lukavský wrote:

Hi,
I was debugging the issue and it relates to pipeline fusion - it seems that the primitive Read transform gets fused and then is 'missing' as source. I'm a little lost in the code, but the most strange parts are that:
a) I tried to reject fusion of primitive Read by adding GreedyPCollectionFusers::cannotFuse for PTransformTranslation.READ_TRANSFORM_URN to GreedyPCollectionFusers.URN_FUSIBILITY_CHECKERS, but that didn't change the exception
b) I tried adding Reshuffle.viaRandomKey between Read and PAssert, but that didn't change it either
c) when I run a portable Pipeline with use_deprecated_read on Flink it actually runs (though it fails when it actually reads any data, but if the input is empty, the job runs), so it does not hit the same issue, which is a mystery to me
If anyone has any pointers that I can investigate, I'd be really grateful.
Thanks in advance,

   Jan



On 7/16/21 2:00 PM, Jan Lukavský wrote:

Hi,
I hit another issue with the portable Flink runner. Long story short - reading from Kafka is not working in portable Flink. I first had to solve issues with the expansion service configuration (the ability to add the use_deprecated_read option), because the Flink portable runner has issues with SDF [1], [2]. After being able to inject use_deprecated_read into the expansion service I was able to get an execution DAG that has the UnboundedSource, but then more and more issues appeared (probably related to a missing LengthPrefixCoder somewhere - maybe at the output from the primitive Read). I wanted to create a test for it and I found out that there actually is a ReadSourcePortableTest in FlinkRunner, but _it tests nothing_. The problem is that Read is transformed to SDF, so this test tests the SDF, not the Read transform. As a result, the Read transform does not work.
I tried using convertReadBasedSplittableDoFnsToPrimitiveReads so that I could make the test fail and debug that, but I got into
java.lang.IllegalArgumentException: PCollectionNodes[PCollectionNode{id=PAssert$0/GroupGlobally/ParDo(ToSingletonIterables)/ParMultiDo(ToSingletonIterables).output,PCollection=unique_name:"PAssert$0/GroupGlobally/ParDo(ToSingletonIterables)/ParMultiDo(ToSingletonIterables).output"
coder_id: "IterableCoder"
is_bounded: BOUNDED
windowing_strategy_id: "WindowingStrategy(GlobalWindows)"
}] were consumed but never produced


which gave me the last knock-out. :)
My current impression is that starting from Beam 2.25.0, the portable FlinkRunner is not able to read from Kafka. Could someone give me a hint about what is wrong with using convertReadBasedSplittableDoFnsToPrimitiveReads in the test [3]?
   Jan

[1] https://issues.apache.org/jira/browse/BEAM-11991

[2] https://issues.apache.org/jira/browse/BEAM-11998

[3] https://github.com/apache/beam/pull/15181
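The "were consumed but never produced" failure quoted above suggests a graph invariant: every PCollection that some transform reads must be the output of another transform. The sketch below illustrates that kind of check. It is not Beam's actual validation code. TransformNode, consumedButNeverProduced and the PCollection ids are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: finds PCollection ids that are read by a transform but written by none. */
final class DanglingInputSketch {

  /** Hypothetical stand-in for a transform with named input and output PCollections. */
  static final class TransformNode {
    final String name;
    final List<String> inputs;
    final List<String> outputs;

    TransformNode(String name, List<String> inputs, List<String> outputs) {
      this.name = name;
      this.inputs = inputs;
      this.outputs = outputs;
    }
  }

  static Set<String> consumedButNeverProduced(List<TransformNode> transforms) {
    Set<String> produced = new HashSet<>();
    for (TransformNode t : transforms) {
      produced.addAll(t.outputs);
    }
    Set<String> dangling = new TreeSet<>(); // sorted for stable error messages
    for (TransformNode t : transforms) {
      for (String input : t.inputs) {
        if (!produced.contains(input)) {
          dangling.add(input);
        }
      }
    }
    return dangling;
  }

  public static void main(String[] args) {
    List<TransformNode> graph = new ArrayList<>();
    graph.add(new TransformNode("Read", Arrays.<String>asList(), Arrays.asList("read.out")));
    graph.add(new TransformNode("ParDo", Arrays.asList("read.out"), Arrays.asList("pardo.out")));
    System.out.println("complete graph: " + consumedButNeverProduced(graph));

    // If the producing transform is lost (for example during rewriting or fusion),
    // its former output becomes a dangling input of the downstream transform.
    graph.remove(0);
    System.out.println("Read removed:   " + consumedButNeverProduced(graph));
  }
}
```

Under this reading, the exception means that a rewrite step dropped or replaced a producer without rewiring its consumers, which would be consistent with the later finding in the thread that the primitive Read goes 'missing' as a source during fusion.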
