On 9/1/21 12:41 AM, Ke Wu wrote:
Read does not have a translation in portability, so it needs to be a
primitive transform explicitly implemented by the runner. The
encoding/decoding has to happen in the runner.
Could you help me understand this a bit more? IIRC, the fact that Read
is *NOT* translated in portable mode means exactly that it is a
composite transform rather than a primitive, because all primitive
transforms are required to be translated. In addition, Read is a
composite built on Impulse, which produces dummy bytes [1] to trigger
the subsequent ParDo/ExecutableStage, where decoding of the actual
source happens [2]
Sorry, I was referring to the use_deprecated_read Read transform (not
the SDF one). That is the primitive Read, which has no translation on
the SDK harness side. [1]
[1]
https://lists.apache.org/thread.html/r42284d641a133ead6d80a5af01ac8bd4e01f1fba4197d0018f092f52%40%3Cdev.beam.apache.org%3E
There seems to be no role for the SDK harness with regard to
TestStream, because the elements are already encoded by the
submitting SDK. The coders must match nevertheless, because you can
have Events of a nested type such as
WindowedValue<KV<Integer, KV<Integer, Object>>>, and what will and
what will not get length-prefixed depends on which parts exactly are
"known" (model) coders and which are not. Encoding the whole value as
a single byte array will not work for the consuming SDK harness,
which will expect to see nested KvCoders instead.
I don’t think I fully understand what you are saying here. TestStream
is currently a primitive transform, therefore there is no role for the
SDK harness. This is what the proposal would change: make TestStream a
composite transform consisting of a primitive transform and a
subsequent ParDo that decodes to the desired format.
The problem is that the _desired format_ depends on the (source) coder.
SDK coders and model coders have different representations. The latter
will not be (itself) length-prefixed, but will be recursively
introspected, and only the truly unknown sub-coders will be
length-prefixed. That is why the coder of the TestStream elements has
to be known to the runner.
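A minimal sketch of this recursion, using Beam's Java coder classes
(the pipeline coder below is an assumed example, not taken from this
thread): KvCoder is a model coder, so the runner introspects it and
wraps only the unknown sub-coder, rather than length-prefixing the
whole value.

    import org.apache.beam.sdk.coders.ByteArrayCoder;
    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.LengthPrefixCoder;
    import org.apache.beam.sdk.coders.VarLongCoder;
    import org.apache.beam.sdk.values.KV;

    class WireCoderSketch {
      // Pipeline coder on the SDK side: KvCoder.of(VarLongCoder.of(), myCustomCoder).
      // Runner-side wire coder: KvCoder is a model coder, so it is kept and
      // introspected; only the unknown value sub-coder becomes prefixed bytes.
      static Coder<KV<Long, byte[]>> wireCoder() {
        return KvCoder.of(
            VarLongCoder.of(),
            LengthPrefixCoder.of(ByteArrayCoder.of()));
      }
      // By contrast, LengthPrefixCoder.of(ByteArrayCoder.of()) over the whole
      // KV would be binary-incompatible with the nested form above.
    }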
[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Impulse.java#L39
[2]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L149
On Aug 31, 2021, at 3:21 PM, Jan Lukavský <[email protected]> wrote:
On 9/1/21 12:13 AM, Ke Wu wrote:
Hi Jan,
Here is my understanding,
The runner is brought up by the job server driver, which is up and
running before job submission, i.e. it is job agnostic. Therefore, the
runner does not have any SDK coders available, and artifact staging
only happens for SDK workers.
You are right that Read and TestStream are sources; however, the one
thing that distinguishes them is that the Read transform is a composite
transform, and the decoding happens in the ParDo/ExecutableStage, i.e.
on the SDK worker.
Read does not have a translation in portability, so it needs to be a
primitive transform explicitly implemented by the runner. The
encoding/decoding has to happen in the runner.
The proposal here is also to make the public-facing TestStream
transform a composite transform instead of a primitive, so that the
decoding occurs on the SDK worker side where the SDK coder is
available. The primitive that powers TestStream would be directly
translated by the runner to always produce raw bytes, and these raw
bytes would be decoded on the SDK worker side.
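A hedged sketch of that shape (RawTestStreamPrimitive and
TypedTestStream are illustrative names, not an actual Beam API): the
primitive always outputs the pre-encoded bytes from the payload, and a
ParDo running on the SDK worker decodes them with the SDK coder.

    import org.apache.beam.sdk.coders.ByteArrayCoder;
    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.util.CoderUtils;
    import org.apache.beam.sdk.values.PBegin;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.WindowingStrategy;

    // Hypothetical primitive: translated directly by the runner, always
    // producing the raw encoded bytes carried in the TestStream payload.
    class RawTestStreamPrimitive extends PTransform<PBegin, PCollection<byte[]>> {
      @Override
      public PCollection<byte[]> expand(PBegin input) {
        return PCollection.createPrimitiveOutputInternal(
            input.getPipeline(),
            WindowingStrategy.globalDefault(),
            PCollection.IsBounded.UNBOUNDED,
            ByteArrayCoder.of());
      }
    }

    // Composite wrapper: decoding happens in a ParDo, i.e. on the SDK
    // worker, where the (possibly custom) SDK coder is available.
    class TypedTestStream<T> extends PTransform<PBegin, PCollection<T>> {
      private final Coder<T> coder;

      TypedTestStream(Coder<T> coder) {
        this.coder = coder;
      }

      @Override
      public PCollection<T> expand(PBegin input) {
        return input
            .apply(new RawTestStreamPrimitive())
            .apply(ParDo.of(new DoFn<byte[], T>() {
              @ProcessElement
              public void process(@Element byte[] bytes, OutputReceiver<T> out)
                  throws Exception {
                out.output(CoderUtils.decodeFromByteArray(coder, bytes));
              }
            }))
            .setCoder(coder);
      }
    }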
There seems to be no role for the SDK harness with regard to
TestStream, because the elements are already encoded by the
submitting SDK. The coders must match nevertheless, because you can
have Events of a nested type such as
WindowedValue<KV<Integer, KV<Integer, Object>>>, and what will and
what will not get length-prefixed depends on which parts exactly are
"known" (model) coders and which are not. Encoding the whole value as
a single byte array will not work for the consuming SDK harness,
which will expect to see nested KvCoders instead.
Best,
Ke
On Aug 31, 2021, at 2:56 PM, Jan Lukavský <[email protected]> wrote:
Sorry if I'm missing something obvious, but I don't quite see the
difference between Read and TestStream regarding the discussed
issue with coders. A couple of thoughts:
a) both Read and TestStream are _sources_ - they produce elements
that are consumed by downstream transforms
b) the coder of a particular PCollection is defined by the
Pipeline proto - it is the (client-side) SDK that owns the Pipeline
and that defines all the coders
c) runners must adhere to these coders, because otherwise there is
a risk of coder mismatch, most probably on edges like x-lang
transforms or inlined transforms
I tried the approach of encoding the output of Read into a byte array
as well, but that turns out not to work once there is a (partially)
known coder in play, because the consuming transform (executable
stage) expects to see the wire coder. That is not simply a byte array
coder, because the type of the elements might be, for instance,
KV<K, V>, where KvCoder is one of the ModelCoders. That does not
encode using LengthPrefixCoder and as such will be incompatible with
LengthPrefixCoder(ByteArrayCoder). The TestStream needs to know the
coder of the elements, because it defines exactly where
length-prefixing must or must not be inserted. The logic in
LengthPrefixUnknownCoders [1] is recursive for ModelCoders.
[1]
https://github.com/apache/beam/blob/ff70e740a2155592dfcb302ff6303cc19660a268/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/wire/LengthPrefixUnknownCoders.java#L48
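A small demonstration of the incompatibility, as a sketch with assumed
example values (not code from the thread): the nested wire form and
the whole-value-as-blob form produce different byte layouts for the
same KV.

    import org.apache.beam.sdk.coders.ByteArrayCoder;
    import org.apache.beam.sdk.coders.CoderException;
    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.LengthPrefixCoder;
    import org.apache.beam.sdk.coders.VarLongCoder;
    import org.apache.beam.sdk.util.CoderUtils;
    import org.apache.beam.sdk.values.KV;

    class NestedVsBlob {
      static void demo() throws CoderException {
        KV<Long, byte[]> kv = KV.of(1L, new byte[] {42});
        // Wire form the consuming harness expects: KvCoder is kept, and
        // only the unknown value sub-coder is length-prefixed.
        byte[] nested = CoderUtils.encodeToByteArray(
            KvCoder.of(VarLongCoder.of(),
                       LengthPrefixCoder.of(ByteArrayCoder.of())),
            kv);
        // Whole value as a single prefixed blob: a different byte layout,
        // which a consumer expecting nested KvCoders cannot read.
        byte[] blob = CoderUtils.encodeToByteArray(
            LengthPrefixCoder.of(ByteArrayCoder.of()),
            CoderUtils.encodeToByteArray(
                KvCoder.of(VarLongCoder.of(), ByteArrayCoder.of()), kv));
        assert !java.util.Arrays.equals(nested, blob);
      }
    }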
On 8/31/21 11:29 PM, Ke Wu wrote:
Awesome! Thank you Luke and Robert.
Also created https://issues.apache.org/jira/browse/BEAM-12828 to track
the unit test conversion. I could take it after I update the Samza
runner to support TestStream in portable mode.
On Aug 31, 2021, at 2:05 PM, Robert Bradshaw <[email protected]> wrote:
Created https://issues.apache.org/jira/browse/BEAM-12827 to track this.
+1 to converting tests to just use longs for better coverage for now.
Also, yes, this is very similar to the issues encountered with Reads,
but the solution is a bit simpler, as there is no need for the
TestStream primitive to interact with the decoded version of the
elements (unlike Reads, where the sources often produce elements in
un-encoded form) and no user code to run.
- Robert
On Tue, Aug 31, 2021 at 11:00 AM Jan Lukavský <[email protected]> wrote:
This looks similar (and likely has the same cause) to what I
experienced when making the primitive Read supported by Flink.
The final solution would be to make SDK coders known to the
runner of the same SDK (already discussed in various other
threads). But until then, the solution seems to be something
like [1]. The root cause is that the executable stage expects
its input to be encoded by the SDK harness, and that part is
missing when the transform is inlined (like Read in my case, or
TestStream in your case). The intoWireTypes method simulates
precisely this part - it encodes the PCollection via the coder
defined in the SDK harness and then decodes it via the coder
defined by the runner (the two match on the binary level, but
produce different types).
Jan
[1]
https://github.com/apache/beam/blob/dd7945f9f259a2989f9396d1d7a8dcb122711a52/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingPortablePipelineTranslator.java#L657
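The gist of that simulation, as a minimal sketch (the helper name and
shape are illustrative; [1] is the actual Flink translator code):
re-encode each element with the SDK-side coder and decode it with the
runner-side wire coder, which agree on the binary level but produce
different in-memory types.

    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.coders.CoderException;
    import org.apache.beam.sdk.util.CoderUtils;

    // Simulates the encode step the SDK harness would normally perform:
    // the bytes written by sdkCoder are read back with the runner's wire
    // coder, yielding the representation the executable stage expects.
    static <T, WireT> WireT toWireType(
        T element, Coder<T> sdkCoder, Coder<WireT> wireCoder)
        throws CoderException {
      byte[] bytes = CoderUtils.encodeToByteArray(sdkCoder, element);
      return CoderUtils.decodeFromByteArray(wireCoder, bytes);
    }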
On 8/31/21 7:27 PM, Luke Cwik wrote:
I originally wasn't for making it a composite because it changes
the "graph" structure, but the more I thought about it, the more I
liked it.
On Tue, Aug 31, 2021 at 10:06 AM Robert Bradshaw <[email protected]> wrote:
On Tue, Aug 31, 2021 at 9:18 AM Luke Cwik <[email protected]> wrote:
On Mon, Aug 30, 2021 at 7:07 PM Ke Wu <[email protected]> wrote:
Hello everyone,
This is Ke. I am working on enabling TestStream support for the
Samza Runner in portable mode and discovered something unexpected.
In my implementation for the Samza Runner, a couple of tests are
failing with errors like
java.lang.ClassCastException: java.lang.Integer cannot be
cast to [B
I noticed these tests have the same symptom on the Flink Runner
as well, where they are currently excluded:
https://issues.apache.org/jira/browse/BEAM-12048
https://issues.apache.org/jira/browse/BEAM-12050
After some more digging, I realized that it is caused by the
combination of the following facts:
1. TestStream is a primitive transform; therefore, runners are
supposed to translate it directly. The most intuitive
implementation for each runner is to parse the payload and decode
the TestStream.Events [1] on the runner process, to be handed over
to subsequent stages.
2. When TestStream is used with Integers, i.e. initialized with
VarIntCoder: since VarIntCoder is NOT a registered ModelCoder
[2], it is treated as a custom coder during conversion to the
protobuf pipeline [3] and is replaced with a byte array coder [4]
when the runner sends data to the SDK worker.
3. Therefore an error occurs because the decoded
TestStream.Event has an Integer as its value, but the remote
input receiver is expecting a byte array, causing
java.lang.ClassCastException: java.lang.Integer cannot be
cast to [B
In addition, I tried updating all the failed tests to use Long
instead of Integer, and all of them pass, since VarLongCoder is a
known coder. I do understand that the runner process does not have
user artifacts staged, so it can only use coders in the Beam model
when communicating with the SDK worker process.
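For concreteness, a minimal sketch of the two setups using the
standard TestStream builder API (element values are made up for
illustration):

    import org.apache.beam.sdk.coders.VarIntCoder;
    import org.apache.beam.sdk.coders.VarLongCoder;
    import org.apache.beam.sdk.testing.TestStream;

    class IntVsLongTestStream {
      // Fails in portable mode: VarIntCoder is not a model coder, so it is
      // replaced with length-prefixed bytes on the runner side, and the
      // decoded Integer events no longer match the expected byte[] input.
      static final TestStream<Integer> INTS =
          TestStream.create(VarIntCoder.of())
              .addElements(1, 2, 3)
              .advanceWatermarkToInfinity();

      // Passes: VarLongCoder is a registered model coder, so the runner
      // and the SDK worker agree on the encoding.
      static final TestStream<Long> LONGS =
          TestStream.create(VarLongCoder.of())
              .addElements(1L, 2L, 3L)
              .advanceWatermarkToInfinity();
    }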
A couple of questions on this:
1. Is it expected that VarIntCoder is not a known coder?
Yes, since no one has worked on making it a well-known coder.
The notion of "integer" vs. "long" is also a language-specific
detail, so I'm not sure it makes sense as a well-known coder.
It can be made a well-known coder, and this would solve the
immediate problem, but not the long-term issue of portable
TestStream not supporting arbitrary types.
+1. Rather than making the coder a property of TestStream, I would
be in favor of the TestStream primitive always producing bytes
(basically, by definition), and providing a composite that consists
of this followed by a decoding step, to give us a typed TestStream.
2. Is the TestStream payload always supposed to be translated as
raw bytes, so that the runner process can always send it to the
SDK worker with the default byte array coder and ask the SDK
worker to decode accordingly?
Having the runner treat it always as bytes and not T is likely
the best solution but isn't necessary.
3. If yes to 2), does that mean TestStream needs to be translated
in a completely different way in portable mode than in classic
mode, since in classic mode the translator can directly translate
the payload to its final format?
There are a few ways to fix the current implementation to work
for all types. One way would be to require the encoded_element
to use the "nested" encoding, and to then ensure that the runner
uses a WindowedValue<ByteArrayCoder in outer context> and the SDK
uses WindowedValue<T> (note that this isn't
WindowedValue<LengthPrefix<T>>) for the wire coders. This is
quite annoying because the runner inserts length prefixing in a
lot of places (effectively every time it sees an unknown type),
so we would need to special-case this and propagate the
correction through any runner-native transforms (e.g. GBK) until
the SDK consumes it.
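A sketch of that first wire-coder pair, using Beam's WindowedValue
coder helpers (the wrapper methods are illustrative, and this glosses
over the nested vs. outer context detail mentioned above):

    import org.apache.beam.sdk.coders.ByteArrayCoder;
    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
    import org.apache.beam.sdk.util.WindowedValue;

    class AsymmetricWireCoders {
      // Runner side: elements stay opaque bytes.
      static Coder<WindowedValue<byte[]>> runnerWireCoder(
          Coder<? extends BoundedWindow> windowCoder) {
        return WindowedValue.getFullCoder(ByteArrayCoder.of(), windowCoder);
      }

      // SDK side: elements carry the real type T; note that neither side
      // uses WindowedValue<LengthPrefix<T>> here.
      static <T> Coder<WindowedValue<T>> sdkWireCoder(
          Coder<T> elementCoder, Coder<? extends BoundedWindow> windowCoder) {
        return WindowedValue.getFullCoder(elementCoder, windowCoder);
      }
    }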
Another way would be to ensure that the SDK always uses
LengthPrefix<T> as the PCollection encoding and as the
encoded_element format. This would mean that the runner can
translate it to a T if it so chooses, and won't need the
annoying special-case propagation logic. However, this leaks the
length prefixing into the SDK at graph construction time, which
is not what it was meant for.
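The second option, sketched in the same terms (again illustrative, not
an actual API):

    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.coders.LengthPrefixCoder;

    class PrefixedPCollectionCoder {
      // The SDK declares the PCollection coder as LengthPrefix<T> at graph
      // construction time, so the runner sees a length-delimited encoding
      // it can pass through as bytes, or decode to T if it knows the
      // sub-coder.
      static <T> Coder<T> testStreamCoder(Coder<T> elementCoder) {
        return LengthPrefixCoder.of(elementCoder);
      }
    }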
Swapping to an existing well-known type is by far the easiest
approach, as you discovered, and won't impact the correctness of
the tests.
Best,
Ke
[1]
https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/TestStreamTranslation.java#L52
[2]
https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/ModelCoderRegistrar.java#L65
[3]
https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/CoderTranslation.java#L99
[4]
https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/wire/WireCoders.java#L93