On 02.01.2019 23:44, John Roesler wrote:
> However, you seem to have a strong intuition that the scatter/gather
> approach is better.
> Is this informed by your actual applications at work? Perhaps you can
> provide an example
> data set and sequence of operations so we can all do the math and agree
> with you.
> It seems like we should have a convincing efficiency argument before
> choosing a more
> complicated API over a simpler one.

The way I see this is simple: if we only provide the basic 
implementation of the 1:n join (repartition by FK, range scan on a 
foreign-table update), then that alone is such a fundamental building block.
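
To make the primitive concrete, here is a minimal sketch (plain Java, 
not the actual Streams internals) of the combined-key layout that makes 
the range scan work; the "fk|pk" string key and the TreeMap standing in 
for the state store are purely illustrative:

    import java.util.NavigableMap;
    import java.util.TreeMap;

    public class RangeScanSketch {
        public static void main(String[] args) {
            // A repartitioned onto a combined (foreignKey, primaryKey) key
            NavigableMap<String, String> aByFk = new TreeMap<>();
            aByFk.put("fk1|a1", "A-record-1");
            aByFk.put("fk1|a2", "A-record-2");
            aByFk.put("fk2|a3", "A-record-3");

            // an update to B's row fk1 is resolved with a single prefix
            // range scan over exactly the A records that reference fk1
            // ('}' is the character after '|', closing the prefix range)
            String fk = "fk1";
            aByFk.subMap(fk + "|", true, fk + "}", false)
                 .forEach((k, v) -> System.out.println(fk + " joins " + v));
        }
    }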

I do A join B.

a.map(retain(key and FK)).oneToManyJoin(B).groupBy(a.key()).join(A) 
pretty much performs all your "wire saving" optimisations. To be honest, 
I don't know whether the ContextAwareMapper() that was discussed at some 
point ever made it in. If it did, I could actually do the high-watermark 
thing: a.contextMap(retain(key, FK and offset)).oneToManyJoin(B). 
aggregate(a.key(), oldest offset wins).join(A).
I can't find the KIP though; I guess it didn't make it.
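
In DSL shape, that composition would look roughly like the following. 
oneToManyJoin is the hypothetical 1:n primitive from above; retain, 
pickByOffset, withOriginal and the AKey/JoinResult types are illustrative 
placeholders, while groupBy, aggregate and join are the existing KTable API:

    // hedged sketch: only the shape of the composition matters
    KTable<AKey, JoinResult> result = a
        .mapValues(v -> v.retain(/* key, FK, offset */))    // contextMap stand-in
        .oneToManyJoin(b)                                   // repartition by FK + range scan
        .groupBy((combinedKey, jr) ->
            KeyValue.pair(combinedKey.aKey(), jr))          // back onto A's key
        .aggregate(
            () -> null,
            (k, jr, agg) -> pickByOffset(jr, agg),          // high-watermark resolution
            (k, jr, agg) -> agg)                            // subtractor, simplified
        .join(a, (jr, original) -> jr.withOriginal(original));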

After the repartition and the range read, the abstraction just becomes 
too weak. I just showed that your implementation is my implementation 
with stuff around it.

I don't know if your scatter/gather approach exists in code somewhere. 
If the join is only applied after the gather phase, I really wonder 
where we get the other record from. Do you also persist the foreign 
table on the original side? Has that been put into code somewhere already?

This would essentially bring B to each of A's tasks. In my case the 
factors for this are rather easy and dramatic. Nevertheless, it is an 
approach I would appreciate. In Hive this is closely related to the 
concept of a MapJoin, something I wish we had in Streams. I have often 
stated that at some point we will need an unbounded amount of offsets 
per topic-partition and group :D Sooooo good.
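
For reference, the nearest thing the current DSL has to a MapJoin is the 
KStream-GlobalKTable join, which ships a complete copy of B to every 
task; there is no KTable-side equivalent, which is exactly the gap here. 
A minimal sketch, assuming String topics "A" and "B" with the FK as A's 
first comma-separated field:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.GlobalKTable;
    import org.apache.kafka.streams.kstream.KStream;

    StreamsBuilder builder = new StreamsBuilder();

    // B is replicated in full to every task, like the small table in a MapJoin
    GlobalKTable<String, String> b = builder.globalTable("B");
    KStream<String, String> a = builder.stream("A");

    // each A record joins locally against the complete copy of B;
    // no repartition, no shuffle of A over the wire
    KStream<String, String> joined = a.join(
        b,
        (aKey, aValue) -> aValue.split(",")[0],    // FK assumed to be first field
        (aValue, bValue) -> aValue + "," + bValue);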

Long story short: I hope you can follow my line of thought, and I hope 
you can clarify my misunderstanding of how the join is performed on the 
A side without materializing B there.

I would love for Streams to get this right. The basic rule I always say 
is: do what Hive does. Done.


>
> Last thought:
>> Regarding what will be observed. I consider it a plus that all events
>> that are in the inputs have an respective output. Whereas your solution
>> might "swallow" events.
>
> I didn't follow this. Following Adam's example, we have two join results: a
> "dead" one and
> a "live" one. If we get the dead one first, both solutions emit it,
> followed by the live result.

There might be multiple dead ones in flight, right? But it doesn't really 
matter; I never did anything with the extra benefit I mentioned.
