[
https://issues.apache.org/jira/browse/APEXMALHAR-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311525#comment-15311525
]
Siyuan Hua commented on APEXMALHAR-2099:
----------------------------------------
[~ilganeli] Also we do have plan for GroupBy, It's just not there yet :) We
need to build the operator fist.
You can follow this ticket
https://issues.apache.org/jira/browse/APEXMALHAR-2085 for details. Actually it
would address more problems.
In terms of high-level API, we categorize all transformations into 2 different
type windowed operations/non-windowed operations. As you might see there is a
WindowedStream interface as well(empty for now but methods will be added
later). For operations like map, filter, window is meaningless, but for some
operations like group, join etc window is a must have(by default it might be
one globalwindow, but still there is a window for that). Further more, we would
divide windowed operations into 3 different types: group, combine, join. Group
needs to retain the tuples in the window(sorting for example), Combine needs to
combine the tuples in the window into one/multiple tuples(max, sum, reduce,
topN etc) and Join needs to merge multiple upstream inputs into one. All in
all, we do have detail plan for that.
Now let's look at the Beam API again, personally I think there are a lot of
mess there, I also looked at all PTransform there(The categorization I have
above actually comes from the analysis of all PTransforms :) ). They don't
have clear hierarchy in their PTransform family. For example window is one of
transformations, that is really against my interpretation of the word
"Transformation". Window alone is meaningless, what we actually want to do is
Sum, Max, Average in a specific Window. Like you said, it's just a strategy in
our next cut of Stream API.
And then, here you will have abstraction in different layers eventually, A
simple PTransform base class DOES NOT work. The reason we categorize
operations into windowed and non-windowed and further break them into
Group/Combine/Join is because operations in same category usually share similar
internal mechanism like state checkpointing, trigger behaviour so on so forth.
You don't want to do repeat work for those. And Beam actually do the same
thing. For example, if you want to do Sum, there is no such thing called
SumTransform, instead they have a number of Sum**Fn classes which implements
CombineFn class which is finally used by some PTransform. We will have
similar method in WindowedStream interface which takes a CombineFn instead of
Sum, Reduce (or maybe have them all) etc.
Thanks for your opinion, let's together make our API rock!
> Identify overlap between Beam API and existing Apex Stream API
> --------------------------------------------------------------
>
> Key: APEXMALHAR-2099
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2099
> Project: Apache Apex Malhar
> Issue Type: Sub-task
> Reporter: Ilya Ganelin
>
> There should be some overlap between the Beam API and the recently released
> Apex Stream API. This task captures the need to understand and document this
> overlap.
> AC:
> * A document or JIRA comment identifying which components of the Beam API are
> implement, similar, or absent within the Apex Stream API.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)