[
https://issues.apache.org/jira/browse/STORM-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15522581#comment-15522581
]
Arun Mahadevan commented on STORM-1961:
---------------------------------------
>Concepts and definitions
>Adding them to README is definitely important. But it is a bit difficult to
>review unless these concepts are carefully defined in the design document
>also. Will help reviewers understand the general direction, benefits and
>constraints of the design.
Will update design doc.
>StreamBuilder
>Ok, it sounds like StreamBuilder is not the right name for it then. Topology
>or TopologyBuilder may be a better name. Seems similar to the TridentTopology
>class. The build() call can still be moved inside the submit() method?
It could be moved, but there are around 7 overloaded submitTopology methods in
StormSubmitter. Since build() returns a StormTopology, any of those methods can
be used. If we wanted to pass the stream builder directly, corresponding
overloads that accept a StreamBuilder would have to be newly added.
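For illustration, a sketch of how building and submitting could look under the current design (StreamBuilder and build() follow this discussion and the PR; exact signatures may still change):

```java
// Sketch against the proposed API; not a final interface.
StreamBuilder builder = new StreamBuilder();
// ... define streams on the builder ...
StormTopology topology = builder.build();
// build() yields a plain StormTopology, so any of the existing
// StormSubmitter.submitTopology overloads works unchanged:
StormSubmitter.submitTopology("my-topology", new Config(), topology);
```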
>The T in Stream<T>
>Do you mean to say the type of values passing through the stream does not
>change? I see from your examples of mapToPair etc., that it is possible for
>types to change as you run through the operators. I would think the input and
>output types would also change in case of aggregations, windowing, etc.
The type of values within a single stream is the same. When you apply
stream.map/flatMap etc., you get a new stream (distinct from the stream you
applied the function on) with a possibly different value type.
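A hedged sketch of this point (names like newStream, ValueMapper, mapToPair, PairStream and Pair follow the design doc/PR and may differ):

```java
// Each transformation returns a NEW stream whose value type can differ;
// the original stream's type parameter never changes.
Stream<String> lines = builder.newStream(spout, new ValueMapper<String>(0));
Stream<Integer> lengths = lines.map(String::length);   // Stream<String> -> Stream<Integer>
PairStream<String, Long> pairs =
    lines.flatMap(line -> Arrays.asList(line.split(" ")))
         .mapToPair(word -> Pair.of(word, 1L));        // value type changes again
// 'lines' itself is untouched: its T is still String.
```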
>flatMapArray
>Supporting arrays natively helps perf due to reduced allocation and GC
>overhead. So it may supersede those concerns.
Generally streams are lazy and realize values only when needed, so
Iterable/Iterator may be more apt. We could add a flatMapArray, but I haven't
generally seen support for arrays in other streams APIs.
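The laziness argument can be seen in Java's own java.util.stream, which an Iterable/Iterator-based flatMap mirrors: intermediate operations realize nothing until a terminal operation pulls values. A minimal, self-contained demonstration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class LazyFlatMapDemo {
    // Counts how many elements the flatMap stage produced before any
    // terminal operation ran; for a lazy stream this is zero.
    public static int producedBeforeTerminal() {
        List<String> produced = new ArrayList<>();
        Stream<String> words = Stream.of("a b", "c d")
                .flatMap(line -> Stream.of(line.split(" ")).peek(produced::add));
        // 'words' has only intermediate operations attached so far,
        // so the splitting/peeking has not executed at all.
        return produced.size();
    }

    public static void main(String[] args) {
        System.out.println(producedBeforeTerminal()); // prints 0
    }
}
```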
>Branching
>The branching example in your doc differs from the example in your last
>comment here. Which one is correct ?
In the attached doc the branch operation example is
Stream<T>[] streams = stream.branch(Predicate<T>... predicates).
The previous comment, "With the branch(..) api, the result type is same as
input type. The values are just forwarded to one of the branches.", says the
same thing. Where's the confusion?
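To make the agreement concrete, a sketch of branch(..) as described in both places (the stream name 'numbers' is hypothetical; the API shape follows the doc):

```java
// branch(..) routes each value to one of the branches based on the
// predicates; every resulting stream has the SAME value type T as
// the input stream.
Stream<Integer>[] branches = numbers.branch(x -> x % 2 == 0,   // evens
                                            x -> x % 2 != 0);  // odds
Stream<Integer> evens = branches[0];
Stream<Integer> odds  = branches[1];
```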
>Parallelism Hints: Can you show examples of how parallelism hints at
>spout/bolt level will then be controlled by the user?
It's the "repartition" api. The word count example in the doc shows this (at
spout and bolt level). There's also an example in storm-starter (part of the
PR). I will add more details in the doc/README.
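A hedged sketch of the word count showing the two places parallelism is controlled, the spout parallelism argument to newStream and repartition(..) downstream (names follow the design doc/PR and may differ):

```java
StreamBuilder builder = new StreamBuilder();
builder.newStream(new TestWordSpout(), new ValueMapper<String>(0), 2) // 2 spout executors
       .repartition(4)                      // 4 parallel downstream tasks
       .mapToPair(word -> Pair.of(word, 1))
       .countByKey()
       .print();
```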
>Windowing & Batching
>Windows are logical groupings from the point of view of the needs of the
>business/user. Batching has to do with efficiency or providing guarantees. I
>can think of 3 kinds of batching for Streaming:
>The units of IO performed by a spout (kafka). Basically buffered reads. And
>units of IO performed by a terminal bolt (HBase/HDFS). Basically flushing of
>writes.
>Units in which things are moved around in the internal queues.
>Unit of processing/delivery (set of tuples) from the processing guarantee
>standpoint. Basically Trident micro-batches.
>None of these 3 types should get implicitly coupled with Windowing boundaries,
>or with each other.
>AFAICT what you refer to as batching (aggregation/join boundaries) is
>windowing itself. IMO those two terms should not be conflated.
>You can have 1hr or 1 day or even longer windows. Any type of batching at
>those intervals won't make sense. Then you can have overlapping windows &
>sliding windows. Batching cannot be overlapping or sliding. A tuple can be
>part of multiple windows but part of only one batch.
That's what I mentioned in the earlier comment. What you are describing as the
unit of transport is independent of windowing. Will explicitly call this out in
the README/design doc.
>Microbatching support
>This statement in the design document :
>The idea is to provide a common typed api to express streaming transformations
>easily and to address both the stormcore and trident use cases
>suggests you are trying to support micro-batching also.
>But there is no information on how it will be supported. Supporting
>micro-batching, AFAICT, is a complex problem unto itself. Which also makes me
>wonder about the rationale for the rejected alternative. I am thinking you
>will face similar issues in this API as well.
I meant addressing the exactly-once semantics provided by Trident. Trident
internally uses batches to strictly order the commits, providing exactly-once
semantics by replaying the last batch with idempotent updates. Trident does
not use micro-batching as a way to address unit-of-transport batching. IMO, it
should be up to the spouts/bolts and the queues to implement unit-of-transport
batching, transparent to the user's logic. Micro-batching should be a separate
discussion and not mixed with the streaming apis.
>Custom Operators
>IMO, how users can easily extend the APIs is something that may need upfront
>thinking as it will impact the design and interfaces. Otherwise we might end
>up with something that is hard for end users.
At a high level we could expose the processor APIs. This can be added
incrementally, since the processor interfaces are not currently exposed to the
user; we should be able to change them if needed to address custom operators.
But custom operators should not affect the syntax of the standard stream APIs
that we provide.
> Come up with streams api for storm core use cases
> -------------------------------------------------
>
> Key: STORM-1961
> URL: https://issues.apache.org/jira/browse/STORM-1961
> Project: Apache Storm
> Issue Type: Sub-task
> Reporter: Arun Mahadevan
> Assignee: Arun Mahadevan
> Attachments: UnifiedStreamapiforStorm.pdf
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)