Re: Verification about my observation for current State implementation

Jungtaek Lim Tue, 20 Feb 2018 16:40:52 -0800

> You can only change the parallelism (number of executors). AFAIK there is
> no option to change the number of tasks and maybe I am missing if it
> changed recently.


Yeah, it looks like I misunderstood the feature. It only addresses the number
of executors.

I'm wondering what the benefit of having N tasks per executor is. It keeps
confusing me, and I can't find a beneficial use case for it. Most workloads
would be IO bound, and should be covered by asynchronous processing. I hope
we can just simplify this.

Another perspective on rebalance: suppose end users provide a parallelism
bigger than the current task count. If the task count is not changed,
rebalance does not work as users expect. If the task count is changed, we
break the precondition of the statement above.

We should take into consideration that end users modify topology code and
resubmit. In this case the task count can (and should) change, and we
should support state resharding in this case. Without resharding, each task
fails to load all the KVs for the keys routed to it. So this doesn't seem to
be an issue of whether we support dynamic task count changes or not.
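
To illustrate why a changed task count calls for resharding, here is a
minimal sketch. It assumes keys are routed by a stable hash modulo the task
count; the function names and the CRC32 hash are hypothetical stand-ins, not
Storm's actual grouping code.

```python
import zlib


def task_for_key(key: str, task_count: int) -> int:
    """Hypothetical routing: stable hash of the key modulo the task count."""
    return zlib.crc32(key.encode("utf-8")) % task_count


def moved_keys(keys, old_task_count, new_task_count):
    """Return the keys whose owning task index differs between the two counts."""
    return [
        k for k in keys
        if task_for_key(k, old_task_count) != task_for_key(k, new_task_count)
    ]


# Made-up keys, purely for illustration.
keys = [f"user-{i}" for i in range(1000)]
moved = moved_keys(keys, 4, 6)
print(f"{len(moved)} of {len(keys)} keys change tasks going from 4 to 6 tasks")
```

Under this assumption, every key in `moved` has its state stored under the
old task's namespace, so the new owning task would not find it unless the
state is moved.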

- Jungtaek Lim (HeartSaVioR)

ps. btw, the internal structure of a Storm topology is not capable of scaling
task counts without restructuring, since task ids are assigned sequentially
across components. I suspect it would block elasticity, so that should also
be addressed.
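
A rough sketch of the sequential-id problem, assuming task ids are handed
out consecutively, starting at 1, across components in name order. The
component names and the `assign_task_ids` helper are hypothetical and only
model the behavior described above.

```python
def assign_task_ids(task_counts):
    """Assign consecutive task ids across components, in component-name order."""
    assignment = {}
    next_id = 1
    for component in sorted(task_counts):
        count = task_counts[component]
        assignment[component] = list(range(next_id, next_id + count))
        next_id += count
    return assignment


# Adding one task to "bolt-a" shifts every task id of "bolt-b".
before = assign_task_ids({"bolt-a": 2, "bolt-b": 2})
after = assign_task_ids({"bolt-a": 3, "bolt-b": 2})
print(before)
print(after)
```

In this model, growing one component renumbers the tasks of every component
after it, so a state namespace keyed by (component id, task id) no longer
lines up even for components whose task count did not change.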

On Wed, Feb 21, 2018 at 8:33 AM, Arun Mahadevan <[email protected]> wrote:

> >>
> >>1. Right now users need to vary the database/table name in the state
> >provider config per topology.
> >
> >Redis Cluster doesn't support multi database and it is uncommon usage for
> >Redis to use even tens of databases, so end users can't do the needed
> thing
> >with Redis state backend. For HBase state backend they can do, but
> creating
> >table requires administrator permission of HBase, which someone may not
> >want to open to multiple users. So I feel including topology name in
> >namespace looks like "mandatory" rather than "good to have".
>
>
> I think this should be a fairly straightforward change and we should do it.
>
> >>
> >>2. As of now storm only supports rebalancing the tasks (task count
> >remains the same).
> >
> >That is not correct. We are still missed to provide the feature in UI (I'm
> >not familiar with front-end so I had in mind but I didn't work on that)
> but
> >you can rebalance with different parallelism of component via cli.
> >
> >Below is the rebalance command provided via cli. Quoting command line
> >client document[1]:
> >
> >rebalance
> >Syntax: storm rebalance topology-name [-w wait-time-secs] [-n
> >new-num-workers] [-e component=parallelism]*
>
>
> You can only change the parallelism (number of executors). AFAIK there is
> no option to change the number of tasks and maybe I am missing if it
> changed recently. As long as the number of tasks are fixed, we don’t have
> to worry about re-sharding the keys since the namespaces are at task level.
> Autoscaling may or may not change this design (we may only allow increasing
> or reducing the parallelism and not the tasks) and it requires more thought
> on how it would affect the user managed state at core spout/bolt level
> (outside of the higher level abstractions) and so IMO, we can tackle it
> later.
>
> Thanks,
> Arun
>
>
>
> On 2/20/18, 2:57 PM, "Jungtaek Lim" <[email protected]> wrote:
>
> >> 1. Right now users need to vary the database/table name in the state
> >provider config per topology.
> >
> >Redis Cluster doesn't support multi database and it is uncommon usage for
> >Redis to use even tens of databases, so end users can't do the needed
> thing
> >with Redis state backend. For HBase state backend they can do, but
> creating
> >table requires administrator permission of HBase, which someone may not
> >want to open to multiple users. So I feel including topology name in
> >namespace looks like "mandatory" rather than "good to have".
> >
> >> 2. As of now storm only supports rebalancing the tasks (task count
> >remains the same).
> >
> >That is not correct. We are still missed to provide the feature in UI (I'm
> >not familiar with front-end so I had in mind but I didn't work on that)
> but
> >you can rebalance with different parallelism of component via cli.
> >
> >Below is the rebalance command provided via cli. Quoting command line
> >client document[1]:
> >
> >rebalance
> >Syntax: storm rebalance topology-name [-w wait-time-secs] [-n
> >new-num-workers] [-e component=parallelism]*
> >
> >And we're sure that elasticity is "the thing" to provide, and for
> >elasticity we may even want to adjust parallelism on the fly (ideally,
> >would not that easy).
> >
> >> IMO, users should have the flexibility to store multiple key-values in
> >the state in the core API. ... I think this flexibility also helps us to
> >provide useful abstractions on top.
> >
> >I agree most beneficial aspect from core API is flexibility and less
> >restriction. I'm just worrying about state flexibility which should be
> >handled with caution. For example, without proper restriction (especially
> >grouping) duplicated keys in different tasks can happen, and even some end
> >users are brave to create their own state reshard tool, they need to deal
> >with merging the values for same key. Most end users may not recognize the
> >situation and hard to deal with, which doesn't seem friendly.
> >
> >I feel we need to at least document recommended usage (selecting key,
> using
> >field grouping with selected key, selecting same key as key of state KV)
> >with notice/caution effects for other usages, and also need to provide
> >state reshard tool for such usage. The tool can be used for states handled
> >with streams API.
> >
> >> 3. Re-sharding would be based on the keys (re-hashing). The component
> id,
> >task-id would be used to map to the namespace. So I am not clear about the
> >concern you have raised.
> >
> >Yes resharding would be based on the keys, which means keys should be
> moved
> >to different task id. So to move the key in state backend, we need to
> parse
> >and modify the namespace part, and we use delimiter which is allowed to
> >topology name as well as component name. (I don't think namespace is badly
> >composed. The problem is not restricting topology name and component name.
> >We are allowing too many flexibility and being stuck from that.)
> >
> >Windowed state is another thing we should deal with. I still didn't have
> >time to deep dive with, but to support resharding, window (and effectively
> >any states) should be grouped by key so that it can be placed to right
> >(might be new) task after relaunching topology. I'm asking that
> >current window/partition/windowsystem
> >is capable of.
> >
> >> IMO before we do this, we can introduce stateful exactly once processing
> >via the streams API as the first step.
> >
> >While I definitely agree that introducing stateful exactly-once processing
> >is major feature and the thing we should support sooner than later, I'm
> not
> >100% sure about which thing to do first. We're supporting at-least-once
> >stateful processing in 1.x version line and haven't provided "state
> >reshard" feature for a long time which makes actual users directly suffer,
> >so that's another major feature to me.
> >
> >- Jungtaek Lim (HeartSaVioR)
> >
> >1. http://storm.apache.org/releases/1.2.1/Command-line-client.html
> >
> >
> >On Wed, Feb 21, 2018 at 4:22 AM, Arun Iyer <[email protected]> wrote:
> >
> >> correction: task count remains the same (executor count can vary).
> >>
> >>
> >>
> >>
> >> On 2/20/18, 11:20 AM, "Arun Iyer on behalf of Arun Mahadevan" <
> >> [email protected] on behalf of [email protected]> wrote:
> >>
> >> >Hi Jungtaek,
> >> >
> >> >
> >> >1. Right now users need to vary the database/table name in the state
> >> provider config per topology. Agree, its better to include topology in
> the
> >> namespace.
> >> >
> >> >2. IMO, users should have the flexibility to store multiple key-values
> in
> >> the state in the core API. As of now storm only supports rebalancing the
> >> tasks (executor count remains the same). So users need not worry about
> >> re-sharding their state as long as they use the right grouping. I think
> >> this flexibility also helps us to provide useful abstractions on top.
> >> >
> >> >3. Re-sharding would be based on the keys (re-hashing). The component
> id,
> >> task-id would be used to map to the namespace. So I am not clear about
> the
> >> concern you have raised.
> >> >
> >> >We could support dynamic state re-sharding via underlying higher level
> >> abstractions (e.g. Streams API) where we control the namespaces, keys
> etc
> >> and would be more manageable. IMO before we do this, we can introduce
> >> stateful exactly once processing via the streams API as the first step.
> >> >
> >> >Thanks,
> >> >Arun
> >> >
> >> >
> >> >
> >> >On 2/19/18, 4:59 PM, "Jungtaek Lim" <[email protected]> wrote:
> >> >
> >> >>Hi,
> >> >>
> >> >>I'd like to verify my observation on current State implementation is
> >> >>correct, so that we could fix them if necessary and make plan for
> >> >>improvement.
> >> >>
> >> >>1. State is stored with namespace prefix which typically composes to
> >> >>(component id, task id) pair and it doesn't look like having
> >> classification
> >> >>for topology. Is this correct observation? If then I think that's
> worth
> >> to
> >> >>call it as 'critical' and it must be fixed.
> >> >>
> >> >>2. We're allowing end-users to put key of state, and also no
> restriction
> >> >>for grouping on stateful component. I feel such flexibility breaks the
> >> >>possibility to reshard state and end-users are required to implement
> >> their
> >> >>own reshard tool according to their topology state key distribution
> >> logic.
> >> >>I expect it will not happen on streams API (since it should be done
> with
> >> >>keyed stream) but wouldn't it better to also restrict such flexibility
> >> also
> >> >>for core API?
> >> >>
> >> >>3. Suppose we are going to support state resharding (for allowing
> change
> >> of
> >> >>parallelism) and we restrict to apply field grouping with key while
> >> >>connecting stateful component.
> >> >>Then key-value can be moved based on key (though finding and replacing
> >> task
> >> >>id may not be trivial if component name has '-'... we have same issue
> on
> >> >>metric name, so maybe time to restrict characters on topology name as
> >> well
> >> >>as component name?).
> >> >>Is it also true for window/partition/windowsystem state? I didn't
> take a
> >> >>deep look on window state (I would find a time) but it would be great
> if
> >> >>someone knowing the detail makes it clear.
> >> >>
> >> >>Thanks in advance,
> >> >>Jungtaek Lim (HeartSaVioR)
> >> >
> >>
>
>
