>>>[Jungtaek]
>>2. As of now storm only supports rebalancing the tasks (task count
>>remains the same).
>>
>>That is not correct. We are still missed to provide the feature in UI (I'm
>>not familiar with front-end so I had in mind but I didn't work on that)
but
>>you can rebalance with different parallelism of component via cli.
>>
>>Below is the rebalance command provided via cli. Quoting command line
>>client document[1]:
>>
>>rebalance
>>Syntax: storm rebalance topology-name [-w wait-time-secs] [-n
>>new-num-workers] [-e component=parallelism]*


>[Arun] You can only change the parallelism (number of executors). AFAIK
there is no option to change the number of tasks and maybe I am missing if
it changed recently. As long as the number of tasks are >fixed, we don’t
have to worry about re-sharding the keys since the namespaces are at task
level. Autoscaling may or may not change this design (we may only allow
increasing or reducing the >parallelism and not the tasks) and it requires
more thought on how it would affect the user managed state at core
spout/bolt level (outside of the higher level abstractions) and so IMO, we
can tackle it >later.

That is right. Incase of rebalance, parallelism is about changing the no of
executors  and no of tasks remains same.



On Wed, Feb 21, 2018 at 5:03 AM, Arun Mahadevan <ar...@apache.org> wrote:

> >>
> >>1. Right now users need to vary the database/table name in the state
> >provider config per topology.
> >
> >Redis Cluster doesn't support multi database and it is uncommon usage for
> >Redis to use even tens of databases, so end users can't do the needed
> thing
> >with Redis state backend. For HBase state backend they can do, but
> creating
> >table requires administrator permission of HBase, which someone may not
> >want to open to multiple users. So I feel including topology name in
> >namespace looks like "mandatory" rather than "good to have".
>
>
> I think this should be a fairly straightforward change and we should do it.
>
> >>
> >>2. As of now storm only supports rebalancing the tasks (task count
> >remains the same).
> >
> >That is not correct. We are still missed to provide the feature in UI (I'm
> >not familiar with front-end so I had in mind but I didn't work on that)
> but
> >you can rebalance with different parallelism of component via cli.
> >
> >Below is the rebalance command provided via cli. Quoting command line
> >client document[1]:
> >
> >rebalance
> >Syntax: storm rebalance topology-name [-w wait-time-secs] [-n
> >new-num-workers] [-e component=parallelism]*
>
>
> You can only change the parallelism (number of executors). AFAIK there is
> no option to change the number of tasks and maybe I am missing if it
> changed recently. As long as the number of tasks are fixed, we don’t have
> to worry about re-sharding the keys since the namespaces are at task level.
> Autoscaling may or may not change this design (we may only allow increasing
> or reducing the parallelism and not the tasks) and it requires more thought
> on how it would affect the user managed state at core spout/bolt level
> (outside of the higher level abstractions) and so IMO, we can tackle it
> later.
>
> Thanks,
> Arun
>
>
>
> On 2/20/18, 2:57 PM, "Jungtaek Lim" <kabh...@gmail.com> wrote:
>
> >> 1. Right now users need to vary the database/table name in the state
> >provider config per topology.
> >
> >Redis Cluster doesn't support multi database and it is uncommon usage for
> >Redis to use even tens of databases, so end users can't do the needed
> thing
> >with Redis state backend. For HBase state backend they can do, but
> creating
> >table requires administrator permission of HBase, which someone may not
> >want to open to multiple users. So I feel including topology name in
> >namespace looks like "mandatory" rather than "good to have".
> >
> >> 2. As of now storm only supports rebalancing the tasks (task count
> >remains the same).
> >
> >That is not correct. We are still missed to provide the feature in UI (I'm
> >not familiar with front-end so I had in mind but I didn't work on that)
> but
> >you can rebalance with different parallelism of component via cli.
> >
> >Below is the rebalance command provided via cli. Quoting command line
> >client document[1]:
> >
> >rebalance
> >Syntax: storm rebalance topology-name [-w wait-time-secs] [-n
> >new-num-workers] [-e component=parallelism]*
> >
> >And we're sure that elasticity is "the thing" to provide, and for
> >elasticity we may even want to adjust parallelism on the fly (ideally,
> >would not that easy).
> >
> >> IMO, users should have the flexibility to store multiple key-values in
> >the state in the core API. ... I think this flexibility also helps us to
> >provide useful abstractions on top.
> >
> >I agree most beneficial aspect from core API is flexibility and less
> >restriction. I'm just worrying about state flexibility which should be
> >handled with caution. For example, without proper restriction (especially
> >grouping) duplicated keys in different tasks can happen, and even some end
> >users are brave to create their own state reshard tool, they need to deal
> >with merging the values for same key. Most end users may not recognize the
> >situation and hard to deal with, which doesn't seem friendly.
> >
> >I feel we need to at least document recommended usage (selecting key,
> using
> >field grouping with selected key, selecting same key as key of state KV)
> >with notice/caution effects for other usages, and also need to provide
> >state reshard tool for such usage. The tool can be used for states handled
> >with streams API.
> >
> >> 3. Re-sharding would be based on the keys (re-hashing). The component
> id,
> >task-id would be used to map to the namespace. So I am not clear about the
> >concern you have raised.
> >
> >Yes resharding would be based on the keys, which means keys should be
> moved
> >to different task id. So to move the key in state backend, we need to
> parse
> >and modify the namespace part, and we use delimiter which is allowed to
> >topology name as well as component name. (I don't think namespace is badly
> >composed. The problem is not restricting topology name and component name.
> >We are allowing too many flexibility and being stuck from that.)
> >
> >Windowed state is another thing we should deal with. I still didn't have
> >time to deep dive with, but to support resharding, window (and effectively
> >any states) should be grouped by key so that it can be placed to right
> >(might be new) task after relaunching topology. I'm asking that
> >current window/partition/windowsystem
> >is capable of.
> >
> >> IMO before we do this, we can introduce stateful exactly once processing
> >via the streams API as the first step.
> >
> >While I definitely agree that introducing stateful exactly-once processing
> >is major feature and the thing we should support sooner than later, I'm
> not
> >100% sure about which thing to do first. We're supporting at-least-once
> >stateful processing in 1.x version line and haven't provided "state
> >reshard" feature for a long time which makes actual users directly suffer,
> >so that's another major feature to me.
> >
> >- Jungtaek Lim (HeartSaVioR)
> >
> >1. http://storm.apache.org/releases/1.2.1/Command-line-client.html
> >
> >
> >2018년 2월 21일 (수) 오전 4:22, Arun Iyer <ai...@hortonworks.com>님이 작성:
> >
> >> correction: task count remains the same (executor count can vary).
> >>
> >>
> >>
> >>
> >> On 2/20/18, 11:20 AM, "Arun Iyer on behalf of Arun Mahadevan" <
> >> ai...@hortonworks.com on behalf of ar...@apache.org> wrote:
> >>
> >> >Hi Jungtaek,
> >> >
> >> >
> >> >1. Right now users need to vary the database/table name in the state
> >> provider config per topology. Agree, its better to include topology in
> the
> >> namespace.
> >> >
> >> >2. IMO, users should have the flexibility to store multiple key-values
> in
> >> the state in the core API. As of now storm only supports rebalancing the
> >> tasks (executor count remains the same). So users need not worry about
> >> re-sharding their state as long as they use the right grouping. I think
> >> this flexibility also helps us to provide useful abstractions on top.
> >> >
> >> >3. Re-sharding would be based on the keys (re-hashing). The component
> id,
> >> task-id would be used to map to the namespace. So I am not clear about
> the
> >> concern you have raised.
> >> >
> >> >We could support dynamic state re-sharding via underlying higher level
> >> abstractions (e.g. Streams API) where we control the namespaces, keys
> etc
> >> and would be more manageable. IMO before we do this, we can introduce
> >> stateful exactly once processing via the streams API as the first step.
> >> >
> >> >Thanks,
> >> >Arun
> >> >
> >> >
> >> >
> >> >On 2/19/18, 4:59 PM, "Jungtaek Lim" <kabh...@gmail.com> wrote:
> >> >
> >> >>Hi,
> >> >>
> >> >>I'd like to verify my observation on current State implementation is
> >> >>correct, so that we could fix them if necessary and make plan for
> >> >>improvement.
> >> >>
> >> >>1. State is stored with namespace prefix which typically composes to
> >> >>(component id, task id) pair and it doesn't look like having
> >> classification
> >> >>for topology. Is this correct observation? If then I think that's
> worth
> >> to
> >> >>call it as 'critical' and it must be fixed.
> >> >>
> >> >>2. We're allowing end-users to put key of state, and also no
> restriction
> >> >>for grouping on stateful component. I feel such flexibility breaks the
> >> >>possibility to reshard state and end-users are required to implement
> >> their
> >> >>own reshard tool according to their topology state key distribution
> >> logic.
> >> >>I expect it will not happen on streams API (since it should be done
> with
> >> >>keyed stream) but wouldn't it better to also restrict such flexibility
> >> also
> >> >>for core API?
> >> >>
> >> >>3. Suppose we are going to support state resharding (for allowing
> change
> >> of
> >> >>parallelism) and we restrict to apply field grouping with key while
> >> >>connecting stateful component.
> >> >>Then key-value can be moved based on key (though finding and replacing
> >> task
> >> >>id may not be trivial if component name has '-'... we have same issue
> on
> >> >>metric name, so maybe time to restrict characters on topology name as
> >> well
> >> >>as component name?).
> >> >>Is it also true for window/partition/windowsystem state? I didn't
> take a
> >> >>deep look on window state (I would find a time) but it would be great
> if
> >> >>someone knowing the detail makes it clear.
> >> >>
> >> >>Thanks in advance,
> >> >>Jungtaek Lim (HeartSaVioR)
> >> >
> >>
>
>

Reply via email to