Adding to what Priyank said, other cases like, rebalance allows to add or
reduce no of workers based on their usecase and no of executors can be
added/reduced accordingly.

~Satish.

On Wed, Feb 21, 2018 at 6:54 AM, Priyank Shah <[email protected]> wrote:

> Jungtaek,
>
> I read about rebalancing to increase the parallelism a long time back.
> Trying to remember and see if it answers your question of "N tasks 1
> executor". I think the use case for rebalancing parallelism was for users
> who are anticipating an increase in hardware capacity like adding cores to
> their worker node. In this case they can leverage these extra cores by
> rebalancing with higher number of executors instead of killing and
> re-deploying the topology. However, if they had the number of tasks =
> number of executors when they first submitted the topology they would not
> be able to leverage that since the precondition is "number of tasks >=
> number of executors".
>
> On 2/20/18, 4:40 PM, "Jungtaek Lim" <[email protected]> wrote:
>
>     > You can only change the parallelism (number of executors). AFAIK
> there is
>     no option to change the number of tasks and maybe I am missing if it
>     changed recently.
>
>     Yeah looks like I misunderstand the feature. It addresses number of
>     executors.
>
>     I'm wondering the benefit of having N tasks : 1 executor. It even
> makes me
>     confusing continuously, and I don't find beneficial use cases for that.
>     Most of things would be IO bound processing, and should be covered with
>     asynchronous processing. I hope we just simplify this one.
>
>     Another perspective of rebalance: suppose end users provide parallelism
>     bigger than current task count. If task count is not changed,
> rebalance is
>     not working as expected from users. If task count is changed, we are
>     breaking precondition of above statement.
>
>     We should take into consideration that end users modify topology code
> and
>     resubmit. In this case task count can be (and should be) changed, and
> we
>     should support state resharding in this case. Without resharding each
> task
>     fails to load all the KVs for keys routing to the task. It doesn't
> seem to
>     the issue about whether supporting dynamic task change or not.
>
>     - Jungtaek Lim (HeartSaVioR)
>
>     ps. btw, internal structure of Storm topology is not capable of scaling
>     task counts without restructuring since task id is increased
> sequentially
>     across components. I suspect it would block elasticity so that should
> be
>     also the thing to address.
>
>     2018년 2월 21일 (수) 오전 8:33, Arun Mahadevan <[email protected]>님이 작성:
>
>     > >>
>     > >>1. Right now users need to vary the database/table name in the
> state
>     > >provider config per topology.
>     > >
>     > >Redis Cluster doesn't support multi database and it is uncommon
> usage for
>     > >Redis to use even tens of databases, so end users can't do the
> needed
>     > thing
>     > >with Redis state backend. For HBase state backend they can do, but
>     > creating
>     > >table requires administrator permission of HBase, which someone may
> not
>     > >want to open to multiple users. So I feel including topology name in
>     > >namespace looks like "mandatory" rather than "good to have".
>     >
>     >
>     > I think this should be a fairly straightforward change and we should
> do it.
>     >
>     > >>
>     > >>2. As of now storm only supports rebalancing the tasks (task count
>     > >remains the same).
>     > >
>     > >That is not correct. We are still missed to provide the feature in
> UI (I'm
>     > >not familiar with front-end so I had in mind but I didn't work on
> that)
>     > but
>     > >you can rebalance with different parallelism of component via cli.
>     > >
>     > >Below is the rebalance command provided via cli. Quoting command
> line
>     > >client document[1]:
>     > >
>     > >rebalance
>     > >Syntax: storm rebalance topology-name [-w wait-time-secs] [-n
>     > >new-num-workers] [-e component=parallelism]*
>     >
>     >
>     > You can only change the parallelism (number of executors). AFAIK
> there is
>     > no option to change the number of tasks and maybe I am missing if it
>     > changed recently. As long as the number of tasks are fixed, we don’t
> have
>     > to worry about re-sharding the keys since the namespaces are at task
> level.
>     > Autoscaling may or may not change this design (we may only allow
> increasing
>     > or reducing the parallelism and not the tasks) and it requires more
> thought
>     > on how it would affect the user managed state at core spout/bolt
> level
>     > (outside of the higher level abstractions) and so IMO, we can tackle
> it
>     > later.
>     >
>     > Thanks,
>     > Arun
>     >
>     >
>     >
>     > On 2/20/18, 2:57 PM, "Jungtaek Lim" <[email protected]> wrote:
>     >
>     > >> 1. Right now users need to vary the database/table name in the
> state
>     > >provider config per topology.
>     > >
>     > >Redis Cluster doesn't support multi database and it is uncommon
> usage for
>     > >Redis to use even tens of databases, so end users can't do the
> needed
>     > thing
>     > >with Redis state backend. For HBase state backend they can do, but
>     > creating
>     > >table requires administrator permission of HBase, which someone may
> not
>     > >want to open to multiple users. So I feel including topology name in
>     > >namespace looks like "mandatory" rather than "good to have".
>     > >
>     > >> 2. As of now storm only supports rebalancing the tasks (task count
>     > >remains the same).
>     > >
>     > >That is not correct. We are still missed to provide the feature in
> UI (I'm
>     > >not familiar with front-end so I had in mind but I didn't work on
> that)
>     > but
>     > >you can rebalance with different parallelism of component via cli.
>     > >
>     > >Below is the rebalance command provided via cli. Quoting command
> line
>     > >client document[1]:
>     > >
>     > >rebalance
>     > >Syntax: storm rebalance topology-name [-w wait-time-secs] [-n
>     > >new-num-workers] [-e component=parallelism]*
>     > >
>     > >And we're sure that elasticity is "the thing" to provide, and for
>     > >elasticity we may even want to adjust parallelism on the fly
> (ideally,
>     > >would not that easy).
>     > >
>     > >> IMO, users should have the flexibility to store multiple
> key-values in
>     > >the state in the core API. ... I think this flexibility also helps
> us to
>     > >provide useful abstractions on top.
>     > >
>     > >I agree most beneficial aspect from core API is flexibility and less
>     > >restriction. I'm just worrying about state flexibility which should
> be
>     > >handled with caution. For example, without proper restriction
> (especially
>     > >grouping) duplicated keys in different tasks can happen, and even
> some end
>     > >users are brave to create their own state reshard tool, they need
> to deal
>     > >with merging the values for same key. Most end users may not
> recognize the
>     > >situation and hard to deal with, which doesn't seem friendly.
>     > >
>     > >I feel we need to at least document recommended usage (selecting
> key,
>     > using
>     > >field grouping with selected key, selecting same key as key of
> state KV)
>     > >with notice/caution effects for other usages, and also need to
> provide
>     > >state reshard tool for such usage. The tool can be used for states
> handled
>     > >with streams API.
>     > >
>     > >> 3. Re-sharding would be based on the keys (re-hashing). The
> component
>     > id,
>     > >task-id would be used to map to the namespace. So I am not clear
> about the
>     > >concern you have raised.
>     > >
>     > >Yes resharding would be based on the keys, which means keys should
> be
>     > moved
>     > >to different task id. So to move the key in state backend, we need
> to
>     > parse
>     > >and modify the namespace part, and we use delimiter which is
> allowed to
>     > >topology name as well as component name. (I don't think namespace
> is badly
>     > >composed. The problem is not restricting topology name and
> component name.
>     > >We are allowing too many flexibility and being stuck from that.)
>     > >
>     > >Windowed state is another thing we should deal with. I still didn't
> have
>     > >time to deep dive with, but to support resharding, window (and
> effectively
>     > >any states) should be grouped by key so that it can be placed to
> right
>     > >(might be new) task after relaunching topology. I'm asking that
>     > >current window/partition/windowsystem
>     > >is capable of.
>     > >
>     > >> IMO before we do this, we can introduce stateful exactly once
> processing
>     > >via the streams API as the first step.
>     > >
>     > >While I definitely agree that introducing stateful exactly-once
> processing
>     > >is major feature and the thing we should support sooner than later,
> I'm
>     > not
>     > >100% sure about which thing to do first. We're supporting
> at-least-once
>     > >stateful processing in 1.x version line and haven't provided "state
>     > >reshard" feature for a long time which makes actual users directly
> suffer,
>     > >so that's another major feature to me.
>     > >
>     > >- Jungtaek Lim (HeartSaVioR)
>     > >
>     > >1. http://storm.apache.org/releases/1.2.1/Command-line-client.html
>     > >
>     > >
>     > >2018년 2월 21일 (수) 오전 4:22, Arun Iyer <[email protected]>님이 작성:
>     > >
>     > >> correction: task count remains the same (executor count can vary).
>     > >>
>     > >>
>     > >>
>     > >>
>     > >> On 2/20/18, 11:20 AM, "Arun Iyer on behalf of Arun Mahadevan" <
>     > >> [email protected] on behalf of [email protected]> wrote:
>     > >>
>     > >> >Hi Jungtaek,
>     > >> >
>     > >> >
>     > >> >1. Right now users need to vary the database/table name in the
> state
>     > >> provider config per topology. Agree, its better to include
> topology in
>     > the
>     > >> namespace.
>     > >> >
>     > >> >2. IMO, users should have the flexibility to store multiple
> key-values
>     > in
>     > >> the state in the core API. As of now storm only supports
> rebalancing the
>     > >> tasks (executor count remains the same). So users need not worry
> about
>     > >> re-sharding their state as long as they use the right grouping. I
> think
>     > >> this flexibility also helps us to provide useful abstractions on
> top.
>     > >> >
>     > >> >3. Re-sharding would be based on the keys (re-hashing). The
> component
>     > id,
>     > >> task-id would be used to map to the namespace. So I am not clear
> about
>     > the
>     > >> concern you have raised.
>     > >> >
>     > >> >We could support dynamic state re-sharding via underlying higher
> level
>     > >> abstractions (e.g. Streams API) where we control the namespaces,
> keys
>     > etc
>     > >> and would be more manageable. IMO before we do this, we can
> introduce
>     > >> stateful exactly once processing via the streams API as the first
> step.
>     > >> >
>     > >> >Thanks,
>     > >> >Arun
>     > >> >
>     > >> >
>     > >> >
>     > >> >On 2/19/18, 4:59 PM, "Jungtaek Lim" <[email protected]> wrote:
>     > >> >
>     > >> >>Hi,
>     > >> >>
>     > >> >>I'd like to verify my observation on current State
> implementation is
>     > >> >>correct, so that we could fix them if necessary and make plan
> for
>     > >> >>improvement.
>     > >> >>
>     > >> >>1. State is stored with namespace prefix which typically
> composes to
>     > >> >>(component id, task id) pair and it doesn't look like having
>     > >> classification
>     > >> >>for topology. Is this correct observation? If then I think
> that's
>     > worth
>     > >> to
>     > >> >>call it as 'critical' and it must be fixed.
>     > >> >>
>     > >> >>2. We're allowing end-users to put key of state, and also no
>     > restriction
>     > >> >>for grouping on stateful component. I feel such flexibility
> breaks the
>     > >> >>possibility to reshard state and end-users are required to
> implement
>     > >> their
>     > >> >>own reshard tool according to their topology state key
> distribution
>     > >> logic.
>     > >> >>I expect it will not happen on streams API (since it should be
> done
>     > with
>     > >> >>keyed stream) but wouldn't it better to also restrict such
> flexibility
>     > >> also
>     > >> >>for core API?
>     > >> >>
>     > >> >>3. Suppose we are going to support state resharding (for
> allowing
>     > change
>     > >> of
>     > >> >>parallelism) and we restrict to apply field grouping with key
> while
>     > >> >>connecting stateful component.
>     > >> >>Then key-value can be moved based on key (though finding and
> replacing
>     > >> task
>     > >> >>id may not be trivial if component name has '-'... we have same
> issue
>     > on
>     > >> >>metric name, so maybe time to restrict characters on topology
> name as
>     > >> well
>     > >> >>as component name?).
>     > >> >>Is it also true for window/partition/windowsystem state? I
> didn't
>     > take a
>     > >> >>deep look on window state (I would find a time) but it would be
> great
>     > if
>     > >> >>someone knowing the detail makes it clear.
>     > >> >>
>     > >> >>Thanks in advance,
>     > >> >>Jungtaek Lim (HeartSaVioR)
>     > >> >
>     > >>
>     >
>     >
>
>
>

Reply via email to