> 1. Right now users need to vary the database/table name in the state
provider config per topology.

Redis Cluster doesn't support multi database and it is uncommon usage for
Redis to use even tens of databases, so end users can't do the needed thing
with Redis state backend. For HBase state backend they can do, but creating
table requires administrator permission of HBase, which someone may not
want to open to multiple users. So I feel including topology name in
namespace looks like "mandatory" rather than "good to have".

> 2. As of now storm only supports rebalancing the tasks (task count
remains the same).

That is not correct. We are still missed to provide the feature in UI (I'm
not familiar with front-end so I had in mind but I didn't work on that) but
you can rebalance with different parallelism of component via cli.

Below is the rebalance command provided via cli. Quoting command line
client document[1]:

rebalance
Syntax: storm rebalance topology-name [-w wait-time-secs] [-n
new-num-workers] [-e component=parallelism]*

And we're sure that elasticity is "the thing" to provide, and for
elasticity we may even want to adjust parallelism on the fly (ideally,
would not that easy).

> IMO, users should have the flexibility to store multiple key-values in
the state in the core API. ... I think this flexibility also helps us to
provide useful abstractions on top.

I agree most beneficial aspect from core API is flexibility and less
restriction. I'm just worrying about state flexibility which should be
handled with caution. For example, without proper restriction (especially
grouping) duplicated keys in different tasks can happen, and even some end
users are brave to create their own state reshard tool, they need to deal
with merging the values for same key. Most end users may not recognize the
situation and hard to deal with, which doesn't seem friendly.

I feel we need to at least document recommended usage (selecting key, using
field grouping with selected key, selecting same key as key of state KV)
with notice/caution effects for other usages, and also need to provide
state reshard tool for such usage. The tool can be used for states handled
with streams API.

> 3. Re-sharding would be based on the keys (re-hashing). The component id,
task-id would be used to map to the namespace. So I am not clear about the
concern you have raised.

Yes resharding would be based on the keys, which means keys should be moved
to different task id. So to move the key in state backend, we need to parse
and modify the namespace part, and we use delimiter which is allowed to
topology name as well as component name. (I don't think namespace is badly
composed. The problem is not restricting topology name and component name.
We are allowing too many flexibility and being stuck from that.)

Windowed state is another thing we should deal with. I still didn't have
time to deep dive with, but to support resharding, window (and effectively
any states) should be grouped by key so that it can be placed to right
(might be new) task after relaunching topology. I'm asking that
current window/partition/windowsystem
is capable of.

> IMO before we do this, we can introduce stateful exactly once processing
via the streams API as the first step.

While I definitely agree that introducing stateful exactly-once processing
is major feature and the thing we should support sooner than later, I'm not
100% sure about which thing to do first. We're supporting at-least-once
stateful processing in 1.x version line and haven't provided "state
reshard" feature for a long time which makes actual users directly suffer,
so that's another major feature to me.

- Jungtaek Lim (HeartSaVioR)

1. http://storm.apache.org/releases/1.2.1/Command-line-client.html


2018년 2월 21일 (수) 오전 4:22, Arun Iyer <[email protected]>님이 작성:

> correction: task count remains the same (executor count can vary).
>
>
>
>
> On 2/20/18, 11:20 AM, "Arun Iyer on behalf of Arun Mahadevan" <
> [email protected] on behalf of [email protected]> wrote:
>
> >Hi Jungtaek,
> >
> >
> >1. Right now users need to vary the database/table name in the state
> provider config per topology. Agree, its better to include topology in the
> namespace.
> >
> >2. IMO, users should have the flexibility to store multiple key-values in
> the state in the core API. As of now storm only supports rebalancing the
> tasks (executor count remains the same). So users need not worry about
> re-sharding their state as long as they use the right grouping. I think
> this flexibility also helps us to provide useful abstractions on top.
> >
> >3. Re-sharding would be based on the keys (re-hashing). The component id,
> task-id would be used to map to the namespace. So I am not clear about the
> concern you have raised.
> >
> >We could support dynamic state re-sharding via underlying higher level
> abstractions (e.g. Streams API) where we control the namespaces, keys etc
> and would be more manageable. IMO before we do this, we can introduce
> stateful exactly once processing via the streams API as the first step.
> >
> >Thanks,
> >Arun
> >
> >
> >
> >On 2/19/18, 4:59 PM, "Jungtaek Lim" <[email protected]> wrote:
> >
> >>Hi,
> >>
> >>I'd like to verify my observation on current State implementation is
> >>correct, so that we could fix them if necessary and make plan for
> >>improvement.
> >>
> >>1. State is stored with namespace prefix which typically composes to
> >>(component id, task id) pair and it doesn't look like having
> classification
> >>for topology. Is this correct observation? If then I think that's worth
> to
> >>call it as 'critical' and it must be fixed.
> >>
> >>2. We're allowing end-users to put key of state, and also no restriction
> >>for grouping on stateful component. I feel such flexibility breaks the
> >>possibility to reshard state and end-users are required to implement
> their
> >>own reshard tool according to their topology state key distribution
> logic.
> >>I expect it will not happen on streams API (since it should be done with
> >>keyed stream) but wouldn't it better to also restrict such flexibility
> also
> >>for core API?
> >>
> >>3. Suppose we are going to support state resharding (for allowing change
> of
> >>parallelism) and we restrict to apply field grouping with key while
> >>connecting stateful component.
> >>Then key-value can be moved based on key (though finding and replacing
> task
> >>id may not be trivial if component name has '-'... we have same issue on
> >>metric name, so maybe time to restrict characters on topology name as
> well
> >>as component name?).
> >>Is it also true for window/partition/windowsystem state? I didn't take a
> >>deep look on window state (I would find a time) but it would be great if
> >>someone knowing the detail makes it clear.
> >>
> >>Thanks in advance,
> >>Jungtaek Lim (HeartSaVioR)
> >
>

Reply via email to