FYI, the PR for this is up at https://github.com/apache/metron/pull/940
For those interested, please comment on the actual implementation there.
On Thu, Feb 22, 2018 at 12:43 PM, Casey Stella <ceste...@gmail.com> wrote:
> So, these are good questions, as usual Otto :)
> > how does this affect the distribution of work through the cluster, and
> > resiliency of the topologies?
> This moves us to a data parallelism scheme rather than a task parallelism
> scheme. This means, in effect, that we will not be distributing the
> partial enrichments across the network for a given message, but rather
> distributing the messages across the network for *full* enrichment. So,
> the bundle of work is the same, but we're not concentrating capabilities in
> specific workers. Then again, as soon as we moved to stellar enrichments
> and sub-groups where you can interact with hbase or geo from within
> stellar, we sorta abandoned specialization. Resiliency shouldn't be
> affected and, indeed, it should be easier to reason about. We ack after
> every bolt in the new scheme rather than avoid acking until we join and ack
> the original tuple. In fact, I'm still not convinced there's not a bug
> somewhere in that join bolt that makes it so we don't ack the right tuple.
> > Is anyone else doing it like this?
> The stormy way of doing this is to specialize in the bolts and join, no
> doubt, in a fan-out/fan-in pattern. I do not think it's unheard of,
> though, to use a threadpool. It's slightly peculiar inasmuch as storm has
> its own threading model, but it is an embarrassingly parallel task and the
> main shift is changing the unit of parallelism from enrichment task to
> message, with the gain of fewer network hops. That being said, as long as
> you're not emitting from a different thread than you are receiving from,
> there's no technical limitation.
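To make the threading point concrete, here is a minimal sketch in plain Java. It is not the code in the PR. It assumes a hypothetical ParallelEnricher class, uses two lists as a stand-in for Storm's output collector, and the field names are made up. The pool threads only compute enrichments; the thread that received the message collects the results and does the emit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names are Metron or Storm APIs.
class ParallelEnricher {
  private final ExecutorService pool;
  // Stand-ins for an output collector: what was emitted, and by which thread.
  final List<Map<String, Object>> emitted = new ArrayList<>();
  final List<String> emittingThreads = new ArrayList<>();

  ParallelEnricher(ExecutorService pool) {
    this.pool = pool;
  }

  // Runs on the receiving thread, like a bolt's execute() would.
  void onMessage(Map<String, Object> message,
                 List<Callable<Map<String, Object>>> enrichments) {
    Map<String, Object> enriched = new HashMap<>(message);
    try {
      // Fan out: each enrichment for this one message runs on the pool.
      // invokeAll blocks the receiving thread until every task is done.
      List<Future<Map<String, Object>>> results = pool.invokeAll(enrichments);
      // Fan in: merge the partial results on the receiving thread.
      for (Future<Map<String, Object>> result : results) {
        enriched.putAll(result.get());
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while enriching", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("enrichment failed", e.getCause());
    }
    // Emit (and, in a real bolt, ack) from the receiving thread only.
    emitted.add(enriched);
    emittingThreads.add(Thread.currentThread().getName());
  }

  // Helper that builds a one-field result map.
  static Map<String, Object> field(String key, Object value) {
    Map<String, Object> out = new HashMap<>();
    out.put(key, value);
    return out;
  }

  // Small driver: enrich one message with two made-up enrichments.
  static ParallelEnricher demo() {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      ParallelEnricher enricher = new ParallelEnricher(pool);
      Callable<Map<String, Object>> geo = () -> field("geo.country", "US");
      Callable<Map<String, Object>> host = () -> field("host.known", true);
      enricher.onMessage(field("ip_src_addr", "10.0.0.1"), Arrays.asList(geo, host));
      return enricher;
    } finally {
      pool.shutdown();
    }
  }
}
```

The unit of work handed to the bolt is the whole message, and the join happens in memory on the receiving thread, so there is no separate join bolt or cache in this sketch.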
> > Can we have multiple thread pools and group tasks together ( or separate
> > them ) wrt hbase?
> We could, but I think we might consider starting with just a simple static
> threadpool that we configure at the topology level (e.g. multiple worker
> threads can share the same threadpool that we can configure). I think as
> the trend of moving everything to stellar continues, we may end up in a
> situation where we don't have a coherent or clear way to differentiate
> between thread pools like we do now.
> > Also, how are we to measure the effect?
> Well, some of the benefits here are at an architectural/feature level, the
> most exciting of which is that this approach opens up avenues for stellar
> subgroups to depend on each other. Slightly less exciting, but still nice
> is the fact that this normalizes us with *other* streaming technologies and
> the decoupling work done as part of the PR (soon to be released) will make
> it easy to transition if we so desire. Beyond that, for performance,
> someone will have to run some performance tests or try it out in a
> situation where they're having some enrichment performance issues. Until
> we do that, I think we should probably just keep it as a parallel approach
> that you can swap out if you so desire.
> On Thu, Feb 22, 2018 at 11:48 AM, Otto Fowler <ottobackwa...@gmail.com> wrote:
>> This sounds worth exploring. A couple of questions:
>> * how does this affect the distribution of work through the cluster, and
>> resiliency of the topologies?
>> * Is anyone else doing it like this?
>> * Can we have multiple thread pools and group tasks together ( or
>> separate them ) wrt hbase?
>> On February 22, 2018 at 11:32:39, Casey Stella (ceste...@gmail.com) wrote:
>> Hi all,
>> I've been thinking and working on something that I wanted to get some
>> feedback on. The way that we do our enrichments, the split/join
>> architecture, was created to effectively parallelize enrichments in a
>> storm-like way, in contrast to OpenSoc.
>> There are some good parts to this architecture:
>> - It works, enrichments are done in parallel
>> - You can tune individual enrichments differently
>> - It's very storm-like
>> There are also some deficiencies:
>> - It's hard to reason about
>>   - Understanding the latency of enriching a message requires looking
>>     at multiple bolts that each give summary statistics
>>   - The join bolt's cache is really hard to reason about when performance
>>     tuning
>>     - During spikes in traffic, you can overload the join bolt's cache
>>       and drop messages if you aren't careful
>>     - In general, it's hard to associate a cache size and a duration kept
>>       in cache with throughput and latency
>> - There are a lot of network hops per message
>> - Right now we are stuck at 2 stages of transformations being done
>>   (enrichment and threat intel). It's very possible that you might want
>>   stellar enrichments to depend on the output of other stellar enrichments.
>>   In order to implement this in split/join you'd have to create a cycle in
>>   the storm topology
>> I propose a change. I propose that we move to a model where we do
>> enrichments in a single bolt in parallel using a static threadpool (e.g.
>> multiple workers in the same process would share the threadpool). In all
>> other ways, this would be backwards compatible. A transparent drop-in for
>> the existing enrichment topology.
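A rough sketch of what a "static threadpool" could look like, for anyone who wants something concrete to poke at. This is an assumption about the shape, not the PR's code: SharedEnrichmentPool and UnifiedBoltSketch are hypothetical names, and the thread count is assumed to come from topology-level configuration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical holder: one pool per JVM, shared by every bolt instance in it.
final class SharedEnrichmentPool {
  private static ExecutorService pool;

  private SharedEnrichmentPool() {}

  // The first caller in the process creates the pool; later callers reuse it.
  // The size passed by later callers is ignored in this sketch.
  static synchronized ExecutorService get(int numThreads) {
    if (pool == null) {
      pool = Executors.newFixedThreadPool(numThreads);
    }
    return pool;
  }

  // Lets a process tear the pool down, for example on topology shutdown.
  static synchronized void shutdown() {
    if (pool != null) {
      pool.shutdown();
      pool = null;
    }
  }
}

// Hypothetical stand-in for a bolt; the constructor plays the role of prepare().
class UnifiedBoltSketch {
  final ExecutorService executor;

  UnifiedBoltSketch(int configuredThreads) {
    // configuredThreads is assumed to be read from topology configuration.
    this.executor = SharedEnrichmentPool.get(configuredThreads);
  }
}
```

Because the pool is static, adding executors to the bolt in the same worker process does not add enrichment threads; the one tuning knob is the pool size.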
>> There are some pros/cons about this too:
>> - Pro
>>   - Easier to reason about from an individual message perspective
>>   - Architecturally decoupled from Storm
>>     - This sets us up if we want to consider other streaming
>>       technologies
>>   - Fewer bolts
>>     - spout -> enrichment bolt -> threatintel bolt -> output bolt
>>   - Way fewer network hops per message
>>     - currently 2n+1 where n is the number of enrichments used (if
>>       using stellar subgroups, each subgroup is a hop)
>>   - Easier to reason about from a performance perspective
>>     - We trade cache size and eviction timeout for threadpool size
>>   - We set ourselves up to have stellar subgroups with dependencies
>>     - i.e. stellar subgroups that depend on the output of other
>>       stellar subgroups
>>     - If we do this, we can shrink the topology to just spout ->
>>       enrichment/threat intel -> output
>> - Con
>>   - We can no longer tune stellar enrichments independently of HBase
>>     - To be fair, with enrichments moving to stellar, this is the case
>>       in the split/join approach too
>>   - No idea about performance
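The "subgroups with dependencies" item is the one that is hardest to picture in split/join, so here is a small sketch of the idea in plain Java. It is an illustration only: DependentGroups, the group names, and the fields are all hypothetical, and it assumes groups are simply evaluated in a declared order inside one bolt.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; group names and fields are made up for illustration.
class DependentGroups {
  // Runs groups in insertion order. Each group receives the message plus
  // every field produced by the groups that ran before it.
  static Map<String, Object> enrich(
      Map<String, Object> message,
      LinkedHashMap<String, Function<Map<String, Object>, Map<String, Object>>> groups) {
    Map<String, Object> current = new HashMap<>(message);
    for (Function<Map<String, Object>, Map<String, Object>> group : groups.values()) {
      current.putAll(group.apply(current));
    }
    return current;
  }

  // Two groups where the second reads a field written by the first.
  static Map<String, Object> demo() {
    LinkedHashMap<String, Function<Map<String, Object>, Map<String, Object>>> groups =
        new LinkedHashMap<>();
    groups.put("classify", m -> {
      Map<String, Object> out = new HashMap<>();
      out.put("is_internal", String.valueOf(m.get("ip_src_addr")).startsWith("10."));
      return out;
    });
    groups.put("score", m -> {
      Map<String, Object> out = new HashMap<>();
      out.put("risk", Boolean.TRUE.equals(m.get("is_internal")) ? "low" : "high");
      return out;
    });
    Map<String, Object> message = new HashMap<>();
    message.put("ip_src_addr", "10.0.0.1");
    return enrich(message, groups);
  }
}
```

With everything in one bolt this is just a loop; in split/join the same dependency would need a tuple to travel back to an earlier stage, which is the cycle mentioned above.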
>> What I propose is to submit a PR that will deliver an alternative,
>> completely backwards compatible topology for enrichment that you can use
>> by adjusting the start_enrichment_topology.sh script to use
>> remote-unified.yaml instead of remote.yaml. If we live with it for a
>> while and have some good experiences with it, maybe we can consider
>> retiring the old enrichment topology.
>> Thoughts? Keep me honest; if I have over or understated the issues for
>> split/join or missed some important architectural issue let me know. I'm
>> going to submit a PR to this effect by the EOD today so things will be