This sounds worth exploring. A couple of questions:
* How does this affect the distribution of work through the cluster, and
the resiliency of the topologies?
* Is anyone else doing it like this?
* Can we have multiple thread pools and group tasks together (or separate
them) with respect to HBase?
On February 22, 2018 at 11:32:39, Casey Stella (ceste...@gmail.com) wrote:
I've been thinking and working on something that I wanted to get some
feedback on. The way that we do our enrichments, the split/join
architecture, was created to parallelize enrichments in a Storm-like way,
in contrast to OpenSoc.
There are some good parts to this architecture:
- It works, enrichments are done in parallel
- You can tune individual enrichments differently
- It's very storm-like
There are also some deficiencies:
- It's hard to reason about
- Understanding the latency of enriching a message requires looking
at multiple bolts that each give summary statistics
- The join bolt's cache is really hard to reason about when performance
tuning
- During spikes in traffic, you can overload the join bolt's cache
and drop messages if you aren't careful
- In general, it's hard to associate a cache size and a duration kept
in cache with throughput and latency
- There are a lot of network hops per message
- Right now we are stuck at 2 stages of transformations being done
(enrichment and threat intel). It's very possible that you might want
stellar enrichments to depend on the output of other stellar enrichments.
In order to implement this in split/join you'd have to create a cycle in
the storm topology
I propose a change. I propose that we move to a model where we do
enrichments in a single bolt in parallel using a static threadpool (e.g.
multiple workers in the same process would share the threadpool). In all
other ways, this would be backwards compatible: a transparent drop-in for
the existing enrichment topology.
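To make the idea concrete, here is a minimal sketch, not the code in the
PR. It assumes each enrichment can be modeled as a function from the
message to a map of new fields. The class name UnifiedEnrichmentSketch,
the enrich method, and the pool size of 4 are all hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

public class UnifiedEnrichmentSketch {
    // Static, so every bolt instance in the same JVM shares one pool.
    // The size (4) is an arbitrary placeholder, not a recommendation.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r, "enrichment-worker");
        t.setDaemon(true); // do not keep the JVM alive for idle workers
        return t;
    });

    // Runs every enrichment against the same message in parallel, then
    // merges the results in-process (this replaces the join bolt).
    public static Map<String, Object> enrich(
            Map<String, Object> message,
            List<Function<Map<String, Object>, Map<String, Object>>> enrichments) {
        // Enrichments only read the message, so hand them a read-only copy.
        final Map<String, Object> readOnly =
                Collections.unmodifiableMap(new HashMap<>(message));

        // Fan out: one task per enrichment on the shared pool.
        List<CompletableFuture<Map<String, Object>>> futures = enrichments.stream()
                .map(e -> CompletableFuture.supplyAsync(() -> e.apply(readOnly), POOL))
                .collect(Collectors.toList());

        // Join: block until each task finishes and fold its fields in.
        Map<String, Object> result = new HashMap<>(message);
        for (CompletableFuture<Map<String, Object>> f : futures) {
            result.putAll(f.join());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> message = new HashMap<>();
        message.put("ip_src_addr", "10.0.0.1");

        // Two made-up enrichments standing in for geo and host lookups.
        Function<Map<String, Object>, Map<String, Object>> geo =
                m -> Collections.singletonMap("geo", "US");
        Function<Map<String, Object>, Map<String, Object>> host =
                m -> Collections.singletonMap("host", "h-" + m.get("ip_src_addr"));

        System.out.println(enrich(message, Arrays.asList(geo, host)));
    }
}
```

Because the join is just a wait on the futures inside one call, there is
no join cache to size in this sketch; the threadpool size is the main
knob.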
There are some pros/cons about this too:
- Easier to reason about from an individual message perspective
- Architecturally decoupled from Storm
- This sets us up if we want to consider other streaming technologies
- Fewer bolts
- spout -> enrichment bolt -> threatintel bolt -> output bolt
- Way fewer network hops per message
- currently 2n+1 where n is the number of enrichments used (if
using stellar subgroups, each subgroup is a hop)
- Easier to reason about from a performance perspective
- We trade cache size and eviction timeout for threadpool size
- We set ourselves up to have stellar subgroups with dependencies
- i.e. stellar subgroups that depend on the output of other stellar
subgroups
- If we do this, we can shrink the topology to just spout ->
enrichment/threat intel -> output
- We can no longer tune stellar enrichments independent from HBase
- To be fair, with enrichments moving to stellar, this is the case
in the split/join approach too
- No idea about performance
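The point above about stellar subgroups depending on the output of other
subgroups can be sketched as ordered stages: enrichments inside a stage
run in parallel, and each stage sees the fields added by the stages
before it. This is only an illustration under that assumption; the
StagedEnrichmentSketch class, the Enrichment interface, and the pool size
are hypothetical and not part of the existing code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

public class StagedEnrichmentSketch {
    // Hypothetical shorthand: an enrichment maps a message to new fields.
    public interface Enrichment
            extends Function<Map<String, Object>, Map<String, Object>> {}

    // Shared static pool, as in the single-bolt proposal; size is arbitrary.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r, "staged-enrichment-worker");
        t.setDaemon(true);
        return t;
    });

    // Stages run one after another; enrichments within a stage run in
    // parallel against a snapshot that includes all earlier stages' output.
    public static Map<String, Object> enrichInStages(
            Map<String, Object> message, List<List<Enrichment>> stages) {
        Map<String, Object> current = new HashMap<>(message);
        for (List<Enrichment> stage : stages) {
            final Map<String, Object> snapshot =
                    Collections.unmodifiableMap(new HashMap<>(current));
            List<CompletableFuture<Map<String, Object>>> futures = new ArrayList<>();
            for (Enrichment e : stage) {
                futures.add(CompletableFuture.supplyAsync(() -> e.apply(snapshot), POOL));
            }
            // Wait for the whole stage before the next one starts.
            for (CompletableFuture<Map<String, Object>> f : futures) {
                current.putAll(f.join());
            }
        }
        return current;
    }

    public static void main(String[] args) {
        Enrichment first = m -> Collections.singletonMap("a", 1);
        // Depends on the "a" field produced by the first stage.
        Enrichment second = m -> Collections.singletonMap("b", ((Integer) m.get("a")) + 1);

        Map<String, Object> out = enrichInStages(
                new HashMap<>(),
                Arrays.asList(Arrays.asList(first), Arrays.asList(second)));
        System.out.println(out);
    }
}
```

No cycle is needed here because the dependency is expressed as a loop
inside one process rather than as an edge in the topology.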
What I propose is to submit a PR that will deliver an alternative,
completely backwards compatible topology for enrichment that you can use by
adjusting the start_enrichment_topology.sh script to use
remote-unified.yaml instead of remote.yaml. If we live with it for a while
and have some good experiences with it, maybe we can consider retiring the
old enrichment topology.
Thoughts? Keep me honest; if I have over- or understated the issues for
split/join or missed some important architectural issue, let me know. I'm
going to submit a PR to this effect by the EOD today so things will be more