I do want to say here that I don't mean to sound the alarm and claim that everything is broken. I would not characterize the topology as architecturally "broken"; rather, the lack of reporting when things go pear-shaped is an implementation bug. With logging and documentation of the knobs to tune, I believe this architecture works.
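To make the logging half of that concrete: below is a minimal sketch of what surfacing join-cache evictions could look like, assuming the join bolt keeps its in-flight, partially joined messages in a Guava cache. The class name, the PartialJoin placeholder, and the parameter names are hypothetical, not Metron's actual code:

    import java.util.concurrent.TimeUnit;

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.RemovalCause;
    import com.google.common.cache.RemovalListener;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class JoinCacheSketch {
      private static final Logger LOG = LoggerFactory.getLogger(JoinCacheSketch.class);

      // Hypothetical stand-in for a partially joined message.
      static class PartialJoin { }

      public static Cache<String, PartialJoin> buildJoinCache(long maxSize, long expiryMs) {
        RemovalListener<String, PartialJoin> listener = n -> {
          // A SIZE eviction means we dropped a message before all of its
          // enrichments arrived; say so loudly instead of failing silently.
          if (n.getCause() == RemovalCause.SIZE) {
            LOG.warn("Join cache evicted partially joined message {}; "
                + "consider a larger cache or more join bolt parallelism.", n.getKey());
          }
        };
        return CacheBuilder.newBuilder()
            .maximumSize(maxSize)                               // knob: tuples in flight at once
            .expireAfterWrite(expiryMs, TimeUnit.MILLISECONDS)  // knob: slowest enrichment latency
            .removalListener(listener)
            .build();
      }
    }

Passing maxSize and expiryMs in from the properties file rather than hard-coding them in flux would cover the "promote this to the properties file" item further down the thread.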
On Tue, May 16, 2017 at 12:09 PM, Casey Stella <ceste...@gmail.com> wrote:

> We could definitely parallelize within the bolt, but you're right, it does
> break the storm model. I also like making things other people's problems
> (it's called working "smart" not "hard", right? not laziness, surely. ;),
> but yeah, using windowing for this seems like it might introduce some
> artificial latency. It's also not going to eliminate the problem, but
> rather just make the knob to tweak things have a different characteristic.
> Whereas before we had knobs around how many messages, now it's a knob
> around how long an enrichment is going to take maximally (which, I think,
> is more natural, honestly).
>
> On Tue, May 16, 2017 at 12:05 PM, Simon Elliston Ball <si...@simonellistonball.com> wrote:
>
>> Would you then parallelise within Stellar to handle things like multiple
>> lookups? This feels like it would be breaking the storm model somewhat,
>> and could lead to bad things with threads, for example. Or would you
>> think of doing something like the grouping Stellar uses today to
>> parallelise across something like a pool of Stellar bolts and join?
>>
>> I like the idea of Otto's solution (making it someone else's problem,
>> storm's specifically :) ) but that also assumes we insert the artificial
>> latency of a time-windowed join. If we're going down that route, we might
>> as well just use spark and run everything on yarn. At that point, though,
>> we lose a lot of the benefits of low latency for time to detection, and
>> real-time enrichment in things like the streaming enrichment writer.
>>
>> Simon
>>
>> On 16 May 2017, at 16:59, Nick Allen <n...@nickallen.org> wrote:
>>
>>> I would like to see us just migrate wholly to Stellar enrichments and
>>> remove the separate HBase and Geo enrichment bolts from the Enrichment
>>> topology. Stellar provides a user with much greater flexibility than
>>> the existing HBase and Geo enrichment bolts.
>>>
>>> A side effect of this would be to greatly simplify the Enrichment
>>> topology. I don't think we would need the split/join pattern if we did
>>> this. No?
>>>
>>> On Tue, May 16, 2017 at 11:54 AM, Casey Stella <ceste...@gmail.com> wrote:
>>>
>>>> The problem is that an enrichment type won't necessarily have a fixed
>>>> performance characteristic. Take Stellar enrichments, for instance.
>>>> Doing an HBase call for one sensor vs. doing simple string munging
>>>> will have vastly differing performance. Both of them are functioning
>>>> within the Stellar enrichment bolt. Also, some enrichments may call
>>>> for multiple calls to HBase. Parallelizing those would make some
>>>> sense, I think.
>>>>
>>>> I do take your point, though, that it's not as though it's strictly
>>>> serial; it's just that the unit of parallelism is the message, rather
>>>> than the enrichment per message.
>>>>
>>>> On Tue, May 16, 2017 at 11:47 AM, Christian Tramnitz <tramn...@trasec.de> wrote:
>>>>
>>>>> I'm glad you bring this up. This is a huge architectural difference
>>>>> from the original OpenSOC topology, and one that we were warned about
>>>>> back then.
>>>>> To be perfectly honest, I don't see a big performance improvement
>>>>> from parallel processing. If a specific enrichment is a little more
>>>>> I/O dependent than the others, you can tweak parallelism to address
>>>>> this. Also, there can be dependencies that make parallel enrichment
>>>>> virtually impossible, or at least less efficient (i.e. first labeling
>>>>> and "completing" a message, and then, depending on label and
>>>>> completeness, doing different other enrichments).
>>>>>
>>>>> So you have a +1 from me for serial rather than parallel enrichment.
>>>>>
>>>>> BR,
>>>>> Christian
>>>>>
>>>>> On 16.05.17, 16:58, "Casey Stella" <ceste...@gmail.com> wrote:
>>>>>
>>>>>     Hi All,
>>>>>
>>>>>     Last week, I encountered some weirdness in the Enrichment
>>>>>     topology. Doing some somewhat high-latency enrichment work, I
>>>>>     noticed that at some point, data stopped flowing through the
>>>>>     enrichment topology. I tracked down the problem to the join bolt.
>>>>>     For those who aren't aware, we do a split/join pattern so that
>>>>>     enrichments can be done in parallel. It works as follows:
>>>>>
>>>>>     - A split bolt sends the appropriate subset of the message to
>>>>>       each enrichment bolt, as well as the whole message to the join
>>>>>       bolt
>>>>>     - The join bolt will receive each of the pieces of the message
>>>>>       and then, when fully joined, it will send the message on.
>>>>>
>>>>>     What is happening under load or high velocity, however, is that
>>>>>     the cache is evicting the partially joined message before it can
>>>>>     be fully joined, due to the volume of traffic. This is obviously
>>>>>     not ideal. As such, it is clear that adjusting the size of the
>>>>>     cache and the characteristics of eviction is likely a good idea
>>>>>     and a necessary part of tuning enrichments. The cache size is
>>>>>     sensitive to:
>>>>>
>>>>>     - The latency of the *slowest* enrichment
>>>>>     - The number of tuples in flight at once
>>>>>
>>>>>     As such, the knobs you have to tune are either the parallelism of
>>>>>     the join bolt or the size of the cache.
>>>>>
>>>>>     As it stands, I see a couple of things wrong here that we can
>>>>>     correct with minimal issue:
>>>>>
>>>>>     - We have no warning message indicating that this is happening
>>>>>     - Changing cache sizes means changing flux. We should promote
>>>>>       this to the properties file.
>>>>>     - We should document the knobs mentioned above clearly in the
>>>>>       enrichment topology README
>>>>>
>>>>>     Those small changes, I think, are table stakes, but what I wanted
>>>>>     to discuss more in depth are the lingering questions:
>>>>>
>>>>>     - Is this an architectural pattern that we can use as-is?
>>>>>     - Should we consider a persistent cache, a la HBase or Apache
>>>>>       Ignite, as a pluggable component to Metron?
>>>>>     - Should we consider taking the performance hit and doing the
>>>>>       enrichments serially?
>>>>>     - When an eviction happens, what should we do?
>>>>>       - Fail the tuple, thereby making congestion worse
>>>>>       - Pass through the partially enriched results, thereby making
>>>>>         enrichments "best effort"
>>>>>
>>>>>     Anyway, I wanted to talk this through and inform you of some of
>>>>>     the things I'm seeing.
>>>>>
>>>>>     Sorry for the novel. ;)
>>>>>
>>>>>     Casey
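
As a footnote to the windowing discussion above: a minimal sketch of what Otto's time-windowed join (making it storm's problem) might look like on top of Storm's windowed bolt support. The messageId and fragment field names, the fixed fragment count, and the merge step are all hypothetical:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseWindowedBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.windowing.TupleWindow;

    public class WindowedEnrichmentJoin extends BaseWindowedBolt {
      // Hypothetical convention: the split bolt emits a known number of
      // fragments per message (the subsets plus the original message).
      private static final int EXPECTED_FRAGMENTS = 4;

      private OutputCollector collector;

      @Override
      public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
      }

      @Override
      public void execute(TupleWindow window) {
        // Group the window's tuples by message id; Storm owns the buffering,
        // so there is no join cache to size and no eviction to worry about.
        Map<String, List<Tuple>> byId = new HashMap<>();
        for (Tuple t : window.get()) {
          byId.computeIfAbsent(t.getStringByField("messageId"), k -> new ArrayList<>()).add(t);
        }
        for (Map.Entry<String, List<Tuple>> e : byId.entrySet()) {
          if (e.getValue().size() == EXPECTED_FRAGMENTS) {
            collector.emit(new Values(e.getKey(), merge(e.getValue())));
          }
          // Groups still incomplete when the window closes are the ones the
          // window was too short for; they could be flushed as partially
          // enriched "best effort" output instead of being silently dropped.
        }
      }

      private String merge(List<Tuple> fragments) {
        // Placeholder: a real join would fold each enrichment fragment back
        // into the original message.
        StringBuilder merged = new StringBuilder();
        for (Tuple t : fragments) {
          merged.append(t.getStringByField("fragment"));
        }
        return merged.toString();
      }

      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("messageId", "message"));
      }
    }

Wired up with, say, new WindowedEnrichmentJoin().withTumblingWindow(BaseWindowedBolt.Duration.seconds(10)), the window length becomes the knob: nothing is emitted until a window closes, which is exactly the artificial latency Simon and Casey weigh above, and it has to cover the latency of the slowest enrichment.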