The field stub also gives us something that the error dashboard (or
similar) could graph, allowing failed enrichments to "shout" louder to
the end user.

Jon

On Tue, May 16, 2017 at 12:34 PM Nick Allen <n...@nickallen.org> wrote:

> > but also adds a field stub to indicate failed enrichment. This is then
> > an indicator to an operator or investigator as well that something is
> > missing, and could drive things like replay of the message to
> > retrospectively enrich when things have calmed down.
>
> Yes, I like the idea of a "field stub".  You need some way to distinguish
> "did I configure this wrong" versus "something bad happened outside of my
> control".
>
>
>
> On Tue, May 16, 2017 at 12:27 PM, Simon Elliston Ball <
> si...@simonellistonball.com> wrote:
>
> > Nick, I’d tend to agree with you there.
> >
> > How about:
> > If an enrichment fails / effectively times out, the join bolt emits the
> > message before cache eviction (as Nick’s point 2), but also adds a field
> > stub to indicate failed enrichment. This is then an indicator to an
> > operator or investigator as well that something is missing, and could
> > drive things like replay of the message to retrospectively enrich when
> > things have calmed down.
> >
> > Simon
> >
> > > On 16 May 2017, at 17:25, Nick Allen <n...@nickallen.org> wrote:
> > >
> > > Ah, yes.  Makes sense and I can see the value in the parallelism that
> > > the split/join provides.  Personally, I would like to see the code do
> > > the following.
> > >
> > > (1) Scream and shout when something in the cache expires.  We have to
> > > make sure that it is blatantly obvious to a user what happened.  We
> > > also need to make it blatantly obvious to the user what knobs they can
> > > turn to correct the problem.
> > >
> > > (2) Enrichments should be treated as best-effort.  When the cache
> > > expires, it should pass on the message without the enrichments that
> > > have not completed.  If I am relying on an external system for an
> > > enrichment, I don't want an external system outage to fail all of my
> > > telemetry.
> > >
> > > On Tue, May 16, 2017 at 12:05 PM, Casey Stella <ceste...@gmail.com>
> > > wrote:
> > >
> > >> We still do use split/join even within stellar enrichments.  Take for
> > >> instance the following enrichment:
> > >> {
> > >>   "enrichment" : {
> > >>     "fieldMap" : {
> > >>       "stellar" : {
> > >>         "config" : {
> > >>           "parallel-task-1" : {
> > >>             "my_field" : "PROFILE_GET(....)"
> > >>           },
> > >>           "parallel-task-2" : {
> > >>             "my_field2" : "PROFILE_GET(....)"
> > >>           }
> > >>         }
> > >>       }
> > >>     }
> > >>   }
> > >> }
> > >>
> > >> Messages will get split between two tasks of the Stellar enrichment
> > >> bolt, and the stellar statements in "parallel-task-1" will be executed
> > >> in parallel to those in "parallel-task-2".  This is to enable people
> > >> to separate computationally intensive or otherwise high latency tasks
> > >> that are independent across nodes in the cluster.
> > >>
> > >> I will agree wholeheartedly, though, that my personal desire would be
> > >> to have just stellar enrichments.  You can do every one of the other
> > >> enrichments in Stellar and it would greatly simplify that config
> > >> above.
> > >>
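As a rough analogy for the config above (not Metron code, and the names are invented for illustration), the independent task groups can be thought of as running concurrently and being joined back into one message:

```python
# Rough analogy of independent task groups executing in parallel:
# each group's statements run concurrently, and the results are
# merged ("joined") back into the original message.
from concurrent.futures import ThreadPoolExecutor

def run_task_group(statements, message):
    # Stand-in for evaluating Stellar statements against a message.
    return {field: fn(message) for field, fn in statements.items()}

def enrich_parallel(task_groups, message):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_task_group, stmts, message)
                   for stmts in task_groups.values()]
        for f in futures:
            message.update(f.result())  # join step
    return message

groups = {
    "parallel-task-1": {"my_field":  lambda m: "profile-1"},
    "parallel-task-2": {"my_field2": lambda m: "profile-2"},
}
enriched = enrich_parallel(groups, {"ip_src_addr": "10.0.0.1"})
```

The point of the grouping is exactly what the mail describes: a slow group (say, an HBase-backed PROFILE_GET) does not serialize behind a fast one.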
> > >>
> > >>
> > >> On Tue, May 16, 2017 at 11:59 AM, Nick Allen <n...@nickallen.org>
> > >> wrote:
> > >>
> > >>> I would like to see us just migrate wholly to Stellar enrichments
> > >>> and remove the separate HBase and Geo enrichment bolts from the
> > >>> Enrichment topology.  Stellar provides a user with much greater
> > >>> flexibility than the existing HBase and Geo enrichment bolts.
> > >>>
> > >>> A side effect of this would be to greatly simplify the Enrichment
> > >>> topology.  I don't think we would need the split/join pattern if we
> > >>> did this. No?
> > >>>
> > >>> On Tue, May 16, 2017 at 11:54 AM, Casey Stella <ceste...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> The problem is that an enrichment type won't necessarily have a
> > >>>> fixed performance characteristic.  Take stellar enrichments, for
> > >>>> instance.  Doing an HBase call for one sensor vs doing simple
> > >>>> string munging will have vastly differing performance.  Both of
> > >>>> them are functioning within the stellar enrichment bolt.  Also,
> > >>>> some enrichments may call for multiple calls to HBase.
> > >>>> Parallelizing those would make some sense, I think.
> > >>>>
> > >>>> I do take your point, though, that it's not as though it's strictly
> > >>>> serial; it's just that the unit of parallelism is the message,
> > >>>> rather than the enrichment per message.
> > >>>>
> > >>>> On Tue, May 16, 2017 at 11:47 AM, Christian Tramnitz <
> > >>>> tramn...@trasec.de> wrote:
> > >>>>
> > >>>>> I’m glad you bring this up. This is a huge architectural
> > >>>>> difference from the original OpenSOC topology, and one that we
> > >>>>> were warned about back then.
> > >>>>> To be perfectly honest, I don’t see the big performance
> > >>>>> improvement from parallel processing. If a specific enrichment is
> > >>>>> a little more I/O dependent than the others, you can tweak
> > >>>>> parallelism to address this. Also there can be dependencies that
> > >>>>> make parallel enrichment virtually impossible or at least less
> > >>>>> efficient (e.g. first labeling and “completing” a message, then
> > >>>>> running different enrichments depending on the label and
> > >>>>> completeness).
> > >>>>>
> > >>>>> So you have a +1 from me for serial rather than parallel
> > >>>>> enrichment.
> > >>>>>
> > >>>>>
> > >>>>> BR,
> > >>>>>   Christian
> > >>>>>
> > >>>>> On 16.05.17, 16:58, "Casey Stella" <ceste...@gmail.com> wrote:
> > >>>>>
> > >>>>>    Hi All,
> > >>>>>
> > >>>>>    Last week, I encountered some weirdness in the Enrichment
> > >>>>>    topology.  Doing some somewhat high-latency enrichment work, I
> > >>>>>    noticed that at some point, data stopped flowing through the
> > >>>>>    enrichment topology.  I tracked down the problem to the join
> > >>>>>    bolt.  For those who aren't aware, we do a split/join pattern
> > >>>>>    so that enrichments can be done in parallel.  It works as
> > >>>>>    follows:
> > >>>>>
> > >>>>>       - A split bolt sends the appropriate subset of the message
> > >>>>>       to each enrichment bolt, as well as the whole message to
> > >>>>>       the join bolt
> > >>>>>       - The join bolt will receive each of the pieces of the
> > >>>>>       message and then, when fully joined, it will send the
> > >>>>>       message on.
> > >>>>>
> > >>>>>    What is happening under load or high velocity, however, is
> > >>>>>    that the cache is evicting the partially joined message before
> > >>>>>    it can be fully joined, due to the volume of traffic.  This is
> > >>>>>    obviously not ideal.  As such, it is clear that adjusting the
> > >>>>>    size of the cache and the characteristics of eviction is
> > >>>>>    likely a good idea and a necessary part of tuning enrichments.
> > >>>>>    The cache size is sensitive to:
> > >>>>>
> > >>>>>       - The latency of the *slowest* enrichment
> > >>>>>       - The number of tuples in flight at once
> > >>>>>
> > >>>>>    As such, the knobs you have to tune are either the parallelism
> > >>>>>    of the join bolt or the size of the cache.
> > >>>>>
> > >>>>>    As it stands, I see a couple of things wrong here that we can
> > >>>>>    correct with minimal issue:
> > >>>>>
> > >>>>>       - We have no warning message indicating that this is
> > >>>>>       happening
> > >>>>>       - Changing cache sizes means changing flux.  We should
> > >>>>>       promote this to the properties file.
> > >>>>>       - We should document the knobs mentioned above clearly in
> > >>>>>       the enrichment topology README
> > >>>>>
> > >>>>>    Those small changes, I think, are table stakes, but what I
> > >>>>>    wanted to discuss more in depth are the lingering questions:
> > >>>>>
> > >>>>>       - Is this an architectural pattern that we can use as-is?
> > >>>>>          - Should we consider a persistent cache a la HBase or
> > >>>>>          Apache Ignite as a pluggable component to Metron?
> > >>>>>          - Should we consider taking the performance hit and
> > >>>>>          doing the enrichments serially?
> > >>>>>       - When an eviction happens, what should we do?
> > >>>>>          - Fail the tuple, thereby making congestion worse
> > >>>>>          - Pass through the partially enriched results, thereby
> > >>>>>          making enrichments "best effort"
> > >>>>>
> > >>>>>    Anyway, I wanted to talk this through and inform you of some
> > >>>>>    of the things I'm seeing.
> > >>>>>
> > >>>>>    Sorry for the novel. ;)
> > >>>>>
> > >>>>>    Casey
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
> >
>
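The failure mode Casey describes, and the "best effort" option Nick and Simon converge on, can be sketched with a size-bounded join cache. This is an illustrative model, not Metron code: the class, the piece-counting heuristic, and the "enrichment.incomplete" stub field are all invented for the example.

```python
# Illustrative model of the join-bolt failure mode: a size-bounded cache
# holds partially joined messages; under load, an entry can be evicted
# before all enrichment pieces arrive.  The "best effort" option emits
# the partial message (with a stub field) instead of dropping it.
from collections import OrderedDict

class JoinCache:
    def __init__(self, max_size, expected_pieces, on_evict):
        self.max_size = max_size
        self.expected = expected_pieces   # fields needed for a full join
        self.on_evict = on_evict          # called with a partial message
        self.cache = OrderedDict()

    def add_piece(self, msg_id, piece):
        entry = self.cache.setdefault(msg_id, {})
        entry.update(piece)
        self.cache.move_to_end(msg_id)
        if len(entry) >= self.expected:   # fully joined: emit normally
            return self.cache.pop(msg_id)
        while len(self.cache) > self.max_size:  # eviction under load
            _, partial = self.cache.popitem(last=False)
            partial["enrichment.incomplete"] = True  # hypothetical stub
            self.on_evict(partial)
        return None

evicted = []
cache = JoinCache(max_size=1, expected_pieces=2, on_evict=evicted.append)
cache.add_piece("m1", {"geo": "US"})   # partial, cached
cache.add_piece("m2", {"geo": "DE"})   # forces eviction of m1, still partial
```

With `on_evict` wired to emit downstream, this models passing partially enriched results through rather than failing the tuple; the stub field is what would let an operator, or a dashboard, see that the message is incomplete.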
-- 

Jon
