I’m glad you brought this up. This is a huge architectural difference from the 
original OpenSOC topology, and one we were warned about back then.
To be perfectly honest, I don’t see a big performance improvement from 
parallel processing. If a specific enrichment is a little more I/O-bound 
than the others, you can tweak its parallelism to address that. Also, there 
can be dependencies that make parallel enrichment virtually impossible, or at 
least less efficient (e.g. first labeling and “completing” a message, and then 
performing different further enrichments depending on the label and completeness).
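
The dependency case above can be sketched as a serial pipeline, where a
later enrichment branches on a label assigned by an earlier one, something a
pure split/join cannot express. This is an illustrative sketch only, not
Metron code; all function and field names here are invented:

```python
# Illustrative sketch: serial enrichment with a dependent step.
# None of these names correspond to actual Metron classes or fields.

def label(msg):
    # First pass: classify the message so later steps can depend on it.
    msg["label"] = "dns" if "qname" in msg else "flow"
    return msg

def enrich_dns(msg):
    msg["dns_enriched"] = True
    return msg

def enrich_flow(msg):
    msg["flow_enriched"] = True
    return msg

def enrich_serially(msg):
    msg = label(msg)
    # The branch below is only possible because labeling ran first.
    if msg["label"] == "dns":
        return enrich_dns(msg)
    return enrich_flow(msg)
```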

So you have a +1 from me for serial rather than parallel enrichment.


BR,
   Christian

On 16.05.17, 16:58, "Casey Stella" <ceste...@gmail.com> wrote:

    Hi All,
    
    Last week, I encountered some weirdness in the Enrichment topology.  Doing
    some somewhat high-latency enrichment work, I noticed that at some point,
    data stopped flowing through the enrichment topology.  I tracked down the
    problem to the join bolt.  For those who aren't aware, we do a split/join
    pattern so that enrichments can be done in parallel.  It works as follows:
    
       - A split bolt sends the appropriate subset of the message to each
       enrichment bolt as well as the whole message to the join bolt
       - The join bolt will receive each of the pieces of the message and then,
       when fully joined, it will send the message on.
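
    A minimal sketch of that join step, assuming illustrative names (this is
    not Metron's actual join bolt; the real one is a Storm bolt backed by a
    cache):

```python
# Sketch of the join side of the split/join pattern described above.
# Collects pieces per message id and emits once everything has arrived.

class JoinBolt:
    def __init__(self, expected_enrichments):
        self.expected = set(expected_enrichments)
        self.pending = {}  # message id -> {piece name: piece}

    def receive(self, msg_id, piece_name, piece):
        pieces = self.pending.setdefault(msg_id, {})
        pieces[piece_name] = piece
        # The split bolt also sends the whole message, keyed "message".
        if set(pieces) >= self.expected | {"message"}:
            # Fully joined: emit downstream and drop the partial state.
            return self.pending.pop(msg_id)
        return None  # still waiting on pieces
```

    The failure mode Casey describes corresponds to an entry being dropped
    from `pending` (in Metron, a size-bounded cache) before the last piece
    arrives.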
    
    
    Under load or high velocity, however, the cache evicts the partially
    joined message before it can be fully joined, simply due to the volume of
    traffic.  This is obviously not ideal.  As such, adjusting the size of
    the cache and the characteristics of eviction is a necessary part of
    tuning enrichments.
    The cache size is sensitive to:
    
       - The latency of the *slowest* enrichment
       - The number of tuples in flight at once
    
    As such, the knobs you have to tune are either the parallelism of the join
    bolt or the size of the cache.
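
    Those two sensitivities suggest a rough sizing rule: by Little's law, the
    number of partially joined messages resident at once is roughly
    throughput times the latency of the slowest enrichment, and the cache
    must hold at least that many entries. A back-of-the-envelope sketch
    (the function and its safety factor are my own, not anything in Metron):

```python
def min_join_cache_size(tuples_per_sec, slowest_latency_sec, safety_factor=2.0):
    """Rough lower bound on join-cache entries: Little's law says
    in-flight partial joins ~ throughput x slowest-enrichment latency.
    The safety factor absorbs burstiness; tune it empirically."""
    return int(tuples_per_sec * slowest_latency_sec * safety_factor)
```

    For example, 10,000 tuples/sec against a 500 ms slowest enrichment would
    call for at least 10,000 cache entries with a 2x safety factor.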
    
    As it stands, I see a couple of things wrong here that we can correct
    with minimal fuss:
    
       - We emit no warning indicating that this is happening
       - Changing the cache size means changing flux.  We should promote it
       to the properties file.
       - We should document the knobs mentioned above clearly in the enrichment
       topology README
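
    Promoting the cache settings to the properties file might look something
    like the fragment below. These property names are purely hypothetical,
    for illustration only; they are not existing Metron properties:

```properties
# Hypothetical enrichment properties (names invented for illustration):
enrichment.join.cache.size=100000
enrichment.join.cache.expiration.minutes=10
```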
    
    Those small changes, I think, are table stakes, but what I wanted to
    discuss in more depth are the lingering questions:
    
       - Is this an architectural pattern that we can use as-is?
          - Should we consider a persistent cache a la HBase or Apache Ignite
          as a pluggable component to Metron?
          - Should we consider taking the performance hit and doing the
          enrichments serially?
       - When an eviction happens, what should we do?
          - Fail the tuple, thereby making congestion worse
          - Pass through the partially enriched results, thereby making
          enrichments "best effort"
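
    The "best effort" option could be sketched as an LRU cache that, instead
    of silently dropping an entry, hands the partially enriched message to a
    callback on eviction. This is a sketch under assumed names, not how
    Metron's cache currently behaves:

```python
from collections import OrderedDict

class BestEffortJoinCache:
    """LRU cache of partial joins. On eviction, the partially enriched
    message is passed to a callback rather than lost, making enrichment
    "best effort". Illustrative sketch only."""

    def __init__(self, max_size, on_evict):
        self.max_size = max_size
        self.on_evict = on_evict
        self.entries = OrderedDict()  # msg_id -> partial pieces

    def put(self, msg_id, pieces):
        self.entries[msg_id] = pieces
        self.entries.move_to_end(msg_id)  # mark as most recently used
        while len(self.entries) > self.max_size:
            # Evict the least recently used entry, but emit it downstream
            # as a partially enriched result instead of dropping it.
            evicted_id, partial = self.entries.popitem(last=False)
            self.on_evict(evicted_id, partial)
```

    The same callback hook would also be a natural place to log the warning
    mentioned above, or to fail the tuple if we chose that option instead.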
    
    Anyway, I wanted to talk this through and let you know some of the
    things I'm seeing.
    
    Sorry for the novel. ;)
    
    Casey
    
