+1 on the concept of a useful replay capability.  Thoughts:

1. I agree with Otto’s observation that ML training has the same needs as you 
mention for Profiling, especially use cases #1 and #3, but also use case #2.

2. For use case #2, does it require (or at least significantly benefit from) a 
time-shifting capability during replay?  Time shifting is also very useful for 
“looping” a small amount of data into a long sequence of data more 
realistically than simply replaying it multiple times.  Time shifting the 
metadata in the tuple is relatively easy.  Time shifting the message content 
requires another pass through the Parsers.  Time shifting enrichment data may 
or may not be an issue, but if desired it probably requires another pass 
through the Enricher.
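The metadata time shift above can be sketched as a simple transform.  This is 
a minimal illustration, assuming a map-based tuple and a hypothetical 
"timestamp" field; Metron's actual tuple schema may differ:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of time-shifting replayed tuple metadata.
// The "timestamp" field name and the looping scheme are assumptions.
public class ReplayTimeShift {

    // Shift a tuple's metadata timestamp forward by deltaMs,
    // leaving the rest of the tuple untouched.
    static Map<String, Object> shift(Map<String, Object> tuple, long deltaMs) {
        Map<String, Object> shifted = new HashMap<>(tuple);
        long ts = (Long) tuple.get("timestamp");
        shifted.put("timestamp", ts + deltaMs);
        return shifted;
    }

    // "Loop" a small capture N times, shifting each pass by the capture's
    // full time span so the result reads as one long, continuous sequence.
    static List<Map<String, Object>> loop(List<Map<String, Object>> capture,
                                          int passes) {
        long first = (Long) capture.get(0).get("timestamp");
        long last = (Long) capture.get(capture.size() - 1).get("timestamp");
        long span = last - first + 1;  // +1 avoids duplicate boundary times
        List<Map<String, Object>> out = new ArrayList<>();
        for (int pass = 0; pass < passes; pass++) {
            for (Map<String, Object> t : capture) {
                out.add(shift(t, pass * span));
            }
        }
        return out;
    }
}
```

Because each pass is offset by the capture's time span, the replayed stream 
looks like one continuous sequence rather than N identical bursts starting at 
the same instant.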

3. Otto’s replaying of data through updated Enrichers to get newer enrichment 
info is also valid (and mentioned somewhere in documentation as a potential 
future feature, I think).

4. [2] and [3], plus the need NOT to add replayed data to the Index and Store, 
together suggest a flow-control capability that allows selectively replaying 
the data through some elements but not others, with modal control for the 
Parsers and Enrichers.

5. We should probably add a metadatum to replayed data indicating that it is 
replayed, somewhat like scientists label gene-modified organisms.  This 
prevents accidental use of the replayed data as real.  It would also make it 
easier to implement [4] above: selective control of replay through some but 
not all elements of the topologies, and modal control.
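Taken together, [4] and [5] amount to a per-element gate keyed on the replay 
label.  A minimal sketch, assuming a hypothetical "is_replay" metadatum and a 
simplified element set; the real topologies have more moving parts, and modal 
control of the Parsers and Enrichers would need additional state:

```java
import java.util.Map;

// Illustrative sketch of gating topology elements on a replay flag.
// The "is_replay" field name and the element set are assumptions about
// how selective replay [4] and labeling [5] could fit together.
public class ReplayGate {

    enum Element { PARSER, ENRICHER, PROFILER, INDEXER, STORE }

    // Per-element policy: replayed tuples may flow through the Parsers,
    // Enrichers, and Profiler, but must never reach the Index or Store.
    static boolean shouldProcess(Element element, Map<String, Object> tuple) {
        boolean replayed = Boolean.TRUE.equals(tuple.get("is_replay"));
        if (!replayed) {
            return true;  // live data flows through everything
        }
        switch (element) {
            case INDEXER:
            case STORE:
                return false;  // never persist replayed data as real
            default:
                return true;
        }
    }
}
```

One design note: because the gate reads the label carried on the tuple itself, 
the same topology code handles live and replayed data, and no separate replay 
topology is needed.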

6. Will the replay be from PCAP or from Tuple Store?  From Tuple Store makes 
most of the above easier, I think.

--Matt

On 11/29/16, 10:07 AM, "[email protected]" <[email protected]> wrote:

    At a high level I see a tremendous amount of value in the ability to push
    data into Metron (API places data on a Kafka topic), perform either a
    customized or default (as specified in zk) set of actions (maybe enrich,
    maybe profile, etc.), then configure the method of return (maybe persist in
    search&HDFS and return OK, maybe return JSON).  This also falls in line
    with what was discussed in METRON-477 at a very high level, although that
    has more of a focus on the data retention side of things.
    
    Jon
    
    On Tue, Nov 29, 2016 at 12:26 PM Otto Fowler <[email protected]>
    wrote:
    
    I think these are valid cases, but there is a more general ‘replay’
    capability with other use cases as well.  I would think that Metron may
    require a general replay story across all of those cases.
    
    * replay to MaaS much the same as you have here
    * replay of data for updated enrichment/triage/threat intel
    * running some MaaS, Profiling, Triage/Threat completely and *always* on
    demand
    
    
    
    
    On November 29, 2016 at 12:08:36, Nick Allen ([email protected]) wrote:
    
    I would love any feedback from the community. Is this useful? How should
    this work? What use cases do you envision? What features do we need to
    support this? Feel free to respond in this thread or on the JIRA itself.
    
    METRON-594 <https://issues.apache.org/jira/browse/METRON-594>
    
    
    The Profiler currently consumes live telemetry, in real-time, as it is
    streamed through Metron. A useful extension of this functionality would
    allow the Profiler to also consume archived, historical telemetry. Allowing
    a user to selectively replay archived, historical raw telemetry through the
    Profiler has a number of applications. The following use cases help
    describe why this might be useful.
    
    Use Case 1 - Model Development
    
    When developing a new model, I often need a feature set of historical data
    on which to train my model. I can either wait days, weeks, or months for
    the Profiler to generate this from live data, or I can re-run the raw,
    historical telemetry through the Profiler and get started immediately. It
    is much simpler to use the same mechanism to create this historical data
    set than to build a separate batch-driven tool that recreates something
    that only approximates the historical feature set.
    
    Use Case 2 - Model Deployment
    
    When deploying an analytical model to a new environment, like production,
    on day 1 there is often no historical data for the model to work with.
    This often leaves a gap between when the model is deployed and when it is
    actually useful. If I could replay raw telemetry through the Profiler, a
    historical feature set could be created as part of the deployment process,
    allowing my model to start functioning on day 1.
    
    Use Case 3 - Profile Validation
    
    When creating a Profile, it is difficult to understand how the configured
    profile might behave against the entire data set. By creating the profile
    and watching it consume real-time streaming data, I only gain an
    understanding of how it behaves on that small segment of data. If I am
    able to replay historical telemetry, I can instantly understand how it
    behaves on a much larger data set, including all the anomalies and
    exceptions that exist in all large data sets.
    
    
    
    
    
    --
    Nick Allen <[email protected]>
    
    -- 
    
    Jon
    
    Sent from my mobile device
    



