On-demand operations, as a general feature, should be able to specify output as well as input.
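As a rough illustration of "output as well as input", the sketch below shows what a job specification for an on-demand run might look like. Every field name here is hypothetical — nothing like `components` or `outputs` exists in Metron today — but it captures the idea that the user selects which processing elements to run and where the results go:

```python
# Hypothetical on-demand job spec. Field names and allowed values are
# illustrative assumptions, not part of any existing Metron API.

ALLOWED_COMPONENTS = {"parser", "stellar", "enrichment", "profile", "modeling"}
ALLOWED_OUTPUTS = {"hdfs", "zip", "index"}  # intended to be extensible

def validate_job(job):
    """Reject a job spec that names unknown components or outputs."""
    bad_components = set(job.get("components", [])) - ALLOWED_COMPONENTS
    bad_outputs = set(job.get("outputs", [])) - ALLOWED_OUTPUTS
    if bad_components or bad_outputs:
        raise ValueError(
            f"unknown components {bad_components} or outputs {bad_outputs}")
    return job

# A user replays archived telemetry through parser -> enrichment -> profile,
# writing results only to the index (path below is made up for the example).
replay_job = validate_job({
    "input": {"source": "hdfs", "path": "/example/archived/telemetry"},
    "components": ["parser", "enrichment", "profile"],
    "outputs": ["index"],
})
```

The point of validating against explicit allow-lists is that "extensible" outputs can be added by registering a new value, while typos fail fast instead of silently dropping data.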
A user wants to run one of, or a set of, parser | stellar | enrichment | profile | modeling on demand and output to HDFS | ZIP | INDEX | extensible.

On November 29, 2016 at 16:24:01, [email protected] ([email protected]) wrote:

Regarding #4 - I would suggest that needs to be configurable. This would be useful if there was an issue with persisting and you wanted to replay to get data into the cluster - essentially a data load job. Also +1 to time shifting.

On Tue, Nov 29, 2016 at 2:39 PM Matt Foley <[email protected]> wrote:

+1 on the concept of a useful replay capability. Thoughts:

1. I agree with Otto’s observation that ML training has the same needs as you mention for Profiling, especially use cases #1 and #3, but also use case #2.

2. For use case #2, does it require (or at least significantly benefit from) a time-shifting capability during replay? Time shifting is also very useful for “looping” a small amount of data into a long sequence of data more realistically than just multiple replays. Time shifting the metadata in the tuple is relatively easy. Time shifting the message content requires another pass through the Parsers. Time shifting enrichment data may or may not be an issue, but if desired probably requires another pass through the Enricher.

3. Otto’s replaying of data through updated Enrichers to get newer enrichment info is also valid (and mentioned somewhere in documentation as a potential future feature, I think).

4. [2] and [3], plus the need to NOT add replayed data to the Index and Store, together suggest a flow-control capability that allows selectively replaying the data through some elements but not others, with modal control for the Parsers and Enrichers.

5. We should probably add a metadatum to replayed data, indicating it is replayed, somewhat like scientists label gene-modified organisms. This prevents accidental use of the data as real.
It also would make it easier to implement [4] above: selective control of replay through some, but not all, elements of the topologies, and modal control.

6. Will the replay be from PCAP or from the Tuple Store? From the Tuple Store makes most of the above easier, I think.

--Matt

On 11/29/16, 10:07 AM, "[email protected]" <[email protected]> wrote:

At a high level I see a tremendous amount of value in the ability to push data into Metron (API places data on a Kafka topic), perform either a customized or default (as specified in zk) set of actions (maybe enrich, maybe profile, etc.), then configure the method of return (maybe persist in search & HDFS and return OK, maybe return JSON). This also falls in line with what was discussed in METRON-477 at a very high level, although that has more of a focus on the data retention side of things.

Jon

On Tue, Nov 29, 2016 at 12:26 PM Otto Fowler <[email protected]> wrote:

I think these are valid cases, but that there is a more general ‘replay’ functionality with other cases as well. I would think that Metron may require a general replay story across those cases:

* replay to MaaS, much the same as you have here
* replay of data for updated enrichment/triage/threat intel
* running some MaaS, Profiling, Triage/Threat completely and *always* on demand

On November 29, 2016 at 12:08:36, Nick Allen ([email protected]) wrote:

I would love any feedback from the community. Is this useful? How should this work? What use cases do you envision? What features do we need to support this? Feel free to respond in this thread or on the JIRA itself.

METRON-594 <https://issues.apache.org/jira/browse/METRON-594>

The Profiler currently consumes live telemetry, in real time, as it is streamed through Metron. A useful extension of this functionality would allow the Profiler to also consume archived, historical telemetry.
Allowing a user to selectively replay archived, historical raw telemetry through the Profiler has a number of applications. The following use cases help describe why this might be useful.

Use Case 1 - Model Development

When developing a new model, I often need a feature set of historical data on which to train my model. I can either wait days, weeks, or months for the Profiler to generate this based on live data, or I could re-run the raw, historical telemetry through the Profiler to get started immediately. It is much simpler to use the same mechanism to create this historical data set than a separate batch-driven tool to recreate something that approximates the historical feature set.

Use Case 2 - Model Deployment

When deploying an analytical model to a new environment, like production, on day 1 there is often no historical data for the model to work with. This often leaves a gap between when the model is deployed and when that model is actually useful. If I could replay raw telemetry through the Profiler, a historical feature set could be created as part of the deployment process. This allows my model to start functioning on day 1.

Use Case 3 - Profile Validation

When creating a profile, it is difficult to understand how the configured profile might behave against the entire data set. By creating the profile and watching it consume real-time streaming data, I only have an understanding of how it behaves on that small segment of data. If I am able to replay historical telemetry, I can instantly understand how it behaves on a much larger data set, including all the anomalies and exceptions that exist in all large data sets.

--
Nick Allen <[email protected]>

--
Jon

Sent from my mobile device
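Two of Matt's points above — time shifting message content [2] and labeling replayed data with a metadatum [5] — can be sketched as a single transformation applied to each message before it re-enters the pipeline. The field names `metron.replayed` and `timestamp` below are illustrative assumptions, not an existing Metron message schema:

```python
import json

def prepare_for_replay(message_json, time_shift_ms):
    """Tag a telemetry message as replayed and shift its timestamp.

    Assumptions (not part of any real Metron schema): messages are JSON
    objects, carry an epoch-millisecond 'timestamp' field, and a boolean
    'metron.replayed' field is the chosen replay marker.
    """
    message = json.loads(message_json)
    message["metron.replayed"] = True          # point [5]: label replayed data
    if "timestamp" in message:
        message["timestamp"] += time_shift_ms  # point [2]: time shifting
    return json.dumps(message)

original = json.dumps({"ip_src_addr": "10.0.0.1", "timestamp": 1480440000000})
# Shift the archived message forward by one week for the replay run.
shifted = prepare_for_replay(original, 7 * 24 * 60 * 60 * 1000)
```

Downstream, the replay marker is what makes point [4] cheap to implement: indexing and storage elements can filter on it to keep replayed data out of the Index and Store, and profiles or models can opt in or out of replayed input.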
