+1 on the concept of a useful replay capability. Thoughts:

1. I agree with Otto's observation that ML training has the same needs as those you mention for Profiling, especially use cases #1 and #3, but also use case #2.

2. For use case #2, does it require (or at least significantly benefit from) a time-shifting capability during replay? Time shifting is also very useful for "looping" a small amount of data into a long sequence of data more realistically than simple repeated replay. Time shifting the metadata in the tuple is relatively easy. Time shifting the message content requires another pass through the Parsers. Time shifting the enrichment data may or may not be an issue, but if desired it probably requires another pass through the Enricher.

3. Otto's idea of replaying data through updated Enrichers to get newer enrichment info is also valid (and is mentioned somewhere in the documentation as a potential future feature, I think).

4. Points [2] and [3], plus the need to NOT add replayed data to the Index and Store, together suggest a flow-control capability that allows selectively replaying the data through some elements but not others, with modal control for the Parsers and Enrichers.

5. We should probably add a metadatum to replayed data indicating that it is replayed, somewhat like scientists label gene-modified organisms. This prevents accidental use of the data as real. It would also make it easier to implement [4] above: selective control of replay through some but not all elements of the topologies, and modal control.

6. Will the replay be from PCAP or from the Tuple Store? Replay from the Tuple Store makes most of the above easier, I think.

--Matt

On 11/29/16, 10:07 AM, "[email protected]" <[email protected]> wrote:

At a high level I see a tremendous amount of value in the ability to push data into Metron (the API places data on a Kafka topic), perform either a customized or default (as specified in ZooKeeper) set of actions (maybe enrich, maybe profile, etc.), then configure the method of return (maybe persist in search & HDFS and return OK, maybe return JSON). This also falls in line with what was discussed in METRON-477 at a very high level, although that has more of a focus on the data retention side of things.
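Matt's time-shifting and replay-labeling ideas (points 2 and 5 above) could be sketched roughly as follows. This is only an illustration, not a Metron API: the `timestamp` field (assumed to be epoch milliseconds) and the `metron.replay` metadatum name are assumptions for the sketch.

```python
import json

REPLAY_FLAG = "metron.replay"  # hypothetical metadatum marking replayed data

def time_shift(message_json, offset_ms):
    """Shift a telemetry tuple's metadata timestamp and tag it as replayed.

    Shifting only the tuple metadata is the easy case Matt describes;
    shifting timestamps embedded in the raw message content would require
    another pass through the Parsers.
    """
    message = json.loads(message_json)
    message["timestamp"] = message["timestamp"] + offset_ms
    message[REPLAY_FLAG] = True  # label it, like a gene-modified organism
    return json.dumps(message)

def loop_replay(messages, loops, span_ms):
    """'Loop' a small capture into a longer sequence by replaying it
    repeatedly with an increasing time shift, rather than replaying the
    same timestamps multiple times."""
    for i in range(loops):
        for m in messages:
            yield time_shift(m, i * span_ms)
```

For example, looping a one-message capture twice with a 500 ms span yields two messages at timestamps 1000 and 1500 rather than two identical copies.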
Jon

On Tue, Nov 29, 2016 at 12:26 PM Otto Fowler <[email protected]> wrote:

I think these are valid cases, but there is a more general 'replay' functionality with other cases as well. I would think that Metron may require a general replay story across those cases:

* replay to MaaS, much the same as you have here
* replay of data for updated enrichment/triage/threat intel
* running some MaaS, Profiling, or Triage/Threat Intel completely and *always* on demand

On November 29, 2016 at 12:08:36, Nick Allen ([email protected]) wrote:

I would love any feedback from the community. Is this useful? How should this work? What use cases do you envision? What features do we need to support this? Feel free to respond in this thread or on the JIRA itself: METRON-594 <https://issues.apache.org/jira/browse/METRON-594>

The Profiler currently consumes live telemetry, in real time, as it is streamed through Metron. A useful extension of this functionality would allow the Profiler to also consume archived, historical telemetry. Allowing a user to selectively replay archived, historical raw telemetry through the Profiler has a number of applications. The following use cases help describe why this might be useful.

Use Case 1 - Model Development

When developing a new model, I often need a feature set of historical data on which to train my model. I can either wait days, weeks, or months for the Profiler to generate this from live data, or I can re-run the raw, historical telemetry through the Profiler and get started immediately. It is much simpler to use the same mechanism to create this historical data set than to use a separate batch-driven tool to recreate something that only approximates the historical feature set.

Use Case 2 - Model Deployment

When deploying an analytical model to a new environment, like production, on day 1 there is often no historical data for the model to work with. This often leaves a gap between when the model is deployed and when that model is actually useful.
If I could replay raw telemetry through the Profiler, a historical feature set could be created as part of the deployment process. This allows my model to start functioning on day 1.

Use Case 3 - Profile Validation

When creating a Profile, it is difficult to understand how the configured profile might behave against the entire data set. By creating the profile and watching it consume real-time streaming data, I only gain an understanding of how it behaves on that small segment of data. If I am able to replay historical telemetry, I can instantly understand how it behaves on a much larger data set, including all the anomalies and exceptions that exist in all large data sets.

--
Nick Allen <[email protected]>

--
Jon
Sent from my mobile device
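The selective flow control suggested in Matt's point [4] (replayed data should reach the Profiler but never the Index or Store, with modal control over the Parsers and Enrichers) could be sketched as a simple routing rule. Everything here is hypothetical: the element names, the `metron.replay` flag, and the routing table are assumptions for illustration, not Metron configuration.

```python
# Hypothetical per-element routing for replayed data. Live data flows
# through every topology element exactly as it does today; replayed data
# is routed only through the elements enabled below.
DEFAULT_REPLAY_ROUTE = {
    "parser": True,     # modal: re-parse only if message content was time shifted
    "enricher": True,   # modal: re-enrich only when newer enrichment is wanted
    "profiler": True,   # the point of the replay
    "indexer": False,   # never index replayed data
    "store": False,     # never persist replayed data
}

def should_process(element, message, route=DEFAULT_REPLAY_ROUTE):
    """Return True if the given topology element should handle the message."""
    if not message.get("metron.replay", False):
        return True  # live data is unaffected by replay routing
    return route.get(element, False)
```

Tagging the message (point [5]) is what makes this routing decision cheap: each element only has to inspect one metadatum, and untagged live traffic takes the default path.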
