On-demand operations, as a general feature, should be able to specify output
as well as input.

A user wants to run one of, or a set of, parser | stellar | enrichment |
profile | modeling operations on demand and direct the output to HDFS | ZIP |
INDEX | an extendable sink.
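The "extendable" output idea above could be sketched as a small dispatch registry that maps a sink name to a writer, so new sinks can be plugged in without changing the on-demand job itself. Everything here (sink names, writer behavior, function names) is a hypothetical illustration, not an existing Metron API:

```python
# Hypothetical sketch: a registry mapping sink names (hdfs, zip, ...) to
# writer functions, so an on-demand job can be told where to send output.
from typing import Callable, Dict, List

OutputWriter = Callable[[List[dict]], str]
_writers: Dict[str, OutputWriter] = {}

def register_sink(name: str) -> Callable[[OutputWriter], OutputWriter]:
    """Decorator that registers a writer function under a sink name."""
    def wrap(fn: OutputWriter) -> OutputWriter:
        _writers[name.lower()] = fn
        return fn
    return wrap

@register_sink("hdfs")
def write_hdfs(messages: List[dict]) -> str:
    # Placeholder: a real writer would append the messages to an HDFS path.
    return f"hdfs: wrote {len(messages)} messages"

@register_sink("zip")
def write_zip(messages: List[dict]) -> str:
    # Placeholder: a real writer would bundle the messages into an archive.
    return f"zip: wrote {len(messages)} messages"

def run_on_demand(messages: List[dict], sink: str) -> str:
    """Dispatch the job's output to whichever sink the user requested."""
    try:
        return _writers[sink.lower()](messages)
    except KeyError:
        raise ValueError(f"unknown sink: {sink}") from None
```

A new sink type is then just another `@register_sink(...)` function, which is the sense in which the output side stays extendable.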



On November 29, 2016 at 16:24:01, [email protected] ([email protected]) wrote:

Regarding #4 - I would suggest that needs to be configurable. This would
be useful if there was an issue with persisting and you wanted to replay to
get data into the cluster - essentially a data load job.

Also +1 to time shifting.

On Tue, Nov 29, 2016 at 2:39 PM Matt Foley <[email protected]> wrote:

> +1 on the concept of a useful replay capability. Thoughts:
>
> 1. I agree with Otto’s observation that ML training has the same needs as
> you mention for Profiling, especially use cases #1 and #3, but also use
> case #2.
>
> 2. For use case #2, does it require (or at least significantly benefit
> from) a time-shifting capability during replay? Time shifting is also very
> useful for “looping” a small amount of data into a long sequence of data
> more realistically than just multiple replay. Time shifting the metadata
> in the tuple is relatively easy. Time shifting the message content
> requires another pass through the Parsers. Time shifting enrichment data
> may or may not be an issue, but if desired probably requires another pass
> through the Enricher.
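The tuple-metadata side of Matt's time-shifting idea, including "looping" a small capture into a longer stream, could be sketched roughly as follows. The `timestamp` field in epoch milliseconds is an assumption about the message format, and this deliberately ignores the harder case of shifting timestamps embedded in the raw message content:

```python
# Minimal sketch of time-shifting replayed telemetry: shift every message's
# timestamp by a fixed delta, and "loop" a small capture N times by applying
# an increasing offset to each pass. The "timestamp" field (epoch millis)
# is an assumed part of the message format.
from typing import Iterable, Iterator, List

def time_shift(messages: Iterable[dict], delta_ms: int) -> List[dict]:
    """Return copies of the messages with their timestamps shifted."""
    return [{**m, "timestamp": m["timestamp"] + delta_ms} for m in messages]

def loop_replay(messages: List[dict], passes: int, span_ms: int) -> Iterator[dict]:
    """Replay a small capture repeatedly, shifting each pass forward by the
    capture's span so the result resembles one long contiguous stream."""
    for i in range(passes):
        yield from time_shift(messages, i * span_ms)
```

Shifting timestamps inside the raw message content would, as noted above, require another pass through the Parsers rather than a transform like this one.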
>
> 3. Otto’s replaying of data through updated Enrichers to get newer
> enrichment info is also valid (and mentioned somewhere in documentation as
> a potential future feature, I think).
>
> 4. [2] and [3], plus the need to NOT add replayed data to the Index and
> Store, together suggest a flow control capability that allows selectively
> replaying the data through some elements but not others, with modal control
> for the Parsers and Enrichers.
>
> 5. We should probably add a metadatum to replayed data, indicating it is
> replayed, somewhat like scientists label gene-modified organisms. This
> prevents accidental use of the data as real. It also would make it easier
> to implement [4] above, selective control of replay through some but not
> all elements of the topologies, and modal control.
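The replay metadatum in point 5, and the selective/modal control it enables in point 4, might look something like this sketch; the flag name `metron.is_replay` and both function names are invented for illustration:

```python
# Hypothetical sketch of point 5: label replayed messages with a metadata
# flag so downstream elements (e.g., the indexing writer) can recognize
# them and apply modal control. The "metron.is_replay" field is invented
# for illustration, not an existing Metron convention.
from typing import Iterable, List

REPLAY_FLAG = "metron.is_replay"

def mark_replayed(messages: Iterable[dict]) -> List[dict]:
    """Return copies of the messages labeled as replayed data."""
    return [{**m, REPLAY_FLAG: True} for m in messages]

def should_index(message: dict) -> bool:
    """Modal control for one element: index live data, skip replayed data."""
    return not message.get(REPLAY_FLAG, False)
```

Each topology element would consult the flag with its own predicate like `should_index`, which is what allows replayed data to flow through some elements (Profiler, Enricher) while being kept out of others (Index, Store).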
>
> 6. Will the replay be from PCAP or from Tuple Store? From Tuple Store
> makes most of the above easier, I think.
>
> --Matt
>
> On 11/29/16, 10:07 AM, "[email protected]" <[email protected]> wrote:
>
> At a high level I see a tremendous amount of value in the ability to push
> data into Metron (API places data on a Kafka topic), perform either a
> customized or default (as specified in zk) set of actions (maybe enrich,
> maybe profile, etc.), then configure the method of return (maybe persist in
> search & HDFS and return OK, maybe return JSON). This also falls in line
> with what was discussed in METRON-477 at a very high level, although that
> has more of a focus on the data retention side of things.
>
> Jon
>
> On Tue, Nov 29, 2016 at 12:26 PM Otto Fowler <[email protected]> wrote:
>
> I think these are valid cases, but that there is a more general ‘replay’
> functionality with other cases as well. I would think that Metron may
> require a general replay story across those cases.
>
> * replay to MaaS, much the same as you have here
> * replay of data for updated enrichment/triage/threat intel
> * running some of MaaS, Profiling, Triage/Threat Intel completely and
> *always* on demand
>
>
>
>
> On November 29, 2016 at 12:08:36, Nick Allen ([email protected]) wrote:
>
> I would love any feedback from the community. Is this useful? How should
> this work? What use cases do you envision? What features do we need to
> support this? Feel free to respond in this thread or on the JIRA itself.
>
> METRON-594 <https://issues.apache.org/jira/browse/METRON-594>
>
>
> The Profiler currently consumes live telemetry, in real time, as it is
> streamed through Metron. A useful extension of this functionality would
> allow the Profiler to also consume archived, historical telemetry. Allowing
> a user to selectively replay archived, historical raw telemetry through the
> Profiler has a number of applications. The following use cases help
> describe why this might be useful.
>
> Use Case 1 - Model Development
>
> When developing a new model, I often need a feature set of historical data
> on which to train my model. I can either wait days, weeks, or months for
> the Profiler to generate this based on live data, or re-run the raw,
> historical telemetry through the Profiler to get started immediately. It is
> much simpler to use the same mechanism to create this historical data set
> than a separate batch-driven tool to recreate something that approximates
> the historical feature set.
>
> Use Case 2 - Model Deployment
>
> When deploying an analytical model to a new environment, like production,
> there is often no historical data on day 1 for the model to work with. This
> often leaves a gap between when the model is deployed and when that model
> becomes actually useful. If I could replay raw telemetry through the
> Profiler, a historical feature set could be created as part of the
> deployment process. This allows my model to start functioning on day 1.
>
> Use Case 3 - Profile Validation
>
> When creating a Profile, it is difficult to understand how the configured
> profile might behave against the entire data set. By creating the profile
> and watching it consume real-time streaming data, I only have an
> understanding of how it behaves on that small segment of data. If I am
> able to replay historical telemetry, I can instantly understand how it
> behaves on a much larger data set, including all the anomalies and
> exceptions that exist in all large data sets.
>
>
>
>
>
> --
> Nick Allen <[email protected]>
>
> --
>
> Jon
>
> Sent from my mobile device
>
>
>
>
>
> --

Jon

Sent from my mobile device
