[
https://issues.apache.org/jira/browse/METRON-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nick Allen updated METRON-1699:
-------------------------------
Description:
Create a Batch Profiler that satisfies the following use cases.
h3. Use Cases
* As a Security Data Scientist, I want to understand the historical behaviors
and trends of a profile that I have created so that I can determine if I have
created a feature set that has predictive value for model building.
* As a Security Data Scientist, I want to understand the historical behaviors
and trends of a profile that I have created so that I can determine if I have
defined the profile correctly and created a feature set that matches reality.
* As a Security Platform Engineer, I want to generate a profile using archived
telemetry when I deploy a new model to production so that models depending on
that profile can function on day 1.
h3. Goal
* Currently, a profile can only be generated from the telemetry consumed
*after* the profile was created.
* The goal would be to enable “profile seeding”, which allows profiles to be
populated from a time *before* the profile was created.
* A profile would be seeded using the telemetry that has been archived by
Metron in HDFS.
* A profile consumer should not be able to distinguish the “seeded” portion of
a profile.
!Screen Shot 2018-07-27 at 10.55.27 AM.png!
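For context, the profiles that would benefit from seeding are ordinary Profiler definitions; a minimal example following the structure in the Metron Profiler README (the profile name and the {{bytes_out}} telemetry field are illustrative, not from this issue):

```json
{
  "profiles": [
    {
      "profile": "bytes-out",
      "foreach": "ip_src_addr",
      "init":   { "total": "0" },
      "update": { "total": "total + bytes_out" },
      "result": "total"
    }
  ]
}
```

Seeding would back-fill measurements for this same definition from archived telemetry, so a consumer reading the profile from HBase sees one continuous series.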
h3. Current State
* There are currently two ports of the Profiler: the Streaming Profiler, which
handles streaming data in Storm, and the REPL Profiler, which allows a user to
manually build, test, and debug profiles.
* These ports largely share a common code base in
metron-analytics/metron-profiler-common.
* Each port also requires a smaller set of “orchestration” logic; one for
Storm, another for the REPL.
* Both Profiler ports support both system time and event time processing.
!Screen Shot 2018-07-27 at 11.07.33 AM.png!
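As a sketch of how the REPL port exercises the same shared profile definitions (function names per the metron-profiler-client documentation; treat the exact signatures as assumptions):

```
[Stellar]>>> conf := SHELL_EDIT()
[Stellar]>>> profiler := PROFILER_INIT(conf)
[Stellar]>>> profiler := PROFILER_APPLY(msg, profiler)
[Stellar]>>> values := PROFILER_FLUSH(profiler)
```

Here {{conf}} holds the profiler JSON and {{msg}} a telemetry message; the Batch Profiler would be a third consumer of this same common code base.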
h3. Approach
* Create a third port of the Profiler: the Batch Profiler.
* The Batch Profiler will be built to run in Spark so that the telemetry can
be consumed in batch.
* Allow a user to seed profiles using the JSON telemetry that is archived in
HDFS by Metron Indexing.
* Generate only the profile data stored in HBase, not the messages that are
produced for Threat Triage and Kafka.
* Any number of profiles can be generated at once, but no dependencies between
the profiles are supported; a dependency is where one profile consumes the
profile generated by another.
* The Batch Profiler must use the timestamps contained within the telemetry;
it runs on event time. Luckily the Profiler already supports event time.
* Enable a pluggable mechanism so that telemetry stored in different formats
can be consumed by the Batch Profiler. For example, the Profiler should be able
to consume telemetry stored as raw JSON or in other formats like ORC or
Parquet.
!Screen Shot 2018-07-27 at 11.10.16 AM.png!
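The core of the seeding step above can be sketched independently of Spark as an event-time grouping over archived JSON telemetry (the {{timestamp}} and {{ip_src_addr}} field names follow Metron conventions; the 15-minute period and the count aggregation are illustrative):

```python
import json
from collections import defaultdict

PERIOD_MS = 15 * 60 * 1000  # profile period length; illustrative

def seed_profile(archived_lines, entity_field="ip_src_addr"):
    """Group archived telemetry by (entity, event-time period) and
    aggregate a simple count, roughly what the Batch Profiler would
    compute before writing each measurement to HBase."""
    measurements = defaultdict(int)
    for line in archived_lines:
        msg = json.loads(line)
        # Event time: use the timestamp inside the message, never wall-clock.
        period = msg["timestamp"] // PERIOD_MS
        measurements[(msg[entity_field], period)] += 1
    return dict(measurements)

telemetry = [
    '{"ip_src_addr": "10.0.0.1", "timestamp": 100000}',
    '{"ip_src_addr": "10.0.0.1", "timestamp": 200000}',
    '{"ip_src_addr": "10.0.0.2", "timestamp": 900000000}',
]
print(seed_profile(telemetry))
```

In Spark the same shape falls out naturally: read the archive (JSON, ORC, or Parquet via a pluggable reader), derive the period from the event timestamp, and group by entity and period.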
> Create Batch Profiler
> ---------------------
>
> Key: METRON-1699
> URL: https://issues.apache.org/jira/browse/METRON-1699
> Project: Metron
> Issue Type: Improvement
> Reporter: Nick Allen
> Assignee: Nick Allen
> Priority: Major
> Attachments: Screen Shot 2018-07-27 at 10.55.27 AM.png, Screen Shot
> 2018-07-27 at 11.07.33 AM.png, Screen Shot 2018-07-27 at 11.10.16 AM.png
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)