Re: [DISCUSS] Batch Profiler Feature Branch

Michael Miklavcic Thu, 20 Sep 2018 09:15:59 -0700

I think I'm torn on this, specifically because it's batch and would
generally be run as-needed. Justin, can you elaborate on your concerns
there? This feels functionally very similar to our flat file loaders, which
all have inputs for config from the CLI only. On the other hand, our flat
file loaders are not typically seeding an existing structure. My concern about
a local file profiler config stems from this stated goal:
> The goal would be to enable “profile seeding” which allows profiles to be
> populated from a time before the profile was created.
So if the config does not correctly match the profiler config held in ZK
and the user runs the batch seeding job, what happens?
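
To make the question concrete, here is a minimal sketch of a pre-flight check
that a wrapper script could run before the seeding job. The function name
`profiles_match` and the idea of first exporting the ZK definition to a file
are assumptions for illustration only, not existing Metron features. It treats
two definitions as equal only if they are identical once all whitespace is
removed, so reordered JSON keys would be reported as a mismatch.

```shell
#!/bin/sh
# Sketch only: profiles_match is a hypothetical helper, not part of Metron.
# Both arguments are paths to profile definition files (JSON).
# Returns 0 when the files are identical after removing all whitespace,
# and non-zero otherwise. This is a crude textual comparison: it does not
# parse JSON, so a different key order counts as a mismatch.
profiles_match() {
  local_def="$1"   # definition that would be passed to the batch job
  zk_def="$2"      # copy of the definition currently held in Zookeeper
  [ "$(tr -d '[:space:]' < "$local_def")" = "$(tr -d '[:space:]' < "$zk_def")" ]
}

# Example use (file names are placeholders):
#   profiles_match local-profiles.json zk-profiles.json \
#     || { echo "profile definitions differ; not seeding" >&2; exit 1; }
```

Refusing to run on a mismatch is only one possible answer; warning and
continuing would be another.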


On Thu, Sep 20, 2018 at 10:06 AM Justin Leet <[email protected]> wrote:

> The Batch Profiler not being able to read from ZK feels like a fairly
> substantial, if subtle, set of potential problems.  I'd like to see that
> added either before merging or at least pretty soon after merging.  Is it a
> lot of work to add that functionality based on where things are right now?
>
> On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <[email protected]> wrote:
>
> > Here is another limitation that I just thought of. It can only read a
> > profile definition from a file.  It probably also makes sense to add an
> > option that allows it to read the current Profiler configuration from
> > Zookeeper.
> >
> >
> > > Is it worth setting up a default config that pulls from the main
> > > indexing output?
> >
> > Yes, I think that makes sense.  We want the Batch Profiler to point to
> > the right HDFS URL, no matter where/how Metron is deployed.  When Metron
> > gets spun-up on a cluster, I should be able to just run the Batch Profiler
> > without having to fuss with the input path.
> >
> >
> >
> >
> >
> > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <[email protected]> wrote:
> >
> > > Re:
> > >
> > > >  * You do not configure the Batch Profiler in Ambari.  It is
> > > > configured and executed completely from the command-line.
> > > >
> > >
> > > Is it worth setting up a default config that pulls from the main
> > > indexing output?  I'm a little on the fence about it, but it seems like
> > > making the most common case more or less built-in would be nice.
> > >
> > > Having said that, I do not consider that a requirement for merging the
> > > feature branch.
> > >
> > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <[email protected]> wrote:
> > >
> > > > I think what you have outlined above is a good initial stab at the
> > > > feature.  Manual install of Spark is not a big deal.  Configuring via
> > > > command line while we mature this feature is ok as well.  Doesn't look
> > > > like configuration steps are too hard.  I think you should merge.
> > > >
> > > > James
> > > >
> > > > 19.09.2018, 08:15, "Nick Allen" <[email protected]>:
> > > > > I would like to open a discussion to get the Batch Profiler feature
> > > > > branch merged into master as part of METRON-1699 [1] Create Batch
> > > > > Profiler. All of the work that I had in mind for our first draft of
> > > > > the Batch Profiler has been completed. Please take a look through what
> > > > > I have and let me know if there are other features that you think are
> > > > > required *before* we merge.
> > > > >
> > > > > Previous list discussions on this topic include [2] and [3].
> > > > >
> > > > > (Q) What can I do with the feature branch?
> > > > >
> > > > >   * With the Batch Profiler, you can backfill/seed profiles using
> > > > > archived telemetry. This enables the following types of use cases.
> > > > >
> > > > >       1. As a Security Data Scientist, I want to understand the
> > > > > historical behaviors and trends of a profile that I have created so
> > > > > that I can determine if I have created a feature set that has
> > > > > predictive value for model building.
> > > > >
> > > > >       2. As a Security Data Scientist, I want to understand the
> > > > > historical behaviors and trends of a profile that I have created so
> > > > > that I can determine if I have defined the profile correctly and
> > > > > created a feature set that matches reality.
> > > > >
> > > > >       3. As a Security Platform Engineer, I want to generate a profile
> > > > > using archived telemetry when I deploy a new model to production so
> > > > > that models depending on that profile can function on day 1.
> > > > >
> > > > >   * METRON-1699 [1] includes a more detailed description of the
> > > > > feature.
> > > > >
> > > > > (Q) What work was completed?
> > > > >
> > > > >   * The Batch Profiler runs on Spark and was implemented in Java to
> > > > > remain consistent with our current Java-heavy code base.
> > > > >
> > > > >   * The Batch Profiler is executed from the command-line. It can be
> > > > > launched using a script or by calling `spark-submit`, which may be
> > > > > useful for advanced users.
> > > > >
> > > > >   * Input telemetry can be consumed from multiple sources; for example
> > > > > HDFS or the local file system.
> > > > >
> > > > >   * Input telemetry can be consumed in multiple formats; for example
> > > > > JSON or ORC.
> > > > >
> > > > >   * The 'output' profile measurements are persisted in HBase and are
> > > > > consistent with the Storm Profiler.
> > > > >
> > > > >   * It can be run on any underlying engine supported by Spark. I have
> > > > > tested it both in 'local' mode and on a YARN cluster.
> > > > >
> > > > >   * It is installed automatically by the Metron MPack.
> > > > >
> > > > >   * A README was added that documents usage instructions.
> > > > >
> > > > >   * The existing Profiler code was refactored so that as much code as
> > > > > possible is shared between the 3 Profiler ports: Storm, the Stellar
> > > > > REPL, and Spark. For example, the logic which determines the timestamp
> > > > > of a message was refactored so that it could be reused by all ports.
> > > > >
> > > > >       * metron-profiler-common: The common Profiler code shared among
> > > > > all ports.
> > > > >       * metron-profiler-storm: Profiler on Storm
> > > > >       * metron-profiler-spark: Profiler on Spark
> > > > >       * metron-profiler-repl: Profiler on the Stellar REPL
> > > > >       * metron-profiler-client: The client code for retrieving profile
> > > > > data; for example PROFILE_GET.
> > > > >
> > > > >   * There are 3 separate RPM and DEB packages now created for the
> > > > > Profiler.
> > > > >
> > > > >       * metron-profiler-storm-*.rpm
> > > > >       * metron-profiler-spark-*.rpm
> > > > >       * metron-profiler-repl-*.rpm
> > > > >
> > > > >   * The Profiler integration tests were enhanced to leverage the
> > > > > Profiler Client logic to validate the results.
> > > > >
> > > > >   * Review METRON-1699 [1] for a complete break-down of the tasks that
> > > > > have been completed on the feature branch.
> > > > >
> > > > > (Q) What limitations exist?
> > > > >
> > > > >   * You must manually install Spark to use the Batch Profiler. The
> > > > > Metron MPack does not treat Spark as a Metron dependency and so does
> > > > > not install it automatically.
> > > > >
> > > > >   * You do not configure the Batch Profiler in Ambari. It is configured
> > > > > and executed completely from the command-line.
> > > > >
> > > > >   * To run the Batch Profiler in 'Full Dev', you have to take the
> > > > > following manual steps. Some of these are arguably limitations with
> > > > > how Ambari installs Spark 2 in the version of HDP that we run.
> > > > >
> > > > >       1. Install Spark 2 using Ambari.
> > > > >
> > > > >       2. Tell Spark how to talk with HBase.
> > > > >
> > > > >         SPARK_HOME=/usr/hdp/current/spark2-client
> > > > >         cp /usr/hdp/current/hbase-client/conf/hbase-site.xml \
> > > > >           $SPARK_HOME/conf/
> > > > >
> > > > >       3. Create the Spark History directory in HDFS.
> > > > >
> > > > >         export HADOOP_USER_NAME=hdfs
> > > > >         hdfs dfs -mkdir /spark2-history
> > > > >
> > > > >       4. Change the default input path to `hdfs://localhost:8020/...`
> > > > > to match the port defined by HDP, instead of port 9000.
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/METRON-1699
> > > > > [2] https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
> > > > > [3] https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
> > > >
> > > > -------------------
> > > > Thank you,
> > > >
> > > > James Sirota
> > > > PMC- Apache Metron
> > > > jsirota AT apache DOT org
> > > >
> > > >
> > >
> >
>
