The Batch Profiler not being able to read profile definitions from ZooKeeper feels like a fairly substantial, if subtle, source of potential problems. I'd like to see that added either before merging or at least pretty soon after. Is it a lot of work to add that functionality, given where things are right now?
On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <n...@nickallen.org> wrote:

> Here is another limitation that I just thought of. It can only read a
> profile definition from a file. It probably also makes sense to add an
> option that allows it to read the current Profiler configuration from
> Zookeeper.
>
> > Is it worth setting up a default config that pulls from the main
> > indexing output?
>
> Yes, I think that makes sense. We want the Batch Profiler to point to the
> right HDFS URL, no matter where/how Metron is deployed. When Metron gets
> spun-up on a cluster, I should be able to just run the Batch Profiler
> without having to fuss with the input path.
>
> On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <justinjl...@gmail.com> wrote:
>
> > Re:
> >
> > > * You do not configure the Batch Profiler in Ambari. It is configured
> > > and executed completely from the command-line.
> >
> > Is it worth setting up a default config that pulls from the main
> > indexing output? I'm a little on the fence about it, but it seems like
> > making the most common case more or less built-in would be nice.
> >
> > Having said that, I do not consider that a requirement for merging the
> > feature branch.
> >
> > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <jsir...@apache.org> wrote:
> >
> > > I think what you have outlined above is a good initial stab at the
> > > feature. Manual install of Spark is not a big deal. Configuring via
> > > command line while we mature this feature is ok as well. Doesn't look
> > > like configuration steps are too hard. I think you should merge.
> > >
> > > James
> > >
> > > 19.09.2018, 08:15, "Nick Allen" <n...@nickallen.org>:
> > > > I would like to open a discussion to get the Batch Profiler feature
> > > > branch merged into master as part of METRON-1699 [1] Create Batch
> > > > Profiler. All of the work that I had in mind for our first draft of
> > > > the Batch Profiler has been completed.
> > > > Please take a look through what I have and let me know if there are
> > > > other features that you think are required *before* we merge.
> > > >
> > > > Previous list discussions on this topic include [2] and [3].
> > > >
> > > > (Q) What can I do with the feature branch?
> > > >
> > > > * With the Batch Profiler, you can backfill/seed profiles using
> > > > archived telemetry. This enables the following types of use cases.
> > > >
> > > >     1. As a Security Data Scientist, I want to understand the
> > > >     historical behaviors and trends of a profile that I have created
> > > >     so that I can determine if I have created a feature set that has
> > > >     predictive value for model building.
> > > >
> > > >     2. As a Security Data Scientist, I want to understand the
> > > >     historical behaviors and trends of a profile that I have created
> > > >     so that I can determine if I have defined the profile correctly
> > > >     and created a feature set that matches reality.
> > > >
> > > >     3. As a Security Platform Engineer, I want to generate a profile
> > > >     using archived telemetry when I deploy a new model to production
> > > >     so that models depending on that profile can function on day 1.
> > > >
> > > > * METRON-1699 [1] includes a more detailed description of the
> > > > feature.
> > > >
> > > > (Q) What work was completed?
> > > >
> > > > * The Batch Profiler runs on Spark and was implemented in Java to
> > > > remain consistent with our current Java-heavy code base.
> > > >
> > > > * The Batch Profiler is executed from the command-line. It can be
> > > > launched using a script or by calling `spark-submit`, which may be
> > > > useful for advanced users.
> > > >
> > > > * Input telemetry can be consumed from multiple sources; for example
> > > > HDFS or the local file system.
> > > >
> > > > * Input telemetry can be consumed in multiple formats; for example
> > > > JSON or ORC.
> > > > * The output profile measurements are persisted in HBase, which is
> > > > consistent with the Storm Profiler.
> > > >
> > > > * It can be run on any underlying engine supported by Spark. I have
> > > > tested it both in 'local' mode and on a YARN cluster.
> > > >
> > > > * It is installed automatically by the Metron MPack.
> > > >
> > > > * A README was added that documents usage instructions.
> > > >
> > > > * The existing Profiler code was refactored so that as much code as
> > > > possible is shared between the 3 Profiler ports: Storm, the Stellar
> > > > REPL, and Spark. For example, the logic which determines the
> > > > timestamp of a message was refactored so that it could be reused by
> > > > all ports.
> > > >
> > > >     * metron-profiler-common: The common Profiler code shared
> > > >     amongst each port.
> > > >     * metron-profiler-storm: Profiler on Storm.
> > > >     * metron-profiler-spark: Profiler on Spark.
> > > >     * metron-profiler-repl: Profiler on the Stellar REPL.
> > > >     * metron-profiler-client: The client code for retrieving
> > > >     profile data; for example PROFILE_GET.
> > > >
> > > > * There are 3 separate RPM and DEB packages now created for the
> > > > Profiler.
> > > >
> > > >     * metron-profiler-storm-*.rpm
> > > >     * metron-profiler-spark-*.rpm
> > > >     * metron-profiler-repl-*.rpm
> > > >
> > > > * The Profiler integration tests were enhanced to leverage the
> > > > Profiler Client logic to validate the results.
> > > >
> > > > * Review METRON-1699 [1] for a complete break-down of the tasks
> > > > that have been completed on the feature branch.
> > > >
> > > > (Q) What limitations exist?
> > > >
> > > > * You must manually install Spark to use the Batch Profiler. The
> > > > Metron MPack does not treat Spark as a Metron dependency and so
> > > > does not install it automatically.
> > > >
> > > > * You do not configure the Batch Profiler in Ambari.
> > > > It is configured and executed completely from the command-line.
> > > >
> > > > * To run the Batch Profiler in 'Full Dev', you have to take the
> > > > following manual steps. Some of these are arguably limitations with
> > > > how Ambari installs Spark 2 in the version of HDP that we run.
> > > >
> > > >     1. Install Spark 2 using Ambari.
> > > >
> > > >     2. Tell Spark how to talk with HBase.
> > > >
> > > >         SPARK_HOME=/usr/hdp/current/spark2-client
> > > >         cp /usr/hdp/current/hbase-client/conf/hbase-site.xml $SPARK_HOME/conf/
> > > >
> > > >     3. Create the Spark History directory in HDFS.
> > > >
> > > >         export HADOOP_USER_NAME=hdfs
> > > >         hdfs dfs -mkdir /spark2-history
> > > >
> > > >     4. Change the default input path to `hdfs://localhost:8020/...`
> > > >     to match the port defined by HDP, instead of port 9000.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/METRON-1699
> > > > [2] https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
> > > > [3] https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
> > >
> > > -------------------
> > > Thank you,
> > >
> > > James Sirota
> > > PMC- Apache Metron
> > > jsirota AT apache DOT org
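P.S. For anyone trying this on Full Dev, steps 2 and 3 quoted above boil down to a short shell session. This is only a sketch: the paths follow the HDP layout mentioned in the thread, and I've added `export` and `-p` so the commands are safe to re-run; it isn't a substitute for the README.

```shell
# Point at the Spark 2 client that Ambari installs (HDP layout).
export SPARK_HOME=/usr/hdp/current/spark2-client

# Step 2: let Spark talk to HBase by copying the HBase client config
# onto Spark's classpath.
cp /usr/hdp/current/hbase-client/conf/hbase-site.xml "$SPARK_HOME/conf/"

# Step 3: create the Spark history directory in HDFS, acting as the
# hdfs superuser; -p makes this a no-op if it already exists.
export HADOOP_USER_NAME=hdfs
hdfs dfs -mkdir -p /spark2-history
```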