I think the main difference between this and the flat file loader is that we
actively maintain our profiles in ZK for streaming.  Loading from files is
likely going to be the main usage, particularly for speculative profiles.

For me, the main use case for ZK is definitely use case 3.

I can definitely be persuaded that this isn't a blocker for right now, but
I think there will be problems in practice from not having the
functionality.  E.g. "We want to refresh everything because of mistake X,
and nobody refreshed the file/ZK and they've diverged."  While nobody likes
to refresh prod data (or some subset), I have seen it happen in literally
every single project I've worked on.  On dev/integration environments this
is even more likely.  Most people probably aren't going to store these
files in their version control (even though they probably should), and
these sorts of divergences will happen.

It's just cleaner from a usage/management perspective to say "I want to
put a profile in prod; just use the streaming profiler and the batch
profiler with the same setup and they're good to go."
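
To make the "same setup" point concrete, here is a minimal sketch of what a
shared profile definition could look like in the Profiler's JSON format (the
"hello-world" profile name and the simple count logic are made-up
illustrations, not anything from the feature branch).  The idea is that the
identical definition drives the streaming profiler via ZK and the batch
profiler via a file:

```json
{
  "profiles": [
    {
      "profile": "hello-world",
      "foreach": "ip_src_addr",
      "init":    { "count": "0" },
      "update":  { "count": "count + 1" },
      "result":  "count"
    }
  ]
}
```

If one definition like this is the single source of truth for both
profilers, the file/ZK divergence problem above goes away.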

On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:

> Ok, makes sense. That's sort of what I was thinking as well, Nick. Pulling
> at this thread just a bit more...
>
>    1. I have an existing system that's been up a while, and I have added k
>    profiles - assume these are the first profiles I've created.
>       1. I would have t0 - tm (where m is the time when the profiles were
>       first installed) worth of data that has not been profiled yet.
>       2. The batch profiler process would be to take that exact profile
>       definition from ZK and run the batch loader with that from the CLI.
>       3. Profiles are now up to date from t0 - tCurrent
>    2. I've already done #1 above. Time goes by and now I want to add a new
>    profile.
>       1. Same first step above
>       2. I would run the batch loader with *only* that new profile
>       definition to seed?
>
> Forgive me if I missed this in PR's and discussion in the FB, but how do we
> establish "tm" from 1.1 above? Any concerns about overlap or gaps after the
> seeding is performed?
>
> On Thu, Sep 20, 2018 at 10:26 AM Nick Allen <n...@nickallen.org> wrote:
>
> > I think more often than not, you would want to load your profile
> > definition from a file.  This is why I considered the 'load from Zk'
> > more of a nice-to-have.
> >
> >    - In use cases 1 and 2, this would definitely be the case.  The
> >    profiles I am working with are speculative and I am using the batch
> >    profiler to determine if they are worth keeping.  In this case, my
> >    speculative profiles would not be in Zk (yet).
> >    - In use case 3, I could see it go either way.  It might be useful to
> >    load from Zk, but it certainly isn't a blocker.
> >
> >
> > > So if the config does not correctly match the profiler config held in
> > > ZK and the user runs the batch seeding job, what happens?
> >
> > You would just get a profile that is slightly different over the entire
> > time span.  This is not a new risk.  If the user changes their Profile
> > definitions in Zk, the same thing would happen.
> >
> >
> > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
> >
> > > I think I'm torn on this, specifically because it's batch and would
> > > generally be run as-needed. Justin, can you elaborate on your concerns
> > > there? This feels functionally very similar to our flat file loaders,
> > > which all have inputs for config from the CLI only. On the other hand,
> > > our flat file loaders are not typically seeding an existing structure.
> > > My concern with a local file profiler config stems from this stated
> > > goal:
> > > > The goal would be to enable “profile seeding” which allows profiles
> > > > to be populated from a time before the profile was created.
> > > So if the config does not correctly match the profiler config held in
> > > ZK and the user runs the batch seeding job, what happens?
> > >
> > > On Thu, Sep 20, 2018 at 10:06 AM Justin Leet <justinjl...@gmail.com>
> > > wrote:
> > >
> > > > The profiler not being able to read from ZK feels like a fairly
> > > > substantial, if subtle, set of potential problems.  I'd like to see
> > > > that added either before merging or at least pretty soon after
> > > > merging.  Is it a lot of work to add that functionality based on
> > > > where things are right now?
> > > >
> > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <n...@nickallen.org> wrote:
> > > >
> > > > > Here is another limitation that I just thought of.  It can only
> > > > > read a profile definition from a file.  It probably also makes
> > > > > sense to add an option that allows it to read the current Profiler
> > > > > configuration from Zookeeper.
> > > > >
> > > > >
> > > > > > Is it worth setting up a default config that pulls from the main
> > > > > > indexing output?
> > > > >
> > > > > Yes, I think that makes sense.  We want the Batch Profiler to point
> > > > > to the right HDFS URL, no matter where/how Metron is deployed.
> > > > > When Metron gets spun up on a cluster, I should be able to just run
> > > > > the Batch Profiler without having to fuss with the input path.
> > > > >
> > > > > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <justinjl...@gmail.com> wrote:
> > > > >
> > > > > > Re:
> > > > > >
> > > > > > >  * You do not configure the Batch Profiler in Ambari.  It is
> > > > > > > configured and executed completely from the command-line.
> > > > > > >
> > > > > >
> > > > > > Is it worth setting up a default config that pulls from the main
> > > > > > indexing output?  I'm a little on the fence about it, but it
> > > > > > seems like making the most common case more or less built-in
> > > > > > would be nice.
> > > > > >
> > > > > > Having said that, I do not consider that a requirement for
> > > > > > merging the feature branch.
> > > > > >
> > > > > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <jsir...@apache.org> wrote:
> > > > > >
> > > > > > > I think what you have outlined above is a good initial stab at
> > > > > > > the feature.  Manual install of Spark is not a big deal.
> > > > > > > Configuring via the command line while we mature this feature
> > > > > > > is ok as well.  It doesn't look like the configuration steps
> > > > > > > are too hard.  I think you should merge.
> > > > > > >
> > > > > > > James
> > > > > > >
> > > > > > > 19.09.2018, 08:15, "Nick Allen" <n...@nickallen.org>:
> > > > > > > > I would like to open a discussion to get the Batch Profiler
> > > > > > > > feature branch merged into master as part of METRON-1699 [1]
> > > > > > > > Create Batch Profiler.  All of the work that I had in mind
> > > > > > > > for our first draft of the Batch Profiler has been completed.
> > > > > > > > Please take a look through what I have and let me know if
> > > > > > > > there are other features that you think are required *before*
> > > > > > > > we merge.
> > > > > > > >
> > > > > > > > Previous list discussions on this topic include [2] and [3].
> > > > > > > >
> > > > > > > > (Q) What can I do with the feature branch?
> > > > > > > >
> > > > > > > >   * With the Batch Profiler, you can backfill/seed profiles
> > > > > > > > using archived telemetry. This enables the following types of
> > > > > > > > use cases.
> > > > > > > >
> > > > > > > >       1. As a Security Data Scientist, I want to understand
> > > > > > > > the historical behaviors and trends of a profile that I have
> > > > > > > > created so that I can determine if I have created a feature
> > > > > > > > set that has predictive value for model building.
> > > > > > > >
> > > > > > > >       2. As a Security Data Scientist, I want to understand
> > > > > > > > the historical behaviors and trends of a profile that I have
> > > > > > > > created so that I can determine if I have defined the profile
> > > > > > > > correctly and created a feature set that matches reality.
> > > > > > > >
> > > > > > > >       3. As a Security Platform Engineer, I want to generate
> > > > > > > > a profile using archived telemetry when I deploy a new model
> > > > > > > > to production so that models depending on that profile can
> > > > > > > > function on day 1.
> > > > > > > >
> > > > > > > >   * METRON-1699 [1] includes a more detailed description of
> > > > > > > > the feature.
> > > > > > > >
> > > > > > > > (Q) What work was completed?
> > > > > > > >
> > > > > > > >   * The Batch Profiler runs on Spark and was implemented in
> > > > > > > > Java to remain consistent with our current Java-heavy code
> > > > > > > > base.
> > > > > > > >
> > > > > > > >   * The Batch Profiler is executed from the command-line. It
> > > > > > > > can be launched using a script or by calling `spark-submit`,
> > > > > > > > which may be useful for advanced users.
> > > > > > > >
> > > > > > > >   * Input telemetry can be consumed from multiple sources;
> > > > > > > > for example, HDFS or the local file system.
> > > > > > > >
> > > > > > > >   * Input telemetry can be consumed in multiple formats; for
> > > > > > > > example, JSON or ORC.
> > > > > > > >
> > > > > > > >   * The output profile measurements are persisted in HBase
> > > > > > > > and are consistent with those written by the Storm Profiler.
> > > > > > > >
> > > > > > > >   * It can be run on any underlying engine supported by
> > > > > > > > Spark.  I have tested it both in 'local' mode and on a YARN
> > > > > > > > cluster.
> > > > > > > >
> > > > > > > >   * It is installed automatically by the Metron MPack.
> > > > > > > >
> > > > > > > >   * A README was added that documents usage instructions.
> > > > > > > >
> > > > > > > >   * The existing Profiler code was refactored so that as much
> > > > > > > > code as possible is shared between the 3 Profiler ports:
> > > > > > > > Storm, the Stellar REPL, and Spark. For example, the logic
> > > > > > > > which determines the timestamp of a message was refactored so
> > > > > > > > that it could be reused by all ports.
> > > > > > > >
> > > > > > > >       * metron-profiler-common: The common Profiler code
> > > > > > > > shared among the ports.
> > > > > > > >       * metron-profiler-storm: Profiler on Storm
> > > > > > > >       * metron-profiler-spark: Profiler on Spark
> > > > > > > >       * metron-profiler-repl: Profiler on the Stellar REPL
> > > > > > > >       * metron-profiler-client: The client code for
> > > > > > > > retrieving profile data; for example, PROFILE_GET.
> > > > > > > >
> > > > > > > >   * There are 3 separate RPM and DEB packages now created for
> > > > > > > > the Profiler.
> > > > > > > >
> > > > > > > >       * metron-profiler-storm-*.rpm
> > > > > > > >       * metron-profiler-spark-*.rpm
> > > > > > > >       * metron-profiler-repl-*.rpm
> > > > > > > >
> > > > > > > >   * The Profiler integration tests were enhanced to leverage
> > > > > > > > the Profiler Client logic to validate the results.
> > > > > > > >
> > > > > > > >   * Review METRON-1699 [1] for a complete breakdown of the
> > > > > > > > tasks that have been completed on the feature branch.
> > > > > > > >
> > > > > > > > (Q) What limitations exist?
> > > > > > > >
> > > > > > > >   * You must manually install Spark to use the Batch
> > > > > > > > Profiler.  The Metron MPack does not treat Spark as a Metron
> > > > > > > > dependency and so does not install it automatically.
> > > > > > > >
> > > > > > > >   * You do not configure the Batch Profiler in Ambari. It is
> > > > > > > > configured and executed completely from the command-line.
> > > > > > > >
> > > > > > > >   * To run the Batch Profiler in 'Full Dev', you have to take
> > > > > > > > the following manual steps. Some of these are arguably
> > > > > > > > limitations with how Ambari installs Spark 2 in the version
> > > > > > > > of HDP that we run.
> > > > > > > >
> > > > > > > >       1. Install Spark 2 using Ambari.
> > > > > > > >
> > > > > > > >       2. Tell Spark how to talk with HBase.
> > > > > > > >
> > > > > > > >         SPARK_HOME=/usr/hdp/current/spark2-client
> > > > > > > >         cp /usr/hdp/current/hbase-client/conf/hbase-site.xml \
> > > > > > > >           $SPARK_HOME/conf/
> > > > > > > >
> > > > > > > >       3. Create the Spark History directory in HDFS.
> > > > > > > >
> > > > > > > >         export HADOOP_USER_NAME=hdfs
> > > > > > > >         hdfs dfs -mkdir /spark2-history
> > > > > > > >
> > > > > > > >       4. Change the default input path to
> > > > > > > > `hdfs://localhost:8020/...` to match the port defined by HDP,
> > > > > > > > instead of port 9000.
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/METRON-1699
> > > > > > > > [2]
> > > > > > > > https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
> > > > > > > > [3]
> > > > > > > > https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
> > > > > > >
> > > > > > > -------------------
> > > > > > > Thank you,
> > > > > > >
> > > > > > > James Sirota
> > > > > > > PMC- Apache Metron
> > > > > > > jsirota AT apache DOT org
> > > > > > >
