Re: [DISCUSS] Batch Profiler Feature Branch

Michael Miklavcic Thu, 20 Sep 2018 10:23:07 -0700

So in use case 3: if you had 6 months of data that hadn't been profiled
and another 3 months that had been profiled (9 months of data in total),
does the batch job, in its current form, run over all 9 months?
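
To make the scenario concrete, here is a minimal sketch of what a begin/end
constraint (the --begin and --end args proposed in the quoted reply below,
which are not in the feature branch) would do to the input. It is not the
Batch Profiler's actual code. The function names, the "timestamp" field name,
and the dates are all assumptions chosen for illustration.

```python
from datetime import datetime, timezone


def to_millis(dt):
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return int(dt.timestamp() * 1000)


def within_window(messages, begin_ms, end_ms, ts_field="timestamp"):
    """Keep messages whose timestamp falls in the half-open window [begin, end)."""
    return [m for m in messages if begin_ms <= m[ts_field] < end_ms]


# Nine months of archived telemetry, one message per month (Jan..Sep 2018).
# Real telemetry would be far denser; one message per month keeps this short.
archive = [
    {"timestamp": to_millis(datetime(2018, month, 15, tzinfo=timezone.utc))}
    for month in range(1, 10)
]

# t0: start of the archive. tm: the hypothetical moment the streaming
# profile was installed (1 July in this example).
t0 = to_millis(datetime(2018, 1, 1, tzinfo=timezone.utc))
tm = to_millis(datetime(2018, 7, 1, tzinfo=timezone.utc))

# With a window, the batch job would see only the unprofiled span [t0, tm).
unprofiled = within_window(archive, t0, tm)
print(len(unprofiled), "of", len(archive), "messages fall in [t0, tm)")
```

In this sketch only the 6 unprofiled months are selected. Without such a
window, all 9 months would be read and the 3 already-profiled months would be
recomputed and rewritten.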


On Thu, Sep 20, 2018 at 11:13 AM Nick Allen <[email protected]> wrote:

> > How do we establish "tm" from 1.1 above? Any concerns about overlap or
> > gaps after the seeding is performed?
>
> Good point.  Right now, if the Streaming and Batch Profilers overlap, the
> last write wins.  And presumably the outputs of the Streaming and Batch
> Profilers are the same, so no worries, right? :)
>
> So it kind of works, but it is definitely not ideal for use case 3.  I
> could add --begin and --end args to constrain the time frame over which the
> Batch Profiler runs.  I do not have that in the feature branch.  It would
> be easy enough to add though.
>
>
>
> On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <[email protected]> wrote:
>
> > Ok, makes sense. That's sort of what I was thinking as well, Nick. Pulling
> > at this thread just a bit more...
> >
> >    1. I have an existing system that's been up a while, and I have added k
> >    profiles - assume these are the first profiles I've created.
> >       1. I would have t0 - tm (where m is the time when the profiles were
> >       first installed) worth of data that has not been profiled yet.
> >       2. The batch profiler process would be to take that exact profile
> >       definition from ZK and run the batch loader with that from the CLI.
> >       3. Profiles are now up to date from t0 - tCurrent
> >    2. I've already done #1 above. Time goes by and now I want to add a new
> >    profile.
> >       1. Same first step above
> >       2. I would run the batch loader with *only* that new profile
> >       definition to seed?
> >
> > Forgive me if I missed this in PRs and discussion in the FB, but how do we
> > establish "tm" from 1.1 above? Any concerns about overlap or gaps after the
> > seeding is performed?
> >
> > On Thu, Sep 20, 2018 at 10:26 AM Nick Allen <[email protected]> wrote:
> >
> > > I think more often than not, you would want to load your profile definition
> > > from a file.  This is why I considered the 'load from Zk' more of a
> > > nice-to-have.
> > >
> > >    - In use cases 1 and 2, this would definitely be the case.  The profiles
> > >    I am working with are speculative and I am using the batch profiler to
> > >    determine if they are worth keeping.  In this case, my speculative profiles
> > >    would not be in Zk (yet).
> > >    - In use case 3, I could see it go either way.  It might be useful to
> > >    load from Zk, but it certainly isn't a blocker.
> > >
> > >
> > > > So if the config does not correctly match the profiler config held in ZK and
> > > > the user runs the batch seeding job, what happens?
> > >
> > > You would just get a profile that is slightly different over the entire
> > > time span.  This is not a new risk.  If the user changes their Profile
> > > definitions in Zk, the same thing would happen.
> > >
> > >
> > > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <[email protected]> wrote:
> > >
> > > > I think I'm torn on this, specifically because it's batch and would
> > > > generally be run as-needed. Justin, can you elaborate on your concerns
> > > > there? This feels functionally very similar to our flat file loaders, which
> > > > all have inputs for config from the CLI only. On the other hand, our flat
> > > > file loaders are not typically seeding an existing structure. My concern about
> > > > a local file profiler config stems from this stated goal:
> > > > > The goal would be to enable “profile seeding” which allows profiles to be
> > > > > populated from a time before the profile was created.
> > > > So if the config does not correctly match the profiler config held in ZK
> > > > and the user runs the batch seeding job, what happens?
> > > >
> > > > On Thu, Sep 20, 2018 at 10:06 AM Justin Leet <[email protected]> wrote:
> > > >
> > > > > The profiler not being able to read from ZK feels like a fairly substantial,
> > > > > if subtle, set of potential problems.  I'd like to see that in either
> > > > > before merging or at least pretty soon after merging.  Is it a lot of work
> > > > > to add that functionality based on where things are right now?
> > > > >
> > > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <[email protected]> wrote:
> > > > >
> > > > > > Here is another limitation that I just thought of. It can only read a profile
> > > > > > definition from a file.  It probably also makes sense to add an option that
> > > > > > allows it to read the current Profiler configuration from Zookeeper.
> > > > > >
> > > > > >
> > > > > > > Is it worth setting up a default config that pulls from the main indexing
> > > > > > > output?
> > > > > >
> > > > > > Yes, I think that makes sense.  We want the Batch Profiler to point to the
> > > > > > right HDFS URL, no matter where/how Metron is deployed.  When Metron gets
> > > > > > spun-up on a cluster, I should be able to just run the Batch Profiler
> > > > > > without having to fuss with the input path.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <[email protected]> wrote:
> > > > > >
> > > > > > > Re:
> > > > > > >
> > > > > > > >  * You do not configure the Batch Profiler in Ambari.  It is configured
> > > > > > > > and executed completely from the command-line.
> > > > > > > >
> > > > > > >
> > > > > > > Is it worth setting up a default config that pulls from the main indexing
> > > > > > > output?  I'm a little on the fence about it, but it seems like making the
> > > > > > > most common case more or less built-in would be nice.
> > > > > > >
> > > > > > > Having said that, I do not consider that a requirement for merging the
> > > > > > > feature branch.
> > > > > > >
> > > > > > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <[email protected]> wrote:
> > > > > > >
> > > > > > > > I think what you have outlined above is a good initial stab at the
> > > > > > > > feature.  Manual install of Spark is not a big deal.  Configuring via
> > > > > > > > command line while we mature this feature is ok as well.  Doesn't look like
> > > > > > > > configuration steps are too hard.  I think you should merge.
> > > > > > > >
> > > > > > > > James
> > > > > > > >
> > > > > > > > 19.09.2018, 08:15, "Nick Allen" <[email protected]>:
> > > > > > > > > I would like to open a discussion to get the Batch Profiler feature branch
> > > > > > > > > merged into master as part of METRON-1699 [1] Create Batch Profiler. All
> > > > > > > > > of the work that I had in mind for our first draft of the Batch Profiler
> > > > > > > > > has been completed. Please take a look through what I have and let me know
> > > > > > > > > if there are other features that you think are required *before* we merge.
> > > > > > > > >
> > > > > > > > > Previous list discussions on this topic include [2] and [3].
> > > > > > > > >
> > > > > > > > > (Q) What can I do with the feature branch?
> > > > > > > > >
> > > > > > > > >   * With the Batch Profiler, you can backfill/seed profiles using archived
> > > > > > > > > telemetry. This enables the following types of use cases.
> > > > > > > > >
> > > > > > > > >       1. As a Security Data Scientist, I want to understand the historical
> > > > > > > > > behaviors and trends of a profile that I have created so that I can
> > > > > > > > > determine if I have created a feature set that has predictive value for
> > > > > > > > > model building.
> > > > > > > > >
> > > > > > > > >       2. As a Security Data Scientist, I want to understand the historical
> > > > > > > > > behaviors and trends of a profile that I have created so that I can
> > > > > > > > > determine if I have defined the profile correctly and created a feature set
> > > > > > > > > that matches reality.
> > > > > > > > >
> > > > > > > > >       3. As a Security Platform Engineer, I want to generate a profile
> > > > > > > > > using archived telemetry when I deploy a new model to production so that
> > > > > > > > > models depending on that profile can function on day 1.
> > > > > > > > >
> > > > > > > > >   * METRON-1699 [1] includes a more detailed description of the feature.
> > > > > > > > >
> > > > > > > > > (Q) What work was completed?
> > > > > > > > >
> > > > > > > > >   * The Batch Profiler runs on Spark and was implemented in Java to remain
> > > > > > > > > consistent with our current Java-heavy code base.
> > > > > > > > >
> > > > > > > > >   * The Batch Profiler is executed from the command-line. It can be
> > > > > > > > > launched using a script or by calling `spark-submit`, which may be useful
> > > > > > > > > for advanced users.
> > > > > > > > >
> > > > > > > > >   * Input telemetry can be consumed from multiple sources; for example HDFS
> > > > > > > > > or the local file system.
> > > > > > > > >
> > > > > > > > >   * Input telemetry can be consumed in multiple formats; for example JSON
> > > > > > > > > or ORC.
> > > > > > > > >
> > > > > > > > >   * The 'output' profile measurements are persisted in HBase and are
> > > > > > > > > consistent with the Storm Profiler.
> > > > > > > > >
> > > > > > > > >   * It can be run on any underlying engine supported by Spark. I have
> > > > > > > > > tested it both in 'local' mode and on a YARN cluster.
> > > > > > > > >
> > > > > > > > >   * It is installed automatically by the Metron MPack.
> > > > > > > > >
> > > > > > > > >   * A README was added that documents usage instructions.
> > > > > > > > >
> > > > > > > > >   * The existing Profiler code was refactored so that as much code as
> > > > > > > > > possible is shared between the 3 Profiler ports; Storm, the Stellar REPL,
> > > > > > > > > and Spark. For example, the logic which determines the timestamp of a
> > > > > > > > > message was refactored so that it could be reused by all ports.
> > > > > > > > >
> > > > > > > > >       * metron-profiler-common: The common Profiler code shared amongst
> > > > > > > > > each port.
> > > > > > > > >       * metron-profiler-storm: Profiler on Storm
> > > > > > > > >       * metron-profiler-spark: Profiler on Spark
> > > > > > > > >       * metron-profiler-repl: Profiler on the Stellar REPL
> > > > > > > > >       * metron-profiler-client: The client code for retrieving profile
> > > > > > > > > data; for example PROFILE_GET.
> > > > > > > > >
> > > > > > > > >   * There are 3 separate RPM and DEB packages now created for the Profiler.
> > > > > > > > >
> > > > > > > > >       * metron-profiler-storm-*.rpm
> > > > > > > > >       * metron-profiler-spark-*.rpm
> > > > > > > > >       * metron-profiler-repl-*.rpm
> > > > > > > > >
> > > > > > > > >   * The Profiler integration tests were enhanced to leverage the Profiler
> > > > > > > > > Client logic to validate the results.
> > > > > > > > >
> > > > > > > > >   * Review METRON-1699 [1] for a complete break-down of the tasks that have
> > > > > > > > > been completed on the feature branch.
> > > > > > > > >
> > > > > > > > > (Q) What limitations exist?
> > > > > > > > >
> > > > > > > > >   * You must manually install Spark to use the Batch Profiler. The Metron
> > > > > > > > > MPack does not treat Spark as a Metron dependency and so does not install
> > > > > > > > > it automatically.
> > > > > > > > >
> > > > > > > > >   * You do not configure the Batch Profiler in Ambari. It is configured
> > > > > > > > > and executed completely from the command-line.
> > > > > > > > >
> > > > > > > > >   * To run the Batch Profiler in 'Full Dev', you have to take the following
> > > > > > > > > manual steps. Some of these are arguably limitations with how Ambari
> > > > > > > > > installs Spark 2 in the version of HDP that we run.
> > > > > > > > >
> > > > > > > > >       1. Install Spark 2 using Ambari.
> > > > > > > > >
> > > > > > > > >       2. Tell Spark how to talk with HBase.
> > > > > > > > >
> > > > > > > > >         SPARK_HOME=/usr/hdp/current/spark2-client
> > > > > > > > >         cp /usr/hdp/current/hbase-client/conf/hbase-site.xml $SPARK_HOME/conf/
> > > > > > > > >
> > > > > > > > >       3. Create the Spark History directory in HDFS.
> > > > > > > > >
> > > > > > > > >         export HADOOP_USER_NAME=hdfs
> > > > > > > > >         hdfs dfs -mkdir /spark2-history
> > > > > > > > >
> > > > > > > > >       4. Change the default input path to `hdfs://localhost:8020/...` to
> > > > > > > > > match the port defined by HDP, instead of port 9000.
> > > > > > > > >
> > > > > > > > > [1] https://issues.apache.org/jira/browse/METRON-1699
> > > > > > > > > [2] https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
> > > > > > > > > [3] https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
> > > > > > > >
> > > > > > > > -------------------
> > > > > > > > Thank you,
> > > > > > > >
> > > > > > > > James Sirota
> > > > > > > > PMC- Apache Metron
> > > > > > > > jsirota AT apache DOT org
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
