+1 from me as well. Great work.

27.09.2018, 11:15, "Ryan Merriman" <merrim...@gmail.com>:
> +1 from me. Great work.
>
> On Thu, Sep 27, 2018 at 12:41 PM Justin Leet <justinjl...@gmail.com> wrote:
>
>>  I'm +1 on merging the feature branch into master. There's a lot of good
>>  work here, and it's definitely been nice to see the couple remaining
>>  improvements make it in.
>>
>>  Thanks a lot for the contribution, this is great stuff!
>>
>>  On Wed, Sep 26, 2018 at 6:26 PM Nick Allen <n...@nickallen.org> wrote:
>>
>>  > Or is there support to be offered for merging this feature branch into master?
>>  >
>>  > On Wed, Sep 26, 2018 at 6:20 PM Nick Allen <n...@nickallen.org> wrote:
>>  >
>>  > > Thanks for the review. With https://github.com/apache/metron/pull/1209
>>  > > complete, I think the feature branch is ready to be merged. Sounds like
>>  > > I have Mike's support. Anyone else have comments, concerns, questions?
>>  > >
>>  > > On Tue, Sep 25, 2018 at 10:33 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
>>  > >
>>  > >> I just made a couple minor comments on that PR, and I am in agreement
>>  > >> about the readiness for merging with master. Good stuff, Nick.
>>  > >>
>>  > >> On Fri, Sep 21, 2018 at 12:37 PM Nick Allen <n...@nickallen.org> wrote:
>>  > >>
>>  > >> > Here is a PR that adds the input time constraints to the Batch Profiler
>>  > >> > (METRON-1787): https://github.com/apache/metron/pull/1209.
>>  > >> >
>>  > >> > It seems the consensus is that this is probably the last feature we
>>  > >> > need before merging the FB into master. The other two can wait until
>>  > >> > after the feature branch has been merged. Let me know if you disagree.
>>  > >> >
>>  > >> > Thanks
>>  > >> >
>>  > >> >
>>  > >> > On Thu, Sep 20, 2018 at 1:55 PM Nick Allen <n...@nickallen.org> wrote:
>>  > >> >
>>  > >> > > Yeah, agreed. Per use case 3, when deploying to production there
>>  > >> > > really wouldn't be a huge overlap like 3 months of already profiled
>>  > >> > > data. It's day 1; the profile was just deployed around the same time
>>  > >> > > as you are running the Batch Profiler, so the overlap is in minutes,
>>  > >> > > maybe hours. But I can definitely see the usefulness of the feature
>>  > >> > > for re-runs, etc., as you have described.
>>  > >> > >
>>  > >> > > Based on this discussion, I created a few JIRAs. Thanks all for the
>>  > >> > > great feedback and keep it coming.
>>  > >> > >
>>  > >> > > [1] METRON-1787 - Input Time Constraints for Batch Profiler
>>  > >> > > [2] METRON-1788 - Fetch Profile Definitions from Zk for Batch Profiler
>>  > >> > > [3] METRON-1789 - MPack Should Define Default Input Path for Batch Profiler
>>  > >> > >
>>  > >> > > --
>>  > >> > > [1] https://issues.apache.org/jira/browse/METRON-1787
>>  > >> > > [2] https://issues.apache.org/jira/browse/METRON-1788
>>  > >> > > [3] https://issues.apache.org/jira/browse/METRON-1789
>>  > >> > >
>>  > >> > > On Thu, Sep 20, 2018 at 1:34 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
>>  > >> > >
>>  > >> > >> I think we might want to allow the flexibility to choose the date
>>  > >> > >> range then. I don't yet feel like I have a good enough understanding
>>  > >> > >> of all the ways in which users would want to seed to force them to
>>  > >> > >> run the batch job over all the data. It might also make it easier to
>>  > >> > >> deal with remediation, i.e., an error doesn't force you to re-run over
>>  > >> > >> the entire history. Same goes for testing out the profile seeding
>>  > >> > >> batch job in the first place.
>>  > >> > >>
>>  > >> > >> On Thu, Sep 20, 2018 at 11:23 AM Nick Allen <n...@nickallen.org> wrote:
>>  > >> > >>
>>  > >> > >> > Assuming you have 9 months of data archived, yes.
>>  > >> > >> >
>>  > >> > >> > On Thu, Sep 20, 2018 at 1:22 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
>>  > >> > >> >
>>  > >> > >> > > So in the case of 3 - if you had 6 months of data that hadn't been
>>  > >> > >> > > profiled and another 3 that had been profiled (9 months total data),
>>  > >> > >> > > in its current form the batch job runs over all 9 months?
>>  > >> > >> > >
>>  > >> > >> > > On Thu, Sep 20, 2018 at 11:13 AM Nick Allen <n...@nickallen.org> wrote:
>>  > >> > >> > >
>>  > >> > >> > > > > How do we establish "tm" from 1.1 above? Any concerns about
>>  > >> > >> > > > > overlap or gaps after the seeding is performed?
>>  > >> > >> > > >
>>  > >> > >> > > > Good point. Right now, if the Streaming and Batch Profiler overlap,
>>  > >> > >> > > > the last write wins. And presumably the output of the Streaming and
>>  > >> > >> > > > Batch Profiler are the same, so no worries, right? :)
>>  > >> > >> > > >
>>  > >> > >> > > > So it kind of works, but it is definitely not ideal for use case 3.
>>  > >> > >> > > > I could add --begin and --end args to constrain the time frame over
>>  > >> > >> > > > which the Batch Profiler runs. I do not have that in the feature
>>  > >> > >> > > > branch. It would be easy enough to add, though.
>>  > >> > >> > > >
>>  > >> > >> > > >
>>  > >> > >> > > >
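>>  > >> > >> > > > For illustration, the proposed invocation might look something like
>>  > >> > >> > > > the sketch below. It is only a sketch; the script name, argument
>>  > >> > >> > > > names, and timestamp format are assumptions, since none of this
>>  > >> > >> > > > exists in the feature branch yet.
>>  > >> > >> > > >
>>  > >> > >> > > >     # Illustrative sketch only; --begin/--end are the proposed args,
>>  > >> > >> > > >     # and the script name and timestamp format are assumptions.
>>  > >> > >> > > >     $METRON_HOME/bin/start_batch_profiler.sh \
>>  > >> > >> > > >         --begin "2018-06-01T00:00:00Z" \
>>  > >> > >> > > >         --end   "2018-09-01T00:00:00Z"
>>  > >> > >> > > >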
>>  > >> > >> > > > On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
>>  > >> > >> > > >
>>  > >> > >> > > > > Ok, makes sense. That's sort of what I was thinking as well, Nick.
>>  > >> > >> > > > > Pulling at this thread just a bit more...
>>  > >> > >> > > > >
>>  > >> > >> > > > > 1. I have an existing system that's been up a while, and I have
>>  > >> > >> > > > > added k profiles - assume these are the first profiles I've created.
>>  > >> > >> > > > >    1. I would have t0 - tm (where m is the time when the profiles
>>  > >> > >> > > > >       were first installed) worth of data that has not been profiled yet.
>>  > >> > >> > > > >    2. The batch profiler process would be to take that exact profile
>>  > >> > >> > > > >       definition from ZK and run the batch loader with that from the CLI.
>>  > >> > >> > > > >    3. Profiles are now up to date from t0 - tCurrent.
>>  > >> > >> > > > > 2. I've already done #1 above. Time goes by and now I want to add a
>>  > >> > >> > > > > new profile.
>>  > >> > >> > > > >    1. Same first step as above.
>>  > >> > >> > > > >    2. I would run the batch loader with *only* that new profile
>>  > >> > >> > > > >       definition to seed?
>>  > >> > >> > > > >
>>  > >> > >> > > > > Forgive me if I missed this in PRs and discussion in the FB, but how
>>  > >> > >> > > > > do we establish "tm" from 1.1 above? Any concerns about overlap or
>>  > >> > >> > > > > gaps after the seeding is performed?
>>  > >> > >> > > > >
>>  > >> > >> > > > > On Thu, Sep 20, 2018 at 10:26 AM Nick Allen <n...@nickallen.org> wrote:
>>  > >> > >> > > > >
>>  > >> > >> > > > > > I think more often than not, you would want to load your profile
>>  > >> > >> > > > > > definition from a file. This is why I considered the 'load from Zk'
>>  > >> > >> > > > > > more of a nice-to-have.
>>  > >> > >> > > > > >
>>  > >> > >> > > > > > - In use cases 1 and 2, this would definitely be the case. The
>>  > >> > >> > > > > >   profiles I am working with are speculative and I am using the
>>  > >> > >> > > > > >   batch profiler to determine if they are worth keeping. In this
>>  > >> > >> > > > > >   case, my speculative profiles would not be in Zk (yet).
>>  > >> > >> > > > > > - In use case 3, I could see it go either way. It might be useful
>>  > >> > >> > > > > >   to load from Zk, but it certainly isn't a blocker.
>>  > >> > >> > > > > >
>>  > >> > >> > > > > > > So if the config does not correctly match the profiler config held
>>  > >> > >> > > > > > > in ZK and the user runs the batch seeding job, what happens?
>>  > >> > >> > > > > >
>>  > >> > >> > > > > > You would just get a profile that is slightly different over the
>>  > >> > >> > > > > > entire time span. This is not a new risk. If the user changes their
>>  > >> > >> > > > > > Profile definitions in Zk, the same thing would happen.
>>  > >> > >> > > > > >
>>  > >> > >> > > > > > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
>>  > >> > >> > > > > >
>>  > >> > >> > > > > > > I think I'm torn on this, specifically because it's batch and would
>>  > >> > >> > > > > > > generally be run as-needed. Justin, can you elaborate on your
>>  > >> > >> > > > > > > concerns there? This feels functionally very similar to our flat
>>  > >> > >> > > > > > > file loaders, which all have inputs for config from the CLI only.
>>  > >> > >> > > > > > > On the other hand, our flat file loaders are not typically seeding
>>  > >> > >> > > > > > > an existing structure. My concern about a local file profiler config
>>  > >> > >> > > > > > > stems from this stated goal:
>>  > >> > >> > > > > > > > The goal would be to enable “profile seeding” which allows profiles
>>  > >> > >> > > > > > > > to be populated from a time before the profile was created.
>>  > >> > >> > > > > > > So if the config does not correctly match the profiler config held
>>  > >> > >> > > > > > > in ZK and the user runs the batch seeding job, what happens?
>>  > >> > >> > > > > > >
>>  > >> > >> > > > > > > On Thu, Sep 20, 2018 at 10:06 AM Justin Leet <justinjl...@gmail.com> wrote:
>>  > >> > >> > > > > > >
>>  > >> > >> > > > > > > > The Profiler not being able to read from ZK feels like a fairly
>>  > >> > >> > > > > > > > substantial, if subtle, set of potential problems. I'd like to see
>>  > >> > >> > > > > > > > that in either before merging or at least pretty soon after merging.
>>  > >> > >> > > > > > > > Is it a lot of work to add that functionality based on where things
>>  > >> > >> > > > > > > > are right now?
>>  > >> > >> > > > > > > >
>>  > >> > >> > > > > > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <n...@nickallen.org> wrote:
>>  > >> > >> > > > > > > >
>>  > >> > >> > > > > > > > > Here is another limitation that I just thought of. It can only read
>>  > >> > >> > > > > > > > > a profile definition from a file. It probably also makes sense to
>>  > >> > >> > > > > > > > > add an option that allows it to read the current Profiler
>>  > >> > >> > > > > > > > > configuration from Zookeeper.
>>  > >> > >> > > > > > > > >
>>  > >> > >> > > > > > > > > > Is it worth setting up a default config that pulls from the main
>>  > >> > >> > > > > > > > > > indexing output?
>>  > >> > >> > > > > > > > >
>>  > >> > >> > > > > > > > > Yes, I think that makes sense. We want the Batch Profiler to point
>>  > >> > >> > > > > > > > > to the right HDFS URL, no matter where/how Metron is deployed. When
>>  > >> > >> > > > > > > > > Metron gets spun up on a cluster, I should be able to just run the
>>  > >> > >> > > > > > > > > Batch Profiler without having to fuss with the input path.
>>  > >> > >> > > > > > > > >
>>  > >> > >> > > > > > > > > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <justinjl...@gmail.com> wrote:
>>  > >> > >> > > > > > > > >
>>  > >> > >> > > > > > > > > > Re:
>>  > >> > >> > > > > > > > > >
>>  > >> > >> > > > > > > > > > > * You do not configure the Batch Profiler in Ambari. It is
>>  > >> > >> > > > > > > > > > > configured and executed completely from the command-line.
>>  > >> > >> > > > > > > > > >
>>  > >> > >> > > > > > > > > > Is it worth setting up a default config that pulls from the main
>>  > >> > >> > > > > > > > > > indexing output? I'm a little on the fence about it, but it seems
>>  > >> > >> > > > > > > > > > like making the most common case more or less built-in would be nice.
>>  > >> > >> > > > > > > > > >
>>  > >> > >> > > > > > > > > > Having said that, I do not consider that a requirement for merging
>>  > >> > >> > > > > > > > > > the feature branch.
>>  > >> > >> > > > > > > > > >
>>  > >> > >> > > > > > > > > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <jsir...@apache.org> wrote:
>>  > >> > >> > > > > > > > > >
>>  > >> > >> > > > > > > > > > > I think what you have outlined above is a good initial stab at the
>>  > >> > >> > > > > > > > > > > feature. A manual install of Spark is not a big deal. Configuring via
>>  > >> > >> > > > > > > > > > > the command line while we mature this feature is ok as well. It doesn't
>>  > >> > >> > > > > > > > > > > look like the configuration steps are too hard. I think you should merge.
>>  > >> > >> > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > James
>>  > >> > >> > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > 19.09.2018, 08:15, "Nick Allen" <n...@nickallen.org>:
>>  > >> > >> > > > > > > > > > > > I would like to open a discussion to get the Batch Profiler feature
>>  > >> > >> > > > > > > > > > > > branch merged into master as part of METRON-1699 [1] Create Batch
>>  > >> > >> > > > > > > > > > > > Profiler. All of the work that I had in mind for our first draft of
>>  > >> > >> > > > > > > > > > > > the Batch Profiler has been completed. Please take a look through
>>  > >> > >> > > > > > > > > > > > what I have and let me know if there are other features that you
>>  > >> > >> > > > > > > > > > > > think are required *before* we merge.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > Previous list discussions on this topic include [2] and [3].
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > (Q) What can I do with the feature branch?
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * With the Batch Profiler, you can backfill/seed profiles using
>>  > >> > >> > > > > > > > > > > > archived telemetry. This enables the following types of use cases.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > >     1. As a Security Data Scientist, I want to understand the
>>  > >> > >> > > > > > > > > > > >     historical behaviors and trends of a profile that I have created
>>  > >> > >> > > > > > > > > > > >     so that I can determine if I have created a feature set that has
>>  > >> > >> > > > > > > > > > > >     predictive value for model building.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > >     2. As a Security Data Scientist, I want to understand the
>>  > >> > >> > > > > > > > > > > >     historical behaviors and trends of a profile that I have created
>>  > >> > >> > > > > > > > > > > >     so that I can determine if I have defined the profile correctly
>>  > >> > >> > > > > > > > > > > >     and created a feature set that matches reality.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > >     3. As a Security Platform Engineer, I want to generate a profile
>>  > >> > >> > > > > > > > > > > >     using archived telemetry when I deploy a new model to production
>>  > >> > >> > > > > > > > > > > >     so that models depending on that profile can function on day 1.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * METRON-1699 [1] includes a more detailed description of the feature.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > (Q) What work was completed?
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * The Batch Profiler runs on Spark and was implemented in Java to
>>  > >> > >> > > > > > > > > > > > remain consistent with our current Java-heavy code base.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * The Batch Profiler is executed from the command-line. It can be
>>  > >> > >> > > > > > > > > > > > launched using a script or by calling `spark-submit`, which may be
>>  > >> > >> > > > > > > > > > > > useful for advanced users.
>>  > >> > >> > > > > > > > > > > >
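>>  > >> > >> > > > > > > > > > > > For illustration, a spark-submit launch might look roughly like the
>>  > >> > >> > > > > > > > > > > > sketch below; the class name, jar name, and application options shown
>>  > >> > >> > > > > > > > > > > > here are assumptions for the example, not necessarily what ships in
>>  > >> > >> > > > > > > > > > > > the feature branch.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > >     # Illustrative sketch only; class, jar, and option names are assumptions.
>>  > >> > >> > > > > > > > > > > >     spark-submit \
>>  > >> > >> > > > > > > > > > > >         --class org.apache.metron.profiler.spark.cli.BatchProfilerCLI \
>>  > >> > >> > > > > > > > > > > >         --master yarn \
>>  > >> > >> > > > > > > > > > > >         metron-profiler-spark-*-uber.jar \
>>  > >> > >> > > > > > > > > > > >         --config batch-profiler.properties \
>>  > >> > >> > > > > > > > > > > >         --profiles profiles.json
>>  > >> > >> > > > > > > > > > > >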
>>  > >> > >> > > > > > > > > > > > * Input telemetry can be consumed from multiple sources; for example,
>>  > >> > >> > > > > > > > > > > > HDFS or the local file system.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * Input telemetry can be consumed in multiple formats; for example,
>>  > >> > >> > > > > > > > > > > > JSON or ORC.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * The 'output' profile measurements are persisted in HBase and are
>>  > >> > >> > > > > > > > > > > > consistent with the Storm Profiler.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * It can be run on any underlying engine supported by Spark. I have
>>  > >> > >> > > > > > > > > > > > tested it both in 'local' mode and on a YARN cluster.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * It is installed automatically by the Metron MPack.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * A README was added that documents usage instructions.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * The existing Profiler code was refactored so that as much code as
>>  > >> > >> > > > > > > > > > > > possible is shared between the 3 Profiler ports: Storm, the Stellar
>>  > >> > >> > > > > > > > > > > > REPL, and Spark. For example, the logic which determines the
>>  > >> > >> > > > > > > > > > > > timestamp of a message was refactored so that it could be reused by
>>  > >> > >> > > > > > > > > > > > all ports.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > >     * metron-profiler-common: The common Profiler code shared amongst
>>  > >> > >> > > > > > > > > > > >     the ports.
>>  > >> > >> > > > > > > > > > > >     * metron-profiler-storm: Profiler on Storm
>>  > >> > >> > > > > > > > > > > >     * metron-profiler-spark: Profiler on Spark
>>  > >> > >> > > > > > > > > > > >     * metron-profiler-repl: Profiler on the Stellar REPL
>>  > >> > >> > > > > > > > > > > >     * metron-profiler-client: The client code for retrieving profile
>>  > >> > >> > > > > > > > > > > >     data; for example PROFILE_GET.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * There are 3 separate RPM and DEB packages now created for the
>>  > >> > >> > > > > > > > > > > > Profiler.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > >     * metron-profiler-storm-*.rpm
>>  > >> > >> > > > > > > > > > > >     * metron-profiler-spark-*.rpm
>>  > >> > >> > > > > > > > > > > >     * metron-profiler-repl-*.rpm
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * The Profiler integration tests were enhanced to leverage the
>>  > >> > >> > > > > > > > > > > > Profiler Client logic to validate the results.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * Review METRON-1699 [1] for a complete break-down of the tasks that
>>  > >> > >> > > > > > > > > > > > have been completed on the feature branch.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > (Q) What limitations exist?
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * You must manually install Spark to use the Batch Profiler. The
>>  > >> > >> > > > > > > > > > > > Metron MPack does not treat Spark as a Metron dependency and so does
>>  > >> > >> > > > > > > > > > > > not install it automatically.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * You do not configure the Batch Profiler in Ambari. It is configured
>>  > >> > >> > > > > > > > > > > > and executed completely from the command-line.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > > * To run the Batch Profiler in 'Full Dev', you have to take the
>>  > >> > >> > > > > > > > > > > > following manual steps. Some of these are arguably limitations with
>>  > >> > >> > > > > > > > > > > > how Ambari installs Spark 2 in the version of HDP that we run.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > >     1. Install Spark 2 using Ambari.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > >     2. Tell Spark how to talk with HBase.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > >         SPARK_HOME=/usr/hdp/current/spark2-client
>>  > >> > >> > > > > > > > > > > >         cp /usr/hdp/current/hbase-client/conf/hbase-site.xml $SPARK_HOME/conf/
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > >     3. Create the Spark History directory in HDFS.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > >         export HADOOP_USER_NAME=hdfs
>>  > >> > >> > > > > > > > > > > >         hdfs dfs -mkdir /spark2-history
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > >     4. Change the default input path to `hdfs://localhost:8020/...` to
>>  > >> > >> > > > > > > > > > > >     match the port defined by HDP, instead of port 9000.
>>  > >> > >> > > > > > > > > > > >
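>>  > >> > >> > > > > > > > > > > > For example, after step 4 the batch input property might look something
>>  > >> > >> > > > > > > > > > > > like the line below; the property name and the full path are assumptions
>>  > >> > >> > > > > > > > > > > > for illustration only, not the shipped defaults.
>>  > >> > >> > > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > >     # Illustrative only; the property name and path are assumptions.
>>  > >> > >> > > > > > > > > > > >     profiler.batch.input.path=hdfs://localhost:8020/apps/metron/indexing/indexed/*
>>  > >> > >> > > > > > > > > > > >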
>>  > >> > >> > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/METRON-1699
>>  > >> > >> > > > > > > > > > > > [2] https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
>>  > >> > >> > > > > > > > > > > > [3] https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
>>  > >> > >> > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > -------------------
>>  > >> > >> > > > > > > > > > > Thank you,
>>  > >> > >> > > > > > > > > > >
>>  > >> > >> > > > > > > > > > > James Sirota
>>  > >> > >> > > > > > > > > > > PMC- Apache Metron
>>  > >> > >> > > > > > > > > > > jsirota AT apache DOT org

------------------- 
Thank you,

James Sirota
PMC- Apache Metron
jsirota AT apache DOT org
