Thanks for all the reviews and support. I have merged the feature branch into master.
On Thu, Sep 27, 2018 at 2:41 PM James Sirota <jsir...@apache.org> wrote:
> +1 from me as well. great work
>
> 27.09.2018, 11:15, "Ryan Merriman" <merrim...@gmail.com>:
> > +1 from me. Great work.
> >
> > On Thu, Sep 27, 2018 at 12:41 PM Justin Leet <justinjl...@gmail.com> wrote:
> >> I'm +1 on merging the feature branch into master. There's a lot of good work here, and it's definitely been nice to see the couple remaining improvements make it in.
> >>
> >> Thanks a lot for the contribution, this is great stuff!
> >>
> >> On Wed, Sep 26, 2018 at 6:26 PM Nick Allen <n...@nickallen.org> wrote:
> >> > Or support to be offered for merging this feature branch into master?
> >> >
> >> > On Wed, Sep 26, 2018 at 6:20 PM Nick Allen <n...@nickallen.org> wrote:
> >> > > Thanks for the review. With https://github.com/apache/metron/pull/1209 complete, I think the feature branch is ready to be merged. Sounds like I have Mike's support. Anyone else have comments, concerns, questions?
> >> > >
> >> > > On Tue, Sep 25, 2018 at 10:33 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
> >> > >> I just made a couple minor comments on that PR, and I am in agreement about the readiness for merging with master. Good stuff Nick.
> >> > >>
> >> > >> On Fri, Sep 21, 2018 at 12:37 PM Nick Allen <n...@nickallen.org> wrote:
> >> > >> > Here is a PR that adds the input time constraints to the Batch Profiler (METRON-1787); https://github.com/apache/metron/pull/1209.
> >> > >> >
> >> > >> > It seems that the consensus is that this is probably the last feature we need before merging the FB into master. The other two can wait until after the feature branch has been merged. Let me know if you disagree.
> >> > >> >
> >> > >> > Thanks
> >> > >> >
> >> > >> > On Thu, Sep 20, 2018 at 1:55 PM Nick Allen <n...@nickallen.org> wrote:
> >> > >> > > Yeah, agreed. Per use case 3, when deploying to production there really wouldn't be a huge overlap like 3 months of already profiled data. It's day 1, the profile was just deployed around the same time as you are running the Batch Profiler, so the overlap is in minutes, maybe hours. But I can definitely see the usefulness of the feature for re-runs, etc as you have described.
> >> > >> > >
> >> > >> > > Based on this discussion, I created a few JIRAs. Thanks all for the great feedback and keep it coming.
> >> > >> > >
> >> > >> > > [1] METRON-1787 - Input Time Constraints for Batch Profiler
> >> > >> > > [2] METRON-1788 - Fetch Profile Definitions from Zk for Batch Profiler
> >> > >> > > [3] METRON-1789 - MPack Should Define Default Input Path for Batch Profiler
> >> > >> > >
> >> > >> > > --
> >> > >> > > [1] https://issues.apache.org/jira/browse/METRON-1787
> >> > >> > > [2] https://issues.apache.org/jira/browse/METRON-1788
> >> > >> > > [3] https://issues.apache.org/jira/browse/METRON-1789
> >> > >> > >
> >> > >> > > On Thu, Sep 20, 2018 at 1:34 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
> >> > >> > >> I think we might want to allow the flexibility to choose the date range then. I don't yet feel like I have a good enough understanding of all the ways in which users would want to seed to force them to run the batch job over all the data.
> >> > >> > >> It might also make it easier to deal with remediation, ie an error doesn't force you to re-run over the entire history. Same goes for testing out the profile seeding batch job in the first place.
> >> > >> > >>
> >> > >> > >> On Thu, Sep 20, 2018 at 11:23 AM Nick Allen <n...@nickallen.org> wrote:
> >> > >> > >> > Assuming you have 9 months of data archived, yes.
> >> > >> > >> >
> >> > >> > >> > On Thu, Sep 20, 2018 at 1:22 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
> >> > >> > >> > > So in the case of 3 - if you had 6 months of data that hadn't been profiled and another 3 that had been profiled (9 months total data), in its current form the batch job runs over all 9 months?
> >> > >> > >> > >
> >> > >> > >> > > On Thu, Sep 20, 2018 at 11:13 AM Nick Allen <n...@nickallen.org> wrote:
> >> > >> > >> > > > > How do we establish "tm" from 1.1 above? Any concerns about overlap or gaps after the seeding is performed?
> >> > >> > >> > > >
> >> > >> > >> > > > Good point. Right now, if the Streaming and Batch Profiler overlap the last write wins. And presumably the output of the Streaming and Batch Profiler are the same, so no worries, right? :)
> >> > >> > >> > > >
> >> > >> > >> > > > So it kind of works, but it is definitely not ideal for use case 3. I could add --begin and --end args to constrain the time frame over which the Batch Profiler runs. I do not have that in the feature branch.
> >> > >> > >> > > > It would be easy enough to add though.
> >> > >> > >> > > >
> >> > >> > >> > > > On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
> >> > >> > >> > > > > Ok, makes sense. That's sort of what I was thinking as well, Nick. Pulling at this thread just a bit more...
> >> > >> > >> > > > >
> >> > >> > >> > > > > 1. I have an existing system that's been up a while, and I have added k profiles - assume these are the first profiles I've created.
> >> > >> > >> > > > >    1. I would have t0 - tm (where m is the time when the profiles were first installed) worth of data that has not been profiled yet.
> >> > >> > >> > > > >    2. The batch profiler process would be to take that exact profile definition from ZK and run the batch loader with that from the CLI.
> >> > >> > >> > > > >    3. Profiles are now up to date from t0 - tCurrent
> >> > >> > >> > > > > 2. I've already done #1 above. Time goes by and now I want to add a new profile.
> >> > >> > >> > > > >    1. Same first step above
> >> > >> > >> > > > >    2. I would run the batch loader with *only* that new profile definition to seed?
> >> > >> > >> > > > >
> >> > >> > >> > > > > Forgive me if I missed this in PR's and discussion in the FB, but how do we establish "tm" from 1.1 above?
> >> > >> > >> > > > > Any concerns about overlap or gaps after the seeding is performed?
> >> > >> > >> > > > >
> >> > >> > >> > > > > On Thu, Sep 20, 2018 at 10:26 AM Nick Allen <n...@nickallen.org> wrote:
> >> > >> > >> > > > > > I think more often than not, you would want to load your profile definition from a file. This is why I considered the 'load from Zk' more of a nice-to-have.
> >> > >> > >> > > > > >
> >> > >> > >> > > > > > - In use case 1 and 2, this would definitely be the case. The profiles I am working with are speculative and I am using the batch profiler to determine if they are worth keeping. In this case, my speculative profiles would not be in Zk (yet).
> >> > >> > >> > > > > > - In use case 3, I could see it go either way. It might be useful to load from Zk, but it certainly isn't a blocker.
> >> > >> > >> > > > > >
> >> > >> > >> > > > > > > So if the config does not correctly match the profiler config held in ZK and the user runs the batch seeding job, what happens?
> >> > >> > >> > > > > >
> >> > >> > >> > > > > > You would just get a profile that is slightly different over the entire time span. This is not a new risk.
> >> > >> > >> > > > > > If the user changes their Profile definitions in Zk, the same thing would happen.
> >> > >> > >> > > > > >
> >> > >> > >> > > > > > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
> >> > >> > >> > > > > > > I think I'm torn on this, specifically because it's batch and would generally be run as-needed. Justin, can you elaborate on your concerns there? This feels functionally very similar to our flat file loaders, which all have inputs for config from the CLI only. On the other hand, our flat file loaders are not typically seeding an existing structure. My concern of a local file profiler config stems from this stated goal:
> >> > >> > >> > > > > > > > The goal would be to enable “profile seeding” which allows profiles to be populated from a time before the profile was created.
> >> > >> > >> > > > > > > So if the config does not correctly match the profiler config held in ZK and the user runs the batch seeding job, what happens?
> >> > >> > >> > > > > > >
> >> > >> > >> > > > > > > On Thu, Sep 20, 2018 at 10:06 AM Justin Leet <justinjl...@gmail.com> wrote:
> >> > >> > >> > > > > > > > The profile not being able to read from ZK feels like a fairly substantial, if subtle, set of potential problems. I'd like to see that in either before merging or at least pretty soon after merging. Is it a lot of work to add that functionality based on where things are right now?
> >> > >> > >> > > > > > > >
> >> > >> > >> > > > > > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <n...@nickallen.org> wrote:
> >> > >> > >> > > > > > > > > Here is another limitation that I just thought of. It can only read a profile definition from a file. It probably also makes sense to add an option that allows it to read the current Profiler configuration from Zookeeper.
> >> > >> > >> > > > > > > > >
> >> > >> > >> > > > > > > > > > Is it worth setting up a default config that pulls from the main indexing output?
> >> > >> > >> > > > > > > > >
> >> > >> > >> > > > > > > > > Yes, I think that makes sense.
> >> > >> > >> > > > > > > > > We want the Batch Profiler to point to the right HDFS URL, no matter where/how Metron is deployed. When Metron gets spun-up on a cluster, I should be able to just run the Batch Profiler without having to fuss with the input path.
> >> > >> > >> > > > > > > > >
> >> > >> > >> > > > > > > > > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <justinjl...@gmail.com> wrote:
> >> > >> > >> > > > > > > > > > Re:
> >> > >> > >> > > > > > > > > >
> >> > >> > >> > > > > > > > > > > * You do not configure the Batch Profiler in Ambari. It is configured and executed completely from the command-line.
> >> > >> > >> > > > > > > > > >
> >> > >> > >> > > > > > > > > > Is it worth setting up a default config that pulls from the main indexing output? I'm a little on the fence about it, but it seems like making the most common case more or less built-in would be nice.
> >> > >> > >> > > > > > > > > >
> >> > >> > >> > > > > > > > > > Having said that, I do not consider that a requirement for merging the feature branch.
> >> > >> > >> > > > > > > > > >
> >> > >> > >> > > > > > > > > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <jsir...@apache.org> wrote:
> >> > >> > >> > > > > > > > > > > I think what you have outlined above is a good initial stab at the feature. Manual install of spark is not a big deal. Configuring via command line while we mature this feature is ok as well. Doesn't look like configuration steps are too hard. I think you should merge.
> >> > >> > >> > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > James
> >> > >> > >> > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > 19.09.2018, 08:15, "Nick Allen" <n...@nickallen.org>:
> >> > >> > >> > > > > > > > > > > > I would like to open a discussion to get the Batch Profiler feature branch merged into master as part of METRON-1699 [1] Create Batch Profiler.
> >> > >> > >> > > > > > > > > > > > All of the work that I had in mind for our first draft of the Batch Profiler has been completed. Please take a look through what I have and let me know if there are other features that you think are required *before* we merge.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > Previous list discussions on this topic include [2] and [3].
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > (Q) What can I do with the feature branch?
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > * With the Batch Profiler, you can backfill/seed profiles using archived telemetry. This enables the following types of use cases.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > >     1. As a Security Data Scientist, I want to understand the historical behaviors and trends of a profile that I have created so that I can determine if I have created a feature set that has predictive value for model building.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > >     2. As a Security Data Scientist, I want to understand the historical behaviors and trends of a profile that I have created so that I can determine if I have defined the profile correctly and created a feature set that matches reality.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > >     3. As a Security Platform Engineer, I want to generate a profile using archived telemetry when I deploy a new model to production so that models depending on that profile can function on day 1.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > * METRON-1699 [1] includes a more detailed description of the feature.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > (Q) What work was completed?
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > * The Batch Profiler runs on Spark and was implemented in Java to remain consistent with our current Java-heavy code base.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > * The Batch Profiler is executed from the command-line. It can be launched using a script or by calling `spark-submit`, which may be useful for advanced users.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > * Input telemetry can be consumed from multiple sources; for example HDFS or the local file system.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > * Input telemetry can be consumed in multiple formats; for example JSON or ORC.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > * The 'output' profile measurements are persisted in HBase and are consistent with the Storm Profiler.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > * It can be run on any underlying engine supported by Spark. I have tested it both in 'local' mode and on a YARN cluster.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > * It is installed automatically by the Metron MPack.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > * A README was added that documents usage instructions.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > * The existing Profiler code was refactored so that as much code as possible is shared between the 3 Profiler ports; Storm, the Stellar REPL, and Spark. For example, the logic which determines the timestamp of a message was refactored so that it could be reused by all ports.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > >     * metron-profiler-common: The common Profiler code shared amongst each port.
> >> > >> > >> > > > > > > > > > > >     * metron-profiler-storm: Profiler on Storm
> >> > >> > >> > > > > > > > > > > >     * metron-profiler-spark: Profiler on Spark
> >> > >> > >> > > > > > > > > > > >     * metron-profiler-repl: Profiler on the Stellar REPL
> >> > >> > >> > > > > > > > > > > >     * metron-profiler-client: The client code for retrieving profile data; for example PROFILE_GET.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > * There are 3 separate RPM and DEB packages now created for the Profiler.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > >     * metron-profiler-storm-*.rpm
> >> > >> > >> > > > > > > > > > > >     * metron-profiler-spark-*.rpm
> >> > >> > >> > > > > > > > > > > >     * metron-profiler-repl-*.rpm
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > * The Profiler integration tests were enhanced to leverage the Profiler Client logic to validate the results.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > * Review METRON-1699 [1] for a complete break-down of the tasks that have been completed on the feature branch.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > (Q) What limitations exist?
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > * You must manually install Spark to use the Batch Profiler. The Metron MPack does not treat Spark as a Metron dependency and so does not install it automatically.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > * You do not configure the Batch Profiler in Ambari. It is configured and executed completely from the command-line.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > * To run the Batch Profiler in 'Full Dev', you have to take the following manual steps. Some of these are arguably limitations with how Ambari installs Spark 2 in the version of HDP that we run.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > >     1. Install Spark 2 using Ambari.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > >     2. Tell Spark how to talk with HBase.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > >         SPARK_HOME=/usr/hdp/current/spark2-client
> >> > >> > >> > > > > > > > > > > >         cp /usr/hdp/current/hbase-client/conf/hbase-site.xml $SPARK_HOME/conf/
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > >     3. Create the Spark History directory in HDFS.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > >         export HADOOP_USER_NAME=hdfs
> >> > >> > >> > > > > > > > > > > >         hdfs dfs -mkdir /spark2-history
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > >     4. Change the default input path to `hdfs://localhost:8020/...` to match the port defined by HDP, instead of port 9000.
> >> > >> > >> > > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/METRON-1699
> >> > >> > >> > > > > > > > > > > > [2] https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
> >> > >> > >> > > > > > > > > > > > [3] https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
> >> > >> > >> > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > -------------------
> >> > >> > >> > > > > > > > > > > Thank you,
> >> > >> > >> > > > > > > > > > >
> >> > >> > >> > > > > > > > > > > James Sirota
> >> > >> > >> > > > > > > > > > > PMC- Apache Metron
> >> > >> > >> > > > > > > > > > > jsirota AT apache DOT org
>
> -------------------
> Thank you,
>
> James Sirota
> PMC- Apache Metron
> jsirota AT apache DOT org