+1 from me as well. Great work.

27.09.2018, 11:15, "Ryan Merriman" <merrim...@gmail.com>:

+1 from me. Great work.

On Thu, Sep 27, 2018 at 12:41 PM Justin Leet <justinjl...@gmail.com> wrote:

I'm +1 on merging the feature branch into master. There's a lot of good work here, and it's definitely been nice to see the couple remaining improvements make it in.

Thanks a lot for the contribution, this is great stuff!

On Wed, Sep 26, 2018 at 6:26 PM Nick Allen <n...@nickallen.org> wrote:

Or support to be offered for merging this feature branch into master?

On Wed, Sep 26, 2018 at 6:20 PM Nick Allen <n...@nickallen.org> wrote:

Thanks for the review. With https://github.com/apache/metron/pull/1209 complete, I think the feature branch is ready to be merged. Sounds like I have Mike's support. Anyone else have comments, concerns, questions?

On Tue, Sep 25, 2018 at 10:33 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:

I just made a couple minor comments on that PR, and I am in agreement about the readiness for merging with master. Good stuff Nick.

On Fri, Sep 21, 2018 at 12:37 PM Nick Allen <n...@nickallen.org> wrote:

Here is a PR that adds the input time constraints to the Batch Profiler (METRON-1787); https://github.com/apache/metron/pull/1209.

It seems that the consensus is that this is probably the last feature we need before merging the FB into master. The other two can wait until after the feature branch has been merged. Let me know if you disagree.

Thanks

On Thu, Sep 20, 2018 at 1:55 PM Nick Allen <n...@nickallen.org> wrote:

Yeah, agreed.
Per use case 3, when deploying to production there really wouldn't be a huge overlap like 3 months of already profiled data. It's day 1, the profile was just deployed around the same time as you are running the Batch Profiler, so the overlap is in minutes, maybe hours. But I can definitely see the usefulness of the feature for re-runs, etc. as you have described.

Based on this discussion, I created a few JIRAs. Thanks all for the great feedback and keep it coming.

[1] METRON-1787 - Input Time Constraints for Batch Profiler
[2] METRON-1788 - Fetch Profile Definitions from Zk for Batch Profiler
[3] METRON-1789 - MPack Should Define Default Input Path for Batch Profiler

--
[1] https://issues.apache.org/jira/browse/METRON-1787
[2] https://issues.apache.org/jira/browse/METRON-1788
[3] https://issues.apache.org/jira/browse/METRON-1789

On Thu, Sep 20, 2018 at 1:34 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:

I think we might want to allow the flexibility to choose the date range then. I don't yet feel like I have a good enough understanding of all the ways in which users would want to seed to force them to run the batch job over all the data. It might also make it easier to deal with remediation, i.e. an error doesn't force you to re-run over the entire history. Same goes for testing out the profile seeding batch job in the first place.
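For concreteness, the input time constraints proposed in METRON-1787 might look roughly like the sketch below. The launcher script name, flag names, and profile-definition format are assumptions drawn from this thread, not a finished interface; the sketch only constructs the command rather than running it.

```shell
# Hypothetical sketch of METRON-1787's proposed time constraints.
# Script name, flags, and the profile JSON format are illustrative
# assumptions based on this discussion.

# A minimal file-based profile definition (the common case per the
# thread; no ZooKeeper involved).
cat > profiles.json <<'EOF'
{
  "profiles": [
    {
      "profile": "hello-world",
      "foreach": "ip_src_addr",
      "init":   { "count": "0" },
      "update": { "count": "count + 1" },
      "result": "count"
    }
  ]
}
EOF

# Construct (without running) a batch job constrained to the window
# between profile deployment and "now" -- use case 3 below.
METRON_HOME="${METRON_HOME:-/usr/metron/current}"
CMD="$METRON_HOME/bin/start_batch_profiler.sh --profiles profiles.json --begin 2018-09-01T00:00:00Z --end 2018-09-20T00:00:00Z"
echo "$CMD"
```

Constraining the window like this also covers the remediation case above: a failed run can be repeated over just the affected span instead of the entire history.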
On Thu, Sep 20, 2018 at 11:23 AM Nick Allen <n...@nickallen.org> wrote:

Assuming you have 9 months of data archived, yes.

On Thu, Sep 20, 2018 at 1:22 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:

So in the case of 3 - if you had 6 months of data that hadn't been profiled and another 3 that had been profiled (9 months total data), in its current form the batch job runs over all 9 months?

On Thu, Sep 20, 2018 at 11:13 AM Nick Allen <n...@nickallen.org> wrote:

> How do we establish "tm" from 1.1 above? Any concerns about overlap or gaps after the seeding is performed?

Good point. Right now, if the Streaming and Batch Profiler overlap, the last write wins. And presumably the output of the Streaming and Batch Profiler are the same, so no worries, right? :)

So it kind of works, but it is definitely not ideal for use case 3. I could add --begin and --end args to constrain the time frame over which the Batch Profiler runs. I do not have that in the feature branch. It would be easy enough to add though.

On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:

Ok, makes sense. That's sort of what I was thinking as well, Nick.
Pulling at this thread just a bit more...

1. I have an existing system that's been up a while, and I have added k profiles - assume these are the first profiles I've created.
   1. I would have t0 - tm (where m is the time when the profiles were first installed) worth of data that has not been profiled yet.
   2. The batch profiler process would be to take that exact profile definition from ZK and run the batch loader with that from the CLI.
   3. Profiles are now up to date from t0 - tCurrent.
2. I've already done #1 above. Time goes by and now I want to add a new profile.
   1. Same first step above.
   2. I would run the batch loader with *only* that new profile definition to seed?

Forgive me if I missed this in PRs and discussion in the FB, but how do we establish "tm" from 1.1 above? Any concerns about overlap or gaps after the seeding is performed?

On Thu, Sep 20, 2018 at 10:26 AM Nick Allen <n...@nickallen.org> wrote:

I think more often than not, you would want to load your profile definition from a file.
This is why I considered the 'load from Zk' more of a nice-to-have.

- In use case 1 and 2, this would definitely be the case. The profiles I am working with are speculative and I am using the batch profiler to determine if they are worth keeping. In this case, my speculative profiles would not be in Zk (yet).
- In use case 3, I could see it go either way. It might be useful to load from Zk, but it certainly isn't a blocker.

> So if the config does not correctly match the profiler config held in ZK and the user runs the batch seeding job, what happens?

You would just get a profile that is slightly different over the entire time span. This is not a new risk. If the user changes their Profile definitions in Zk, the same thing would happen.

On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:

I think I'm torn on this, specifically because it's batch and would generally be run as-needed. Justin, can you elaborate on your concerns there?
This feels functionally very similar to our flat file loaders, which all have inputs for config from the CLI only. On the other hand, our flat file loaders are not typically seeding an existing structure. My concern about a local file profiler config stems from this stated goal:

> The goal would be to enable "profile seeding" which allows profiles to be populated from a time before the profile was created.

So if the config does not correctly match the profiler config held in ZK and the user runs the batch seeding job, what happens?

On Thu, Sep 20, 2018 at 10:06 AM Justin Leet <justinjl...@gmail.com> wrote:

The profile not being able to read from ZK feels like a fairly substantial, if subtle, set of potential problems. I'd like to see that in either before merging or at least pretty soon after merging. Is it a lot of work to add that functionality based on where things are right now?
>> > >> > >> > > > > > > > >> > >> > >> > > > > > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen < >> > >> > >> n...@nickallen.org >> > >> > >> > > >> > >> > >> > > > > wrote: >> > >> > >> > > > > > > > >> > >> > >> > > > > > > > > Here is another limitation that I just thought. >> It >> > >> can >> > >> > >> only >> > >> > >> > > read >> > >> > >> > > > a >> > >> > >> > > > > > > > profile >> > >> > >> > > > > > > > > definition from a file. It probably also makes >> > >> sense to >> > >> > >> add >> > >> > >> > an >> > >> > >> > > > > > option >> > >> > >> > > > > > > > that >> > >> > >> > > > > > > > > allows it to read the current Profiler >> > configuration >> > >> > from >> > >> > >> > > > > Zookeeper. >> > >> > >> > > > > > > > > >> > >> > >> > > > > > > > > >> > >> > >> > > > > > > > > > Is it worth setting up a default config that >> > pulls >> > >> > from >> > >> > >> the >> > >> > >> > > > main >> > >> > >> > > > > > > > indexing >> > >> > >> > > > > > > > > output? >> > >> > >> > > > > > > > > >> > >> > >> > > > > > > > > Yes, I think that makes sense. We want the Batch >> > >> > >> Profiler to >> > >> > >> > > > point >> > >> > >> > > > > > to >> > >> > >> > > > > > > > the >> > >> > >> > > > > > > > > right HDFS URL, no matter where/how Metron is >> > >> deployed. >> > >> > >> When >> > >> > >> > > > > Metron >> > >> > >> > > > > > > gets >> > >> > >> > > > > > > > > spun-up on a cluster, I should be able to just >> run >> > >> the >> > >> > >> Batch >> > >> > >> > > > > Profiler >> > >> > >> > > > > > > > > without having to fuss with the input path. 
>> > >> > >> > > > > > > > > >> > >> > >> > > > > > > > > >> > >> > >> > > > > > > > > >> > >> > >> > > > > > > > > >> > >> > >> > > > > > > > > >> > >> > >> > > > > > > > > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet < >> > >> > >> > > > justinjl...@gmail.com >> > >> > >> > > > > > >> > >> > >> > > > > > > > wrote: >> > >> > >> > > > > > > > > >> > >> > >> > > > > > > > > > Re: >> > >> > >> > > > > > > > > > >> > >> > >> > > > > > > > > > > * You do not configure the Batch Profiler in >> > >> > >> Ambari. It >> > >> > >> > > is >> > >> > >> > > > > > > > configured >> > >> > >> > > > > > > > > > > and executed completely from the >> command-line. >> > >> > >> > > > > > > > > > > >> > >> > >> > > > > > > > > > >> > >> > >> > > > > > > > > > Is it worth setting up a default config that >> > pulls >> > >> > from >> > >> > >> the >> > >> > >> > > > main >> > >> > >> > > > > > > > indexing >> > >> > >> > > > > > > > > > output? I'm a little on the fence about it, >> but >> > it >> > >> > >> seems >> > >> > >> > > like >> > >> > >> > > > > > making >> > >> > >> > > > > > > > the >> > >> > >> > > > > > > > > > most common case more or less built-in would be >> > >> nice. >> > >> > >> > > > > > > > > > >> > >> > >> > > > > > > > > > Having said that, I do not consider that a >> > >> requirement >> > >> > >> for >> > >> > >> > > > > merging >> > >> > >> > > > > > > the >> > >> > >> > > > > > > > > > feature branch. >> > >> > >> > > > > > > > > > >> > >> > >> > > > > > > > > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota < >> > >> > >> > > > > jsir...@apache.org> >> > >> > >> > > > > > > > > wrote: >> > >> > >> > > > > > > > > > >> > >> > >> > > > > > > > > > > I think what you have outlined above is a >> good >> > >> > initial >> > >> > >> > stab >> > >> > >> > > > at >> > >> > >> > > > > > the >> > >> > >> > > > > > > > > > > feature. Manual install of spark is not a >> big >> > >> deal. 
Configuring via command line while we mature this feature is ok as well. Doesn't look like configuration steps are too hard. I think you should merge.

James

19.09.2018, 08:15, "Nick Allen" <n...@nickallen.org>:

I would like to open a discussion to get the Batch Profiler feature branch merged into master as part of METRON-1699 [1] Create Batch Profiler. All of the work that I had in mind for our first draft of the Batch Profiler has been completed. Please take a look through what I have and let me know if there are other features that you think are required *before* we merge.

Previous list discussions on this topic include [2] and [3].

(Q) What can I do with the feature branch?
* With the Batch Profiler, you can backfill/seed profiles using archived telemetry. This enables the following types of use cases.

  1. As a Security Data Scientist, I want to understand the historical behaviors and trends of a profile that I have created so that I can determine if I have created a feature set that has predictive value for model building.

  2. As a Security Data Scientist, I want to understand the historical behaviors and trends of a profile that I have created so that I can determine if I have defined the profile correctly and created a feature set that matches reality.

  3. As a Security Platform Engineer, I want to generate a profile using archived telemetry when I deploy a new model to production so that models depending on that profile can function on day 1.

* METRON-1699 [1] includes a more detailed description of the feature.

(Q) What work was completed?

* The Batch Profiler runs on Spark and was implemented in Java to remain consistent with our current Java-heavy code base.

* The Batch Profiler is executed from the command-line. It can be launched using a script or by calling `spark-submit`, which may be useful for advanced users.

* Input telemetry can be consumed from multiple sources; for example HDFS or the local file system.

* Input telemetry can be consumed in multiple formats; for example JSON or ORC.

* The 'output' profile measurements are persisted in HBase and are consistent with the Storm Profiler.

* It can be run on any underlying engine supported by Spark. I have tested it both in 'local' mode and on a YARN cluster.

* It is installed automatically by the Metron MPack.

* A README was added that documents usage instructions.

* The existing Profiler code was refactored so that as much code as possible is shared between the 3 Profiler ports: Storm, the Stellar REPL, and Spark. For example, the logic which determines the timestamp of a message was refactored so that it could be reused by all ports.
  * metron-profiler-common: The common Profiler code shared amongst each port.
  * metron-profiler-storm: Profiler on Storm.
  * metron-profiler-spark: Profiler on Spark.
  * metron-profiler-repl: Profiler on the Stellar REPL.
  * metron-profiler-client: The client code for retrieving profile data; for example PROFILE_GET.

* There are 3 separate RPM and DEB packages now created for the Profiler.

  * metron-profiler-storm-*.rpm
  * metron-profiler-spark-*.rpm
  * metron-profiler-repl-*.rpm

* The Profiler integration tests were enhanced to leverage the Profiler Client logic to validate the results.

* Review METRON-1699 [1] for a complete break-down of the tasks that have been completed on the feature branch.

(Q) What limitations exist?
* You must manually install Spark to use the Batch Profiler. The Metron MPack does not treat Spark as a Metron dependency and so does not install it automatically.

* You do not configure the Batch Profiler in Ambari. It is configured and executed completely from the command-line.

* To run the Batch Profiler in 'Full Dev', you have to take the following manual steps. Some of these are arguably limitations with how Ambari installs Spark 2 in the version of HDP that we run.

  1. Install Spark 2 using Ambari.

  2. Tell Spark how to talk with HBase.

       SPARK_HOME=/usr/hdp/current/spark2-client
       cp /usr/hdp/current/hbase-client/conf/hbase-site.xml $SPARK_HOME/conf/

  3. Create the Spark History directory in HDFS.
       export HADOOP_USER_NAME=hdfs
       hdfs dfs -mkdir /spark2-history

  4. Change the default input path to `hdfs://localhost:8020/...` to match the port defined by HDP, instead of port 9000.

[1] https://issues.apache.org/jira/browse/METRON-1699
[2] https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
[3] https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E

-------------------
Thank you,

James Sirota
PMC- Apache Metron
jsirota AT apache DOT org
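For convenience, the Full Dev steps 2 and 3 above can be collected into a single script. The commands are the ones quoted in this thread; paths may differ on other HDP versions, and the input-path note in the final comment restates step 4 rather than prescribing a flag.

```shell
#!/usr/bin/env bash
# Consolidated sketch of the manual Full Dev steps above (run after
# installing Spark 2 via Ambari). Paths come from the thread; verify
# them against your HDP version.
set -euo pipefail

# Step 2: Tell Spark how to talk with HBase.
export SPARK_HOME=/usr/hdp/current/spark2-client
cp /usr/hdp/current/hbase-client/conf/hbase-site.xml "$SPARK_HOME/conf/"

# Step 3: Create the Spark History directory in HDFS.
export HADOOP_USER_NAME=hdfs
hdfs dfs -mkdir /spark2-history

# Step 4 (manual): when running the profiler, use an input path of the
# form hdfs://localhost:8020/... to match the NameNode port defined by
# HDP, instead of port 9000.
```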
-------------------
Thank you,

James Sirota
PMC- Apache Metron
jsirota AT apache DOT org