The Batch Profiler not being able to read profile definitions from ZooKeeper feels like a fairly substantial, if subtle, source of potential problems. I'd like to see that added either before merging or at least pretty soon after. Is it a lot of work to add that functionality, given where things are right now?
On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <n...@nickallen.org> wrote:

> Here is another limitation that I just thought of. It can only read a
> profile definition from a file. It probably also makes sense to add an
> option that allows it to read the current Profiler configuration from
> Zookeeper.
>
> > Is it worth setting up a default config that pulls from the main
> > indexing output?
>
> Yes, I think that makes sense. We want the Batch Profiler to point to the
> right HDFS URL, no matter where/how Metron is deployed. When Metron gets
> spun-up on a cluster, I should be able to just run the Batch Profiler
> without having to fuss with the input path.
>
> On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <justinjl...@gmail.com> wrote:
>
> > Re:
> >
> > > * You do not configure the Batch Profiler in Ambari. It is configured
> > > and executed completely from the command-line.
> >
> > Is it worth setting up a default config that pulls from the main
> > indexing output? I'm a little on the fence about it, but it seems like
> > making the most common case more or less built-in would be nice.
> >
> > Having said that, I do not consider that a requirement for merging the
> > feature branch.
> >
> > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <jsir...@apache.org> wrote:
> >
> > > I think what you have outlined above is a good initial stab at the
> > > feature. Manual install of Spark is not a big deal. Configuring via
> > > command line while we mature this feature is ok as well. Doesn't look
> > > like configuration steps are too hard. I think you should merge.
> > >
> > > James
> > >
> > > 19.09.2018, 08:15, "Nick Allen" <n...@nickallen.org>:
> > > > I would like to open a discussion to get the Batch Profiler feature
> > > > branch merged into master as part of METRON-1699 [1] Create Batch
> > > > Profiler. All of the work that I had in mind for our first draft of
> > > > the Batch Profiler has been completed.
> > > > Please take a look through what I have and let me know if there are
> > > > other features that you think are required *before* we merge.
> > > >
> > > > Previous list discussions on this topic include [2] and [3].
> > > >
> > > > (Q) What can I do with the feature branch?
> > > >
> > > > * With the Batch Profiler, you can backfill/seed profiles using
> > > > archived telemetry. This enables the following types of use cases.
> > > >
> > > >     1. As a Security Data Scientist, I want to understand the
> > > >     historical behaviors and trends of a profile that I have created
> > > >     so that I can determine if I have created a feature set that has
> > > >     predictive value for model building.
> > > >
> > > >     2. As a Security Data Scientist, I want to understand the
> > > >     historical behaviors and trends of a profile that I have created
> > > >     so that I can determine if I have defined the profile correctly
> > > >     and created a feature set that matches reality.
> > > >
> > > >     3. As a Security Platform Engineer, I want to generate a profile
> > > >     using archived telemetry when I deploy a new model to production
> > > >     so that models depending on that profile can function on day 1.
> > > >
> > > > * METRON-1699 [1] includes a more detailed description of the
> > > > feature.
> > > >
> > > > (Q) What work was completed?
> > > >
> > > > * The Batch Profiler runs on Spark and was implemented in Java to
> > > > remain consistent with our current Java-heavy code base.
> > > >
> > > > * The Batch Profiler is executed from the command-line. It can be
> > > > launched using a script or by calling `spark-submit`, which may be
> > > > useful for advanced users.
> > > >
> > > > * Input telemetry can be consumed from multiple sources; for example
> > > > HDFS or the local file system.
> > > >
> > > > * Input telemetry can be consumed in multiple formats; for example
> > > > JSON or ORC.
> > > > * The output profile measurements are persisted in HBase, which is
> > > > consistent with the Storm Profiler.
> > > >
> > > > * It can be run on any underlying engine supported by Spark. I have
> > > > tested it both in 'local' mode and on a YARN cluster.
> > > >
> > > > * It is installed automatically by the Metron MPack.
> > > >
> > > > * A README was added that documents usage instructions.
> > > >
> > > > * The existing Profiler code was refactored so that as much code as
> > > > possible is shared between the 3 Profiler ports: Storm, the Stellar
> > > > REPL, and Spark. For example, the logic which determines the
> > > > timestamp of a message was refactored so that it could be reused by
> > > > all ports.
> > > >
> > > >     * metron-profiler-common: The common Profiler code shared
> > > >     amongst each port.
> > > >     * metron-profiler-storm: Profiler on Storm.
> > > >     * metron-profiler-spark: Profiler on Spark.
> > > >     * metron-profiler-repl: Profiler on the Stellar REPL.
> > > >     * metron-profiler-client: The client code for retrieving
> > > >     profile data; for example PROFILE_GET.
> > > >
> > > > * There are 3 separate RPM and DEB packages now created for the
> > > > Profiler.
> > > >
> > > >     * metron-profiler-storm-*.rpm
> > > >     * metron-profiler-spark-*.rpm
> > > >     * metron-profiler-repl-*.rpm
> > > >
> > > > * The Profiler integration tests were enhanced to leverage the
> > > > Profiler Client logic to validate the results.
> > > >
> > > > * Review METRON-1699 [1] for a complete break-down of the tasks
> > > > that have been completed on the feature branch.
> > > >
> > > > (Q) What limitations exist?
> > > >
> > > > * You must manually install Spark to use the Batch Profiler. The
> > > > Metron MPack does not treat Spark as a Metron dependency and so
> > > > does not install it automatically.
> > > >
> > > > * You do not configure the Batch Profiler in Ambari.
> > > > It is configured and executed completely from the command-line.
> > > >
> > > > * To run the Batch Profiler in 'Full Dev', you have to take the
> > > > following manual steps. Some of these are arguably limitations with
> > > > how Ambari installs Spark 2 in the version of HDP that we run.
> > > >
> > > >     1. Install Spark 2 using Ambari.
> > > >
> > > >     2. Tell Spark how to talk with HBase.
> > > >
> > > >         SPARK_HOME=/usr/hdp/current/spark2-client
> > > >         cp /usr/hdp/current/hbase-client/conf/hbase-site.xml $SPARK_HOME/conf/
> > > >
> > > >     3. Create the Spark History directory in HDFS.
> > > >
> > > >         export HADOOP_USER_NAME=hdfs
> > > >         hdfs dfs -mkdir /spark2-history
> > > >
> > > >     4. Change the default input path to `hdfs://localhost:8020/...`
> > > >     to match the port defined by HDP, instead of port 9000.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/METRON-1699
> > > > [2] https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
> > > > [3] https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
> > >
> > > -------------------
> > > Thank you,
> > >
> > > James Sirota
> > > PMC- Apache Metron
> > > jsirota AT apache DOT org
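P.S. For anyone trying this on Full Dev, steps 2 and 3 quoted above boil down to a short shell session. This is only a sketch: the paths follow the HDP layout mentioned in the thread, and I've added `export` and `-p` so the commands are safe to re-run; it isn't a substitute for the README.

```shell
# Point at the Spark 2 client that Ambari installs (HDP layout).
export SPARK_HOME=/usr/hdp/current/spark2-client

# Step 2: let Spark talk to HBase by copying the HBase client config
# onto Spark's classpath.
cp /usr/hdp/current/hbase-client/conf/hbase-site.xml "$SPARK_HOME/conf/"

# Step 3: create the Spark history directory in HDFS, acting as the
# hdfs superuser; -p makes this a no-op if it already exists.
export HADOOP_USER_NAME=hdfs
hdfs dfs -mkdir -p /spark2-history
```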