FYI - Work is progressing on the Batch Profiler in Spark. For those interested, feel free to take a look at any of the PRs that are open on this feature branch.
https://github.com/apache/metron/pulls/nickwallen

On Mon, Jul 30, 2018 at 10:50 AM, Nick Allen <n...@nickallen.org> wrote:

> >> 1. We will need a breakdown of introducing Spark to the stack; required
> >> version due to HDP support; do we want to update HDP support before
> >> this?; Spark tuning/defaults; Spark configuration support / UI etc
>
> All sounds useful. I'm not sure how much of that we can do before we have
> the code that actually runs in Spark though. For example, you can't
> provide tuning defaults or configuration support until you have the code
> that needs to be tuned and configured. I see these as good follow-ons though.
>
> >> 2. When I read this, it seems like a Lambda architecture approach.
> >> Should we, as part of this, start exploring the possibility of replacing
> >> Storm with Spark Streaming such that we do not have to maintain separate
> >> streaming vs. batch codebases?
>
> Yes, I definitely think that's a likely possibility. It's probably not
> something I want to bite off as part of this work though. I'd like to just
> focus on getting the Batch Profiler functionality right.
>
> Here is what I had in mind for the initial set of PRs for the feature
> branch.
>
> 1. There are some backwards-compatible changes needed to run the core
>    Profiler. The current Profiler ports in Storm and REPL would continue to
>    work as-is.
> 2. Introduce the initial code for the Batch Profiler. This would require
>    the user to manually install Spark and do some manual setup to deploy and
>    run it.
> 3. Iterate on some unit and integration test enhancements for the Batch
>    Profiler.
> 4. Create packaging; the RPMs, DEBs.
> 5. Add support in the MPack to deploy the Batch Profiler.
> 6. Enhance support for alternative input formats. Initially support the
>    raw JSON we archive in Full Dev. But in real-world use cases, the telemetry
>    is going to be stored in alternative formats like ORC. I'd like to make it
>    as easy as possible to support multiple input formats.
>
> Thanks
>
> On Mon, Jul 30, 2018 at 9:50 AM, Otto Fowler <ottobackwa...@gmail.com>
> wrote:
>
>> I think the feature branch is a good idea, but what is in the feature
>> branch or feature branches will have to shake out.
>>
>> I agree in concept with what you have in the jira, but I have two points.
>>
>> 1. We will need a breakdown of introducing Spark to the stack
>>    - required version due to HDP support
>>    - do we want to update HDP support before this?
>>    - Spark tuning/defaults
>>    - Spark configuration support / UI etc
>>    - more….
>> 2. When I read this, it seems like a Lambda architecture approach.
>>    Should we, as part of this, start exploring the possibility of replacing
>>    Storm with Spark Streaming such that we do not have to maintain separate
>>    streaming vs. batch codebases?
>> 3. This mechanism would be used in the future for telemetry ‘replay’.
>>    That would mean that ( IMHO )
>>    - we should understand that case as well for this
>>    - build this capability out such that it is generic enough that a
>>      second use will not warrant a re-write or huge refactor
>>
>> I think this breaks down to a few sets of functionality:
>>
>> - Base support for deployment, management of Spark
>> - Metron services for triggering, and monitoring of Apache Spark ( on
>>   demand and constant ), maybe REST stuff like the caps
>> - UI / Stellar base support
>> - Build out of Batch Profiler service on top of that
>> - Build out of replay service on top of that ( plus all the replay
>>   stuff that needs to also be done - like are you replacing data or having
>>   two sets…. trial runs etc )
>> - ????
>> - profit
>>
>> On July 27, 2018 at 11:29:51, Nick Allen (n...@nickallen.org) wrote:
>>
>> Hi Everyone -
>>
>> A while back I opened up a discuss thread around the general idea of a
>> Batch Profiler [1]. I'd like to start making progress on a first draft of
>> that functionality.
>>
>> I created METRON-1699 [2] which outlines the general approach and ideas.
>> If you're interested, review that JIRA and let me know if you have
>> feedback. I will be adding sub-tasks to that JIRA as I make progress and
>> can separate it into logical bits for review.
>>
>> I would like this effort to use a feature branch as it will take a number
>> of PRs to get a first cut on the functionality. Pending no disagreement, I
>> will create the feature branch based on METRON-1699.
>>
>> [1] https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
>> [2] https://issues.apache.org/jira/browse/METRON-1699