Good points, Otto. +1 to all that.

On the Spark question, we should definitely be more deliberate about it. We
currently have an implicit dependency on Spark through the Zeppelin
notebooks. Most implementations I've seen of Metron also have some sort of
Spark work built around them. The current full dev HDP build is the latest
2.6.5 version available, even though the profile names 2.5.3. I'm not sure
we should take on jumping to 3.0 just yet for this effort. With the current
version we get Spark 2.3.0 by default, which should be sufficient.

On point two... yes, this does seem very much like a first step in the
direction of being able to replace Storm, but I would say that probably
deserves its very own feature branch. We would want to use things like the
Structured Streaming capability for this, which may remove the need for
some of the custom batch writers we have in Metron, delegating those
capabilities to Spark. My one concern here is that Spark Continuous
Triggers are still alpha grade, so we would have to accept some micro-batch
latency with a move to Spark. Realistically we have this issue anyway in
the Storm world, because we have to batch there too.
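
To put rough numbers on the micro-batch latency point, here is a toy sketch
in plain Python (not Spark or Storm code). The function name, the arrival
times and the 2 second interval are all made up for illustration. It assumes
triggers fire at exact multiples of the interval and ignores processing time.

```python
import math


def micro_batch_latencies(arrival_times, trigger_interval):
    """Return how long each event waits before its micro-batch fires.

    Assumption: triggers fire at exact multiples of trigger_interval, and an
    event arriving exactly on a trigger is picked up by that trigger.
    """
    latencies = []
    for t in arrival_times:
        # First trigger time at or after the event's arrival.
        next_trigger = math.ceil(t / trigger_interval) * trigger_interval
        latencies.append(next_trigger - t)
    return latencies


# Hypothetical arrival times in seconds, with a 2 second trigger interval.
arrivals = [0.5, 1.0, 2.25, 3.75]
waits = micro_batch_latencies(arrivals, trigger_interval=2.0)
print(waits)       # wait per event, in seconds
print(max(waits))  # worst case stays below the trigger interval
```

The added wait is bounded by the trigger interval, which is the same kind of
trade-off we already make with batched writes on the Storm side.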

I wonder whether it's worth considering an existing Spark 'host' such as
Apache Livy for managing jobs (not sure if that would actually add any
value). I'm also particularly keen on being able to use things like Spark to
query historical data under our current DAOs to drive the UI.

Simon


On 30 July 2018 at 14:50, Otto Fowler <ottobackwa...@gmail.com> wrote:

> I think the feature branch is a good idea, but what is in the feature
> branch or feature branches will have to shake out.
>
> I agree in concept with what you have in the jira, but I have two points.
>
>    1. We will need a break down of introducing Spark to the stack
>       - required version due to HDP support
>       - do we want to update HDP support before this?
>       - Spark tuning/defaults
>       - Spark configuration support / UI etc
>       - more….
>    2. When I read this, it seems like a Lambda architecture approach. Should
>    we, as part of this, start exploring the possibility of replacing Storm
>    with Spark Streaming such that we do not have to maintain separate
>    streaming vs. batch codebases?
>    3. This mechanism would be used in the future for telemetry ‘replay’.
>    That would mean that ( IMHO )
>       - we should understand that case as well for this
>       - build this capability out such that it is generic enough that a
>       second use will not warrant a re-write or huge refactor
>
> I think this breaks down to a few sets of functionality:
>
>    - Base support for deployment, management of Spark
>    - Metron services for triggering, and monitoring of Apache Spark ( on
>      demand and constant ), maybe rest stuff like the caps
>    - UI / Stellar base support
>    - Build out of Batch Profiler service on top of that
>    - Build out of replay service on top of that ( plus all the replay stuff
>      that needs to also be done - like are you replacing data or having two
>      sets…. trial runs etc )
>    - ????
>    - profit
>
>
>
>
> On July 27, 2018 at 11:29:51, Nick Allen (n...@nickallen.org) wrote:
>
> Hi Everyone -
>
> A while back I opened up a discuss thread around the general idea of a
> Batch Profiler [1]. I'd like to start making progress on a first draft of
> that functionality.
>
> I created METRON-1699 [2] which outlines the general approach and ideas.
> If you're interested, review that JIRA and let me know if you have
> feedback. I will be adding sub-tasks to that JIRA as I make progress and
> can separate it into logical bits for review.
>
> I would like this effort to use a feature branch as it will take a number
> of PRs to get a first cut on the functionality. Pending no disagreement, I
> will create the feature branch based on METRON-1699.
>
> [1]
> https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
>
> [2] https://issues.apache.org/jira/browse/METRON-1699
>



-- 
simon elliston ball
@sireb
