Re: [DISCUSS] Batch Profiler

Nick Allen Thu, 16 Aug 2018 11:06:01 -0700

FYI - Work is progressing on the Batch Profiler in Spark.  For those
interested, feel free to take a look at any of the PRs that are open on
this feature branch.


https://github.com/apache/metron/pulls/nickwallen

On Mon, Jul 30, 2018 at 10:50 AM, Nick Allen <n...@nickallen.org> wrote:

> >>  1. We will need a breakdown of introducing Spark to the stack; required
> version due to HDP support; do we want to update HDP support before
> this?; Spark tuning/defaults; Spark configuration support / UI etc
>
> That all sounds useful. I'm not sure how much of it we can do before we
> have the code that actually runs in Spark, though.  For example, you can't
> provide tuning defaults or configuration support until you have the code
> that needs to be tuned and configured.  I see these as good follow-ons though.
>
>
> >> 2. When I read this, it seems like a Lambda architecture approach.
> Should we, as part of this, start exploring the possibility of replacing
> Storm with Spark Streaming such that we do not have to maintain separate
> streaming vs. batch codebases?
>
> Yes, I definitely think that's a likely possibility.  It's probably not
> something I want to bite off as part of this work though.  I'd like to just
> focus on getting the Batch Profiler functionality right.
>
>
> Here is what I had in mind for the initial set of PRs for the feature
> branch.
>
> 1. There are some backwards-compatible changes needed to run the core
> Profiler.  The current Profiler ports in Storm and REPL would continue to
> work as-is.
> 2. Introduce the initial code for the Batch Profiler.  This would require
> the user to manually install Spark and do some manual setup to deploy and
> run it.
> 3. Iterate on some unit and integration test enhancements for the Batch
> Profiler.
> 4. Create packaging; the RPMs, DEBs.
> 5. Add support in the MPack to deploy the Batch Profiler.
> 6. Enhance support for alternative input formats.  Initially, support the
> raw JSON we archive in Full Dev. But in real-world use cases, the telemetry
> is going to be stored in alternative formats like ORC.  I'd like to make it
> as easy as possible to support multiple input formats.
>
>
>
> Thanks
>
>
>
>
> On Mon, Jul 30, 2018 at 9:50 AM, Otto Fowler <ottobackwa...@gmail.com>
> wrote:
>
>> I think the feature branch is a good idea, but what is in the feature
>> branch or feature branches will have to shake out.
>>
>> I agree in concept with what you have in the jira, but I have two points.
>>
>>    1. We will need a breakdown of introducing Spark to the stack
>>       - required version due to HDP support
>>       - do we want to update HDP support before this?
>>       - Spark tuning/defaults
>>       - Spark configuration support / UI etc
>>       - more….
>>    2. When I read this, it seems like a Lambda architecture approach.
>>    Should we, as part of this, start exploring the possibility of replacing
>>    Storm with Spark Streaming such that we do not have to maintain separate
>>    streaming vs. batch codebases?
>>    3. This mechanism would be used in the future for telemetry ‘replay’.
>>    That would mean that ( IMHO )
>>       - we should understand that case as well for this
>>       - build this capability out such that it is generic enough that a
>>       second use will not warrant a re-write or huge refactor
>>
>> I think this breaks down to a few sets of functionality:
>>
>>    - Base support for deployment and management of Spark
>>    - Metron services for triggering, and monitoring of Apache Spark ( on
>>      demand and constant ), maybe rest stuff like the caps
>>    - UI / Stellar base support
>>    - Build out of Batch Profiler service on top of that
>>    - Build out of replay service on top of that ( plus all the replay
>>      stuff that needs to also be done - like are you replacing data or having
>>      two sets…. trial runs etc )
>>    - ????
>>    - profit
>>
>>
>>
>>
>> On July 27, 2018 at 11:29:51, Nick Allen (n...@nickallen.org) wrote:
>>
>> Hi Everyone -
>>
>> A while back I opened up a discuss thread around the general idea of a
>> Batch Profiler [1]. I'd like to start making progress on a first draft of
>> that functionality.
>>
>> I created METRON-1699 [2] which outlines the general approach and ideas.
>> If you're interested, review that JIRA and let me know if you have
>> feedback. I will be adding sub-tasks to that JIRA as I make progress and
>> can separate it into logical bits for review.
>>
>> I would like this effort to use a feature branch as it will take a number
>> of PRs to get a first cut on the functionality. Pending no disagreement,
>> I will create the feature branch based on METRON-1699.
>>
>> [1]
>> https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
>> [2] https://issues.apache.org/jira/browse/METRON-1699
>>
>>
>
