Re: Configuration-driven ingest for the Open Data Model (ODM) using Spark Streaming (Envelope)

Austin Leahy Wed, 02 May 2018 04:45:54 -0700

Curtis, this is very cool. Thanks for putting so much time into this. I will
check out the PR and comment.


On Tue, May 1, 2018 at 3:37 PM Curtis Howard <[email protected]> wrote:

> Hi Nathanael,
>
> So far only https://github.com/Open-Network-Insight/spot-nfdump.git
>
> The PR code is a proof-of-concept at this point - look forward to your
> thoughts on next steps though!
>
> Thanks again
> Curtis
>
> On Tue, May 1, 2018 at 6:28 PM, Nate Smith <[email protected]> wrote:
>
> > Curtis,
> >
> > Have you tested this with a standard version of nfdump? Or only
> > spot-nfdump?
> >
> > - Nathanael
> >
> > > On May 1, 2018, at 1:12 PM, Curtis Howard <[email protected]> wrote:
> > >
> > > Hi all,
> > >
> > > We had discussed prototyping Envelope for ingest in the past - I've
> > > submitted a PR for this which includes:
> > >  - Kafka -> Spark Streaming -> ODM Hive table applications for dns, flow
> > > and proxy raw source data
> > >  - a simple alternative for source data collection/dissection using
> > > tshark/nfdump/unzip + Flume (sinking data to Kafka)
> > >  - https://github.com/apache/incubator-spot/pull/144
> > >
> > > To quote directly from the Envelope site
> > > (https://github.com/cloudera-labs/envelope#envelope):
> > > *"Envelope is simply a pre-made Spark application that implements many of
> > > the tasks commonly found in ETL pipelines. In many cases, Envelope allows
> > > large pipelines to be developed on Spark with no coding required. When
> > > custom code is needed, there are pluggable points in Envelope for core
> > > functionality to be extended. Envelope works in batch and streaming
> > > modes."*
> > >
> > > For example, the complete Kafka/SparkStreaming/ODM ingest application
> > > definition for DNS:
> > > https://github.com/curtishoward/incubator-spot/blob/SPOT-181_envelope_ingest/spot-ingest/odm/workers/spot_proxy.conf
> > >
> > > From the perspective of the Spot project, my thoughts are that it would
> > > enable:
> > >  - faster turnaround time to ingest new source types while still allowing
> > > for arbitrarily complex ETL pipelines (data enrichment, data quality
> > > checks, etc.)
> > >  - simpler future integration with other storage layers (HBase, Kudu, for
> > > example)
> > >  - a framework that is simple to extend (input sources, output storage
> > > layers, translators, derivers, UDFs, ...)
> > >
> > > If there is interest, I will continue to refactor the current
> > > implementation - centralize/integrate configuration with spot.conf, test
> > > Kerberos integration, run performance tests and tune where possible.
> > >
> > > In the near term, I will also add a PR with Hive views for dns/flow/proxy
> > > under spot-ml/ - this should enable an end-to-end proof-of-concept ODM
> > > implementation using Envelope.
> > >
> > > Thanks
> > > Curtis
> >
> >
>

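For readers new to the idea discussed in this thread, here is a minimal sketch of what "configuration-driven ingest" means in general: the source type is described as data (delimiter, field names, types) and one generic routine handles every source. This is only an illustration under stated assumptions. It is not Envelope's configuration format or API, and the names DNS_SOURCE, translate and ingest, as well as the field names, are hypothetical.

```python
# Hypothetical illustration of configuration-driven ingest.
# This is NOT Envelope's configuration format or API.
from typing import Any, Dict, Iterable, List, Tuple

# Declarative description of one source type: a delimiter plus ordered
# (field name, type) pairs. The field names are made up for this example.
DNS_SOURCE: Dict[str, Any] = {
    "delimiter": ",",
    "fields": [
        ("frame_time", str),
        ("frame_len", int),
        ("ip_dst", str),
        ("dns_qry_name", str),
    ],
}


def translate(line: str, source: Dict[str, Any]) -> Dict[str, Any]:
    """Split one raw delimited line and cast each value as the config says."""
    values = line.rstrip("\n").split(source["delimiter"])
    fields = source["fields"]
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    return {name: cast(value) for (name, cast), value in zip(fields, values)}


def ingest(
    lines: Iterable[str], source: Dict[str, Any]
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Translate every line; collect lines that fail instead of aborting."""
    records: List[Dict[str, Any]] = []
    rejected: List[str] = []
    for line in lines:
        try:
            records.append(translate(line, source))
        except ValueError:
            rejected.append(line)
    return records, rejected
```

Under this sketch, supporting a new source type would mean adding another dictionary like DNS_SOURCE rather than writing new parsing code, which is the kind of faster turnaround the original message describes.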