Hi all,

We had discussed prototyping Envelope for ingest in the past - I've
submitted a PR for this which includes:
  - Kafka -> Spark Streaming -> ODM Hive table applications for dns, flow,
and proxy raw source data
  - a simple alternative for source data collection/dissection using
tshark/nfdump/unzip + Flume (sinking the data to Kafka); a sketch follows
this list
  - https://github.com/apache/incubator-spot/pull/144
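
For the Flume path, a minimal agent sketch (not the exact configuration
in the PR): the agent/channel names, interface, tshark field list, and
topic below are placeholders I've made up, and the Kafka sink property
names assume Flume 1.7+.

    # Exec source runs tshark and emits one comma-separated record per
    # DNS packet; everything here is illustrative, not the PR's config.
    spot.sources  = dns-src
    spot.channels = mem
    spot.sinks    = kafka-snk

    spot.sources.dns-src.type = exec
    spot.sources.dns-src.command = tshark -i eth0 -Y dns -T fields \
      -E separator=, -e frame.time_epoch -e ip.src -e ip.dst \
      -e dns.qry.name -e dns.qry.type
    spot.sources.dns-src.channels = mem

    spot.channels.mem.type = memory
    spot.channels.mem.capacity = 10000

    # Sink raw records to the Kafka topic the Envelope application reads
    spot.sinks.kafka-snk.type = org.apache.flume.sink.kafka.KafkaSink
    spot.sinks.kafka-snk.kafka.bootstrap.servers = broker1:9092
    spot.sinks.kafka-snk.kafka.topic = spot-dns
    spot.sinks.kafka-snk.channel = mem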

To quote directly from the Envelope site
(https://github.com/cloudera-labs/envelope#envelope):
*"Envelope is simply a pre-made Spark application that implements many of
the tasks commonly found in ETL pipelines. In many cases, Envelope allows
large pipelines to be developed on Spark with no coding required. When
custom code is needed, there are pluggable points in Envelope for core
functionality to be extended. Envelope works in batch and streaming modes."*

For example, here is the complete Kafka/Spark Streaming/ODM ingest
application definition for proxy:
https://github.com/curtishoward/incubator-spot/blob/SPOT-181_envelope_ingest/spot-ingest/odm/workers/spot_proxy.conf
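
(Each of the dns/flow/proxy workers shares the same structure.)
Abridged, and continuing the DNS example from the Flume sketch above, an
Envelope pipeline of this shape looks roughly like the following; the
broker, topic, field, and table names are placeholders rather than the
values in the PR, and the exact keys may vary across Envelope versions:

    application {
      name = "Spot DNS ingest"          # placeholder name
      batch.milliseconds = 10000
    }

    steps {
      dns_raw {
        input {
          type = kafka
          brokers = "broker1:9092"
          topics = spot-dns
          encoding = string
          translator {
            # Parse the comma-separated records from the collector
            type = delimited
            delimiter = ","
            field.names = [frame_time, ip_src, ip_dst, query_name, query_type]
            field.types = [string, string, string, string, string]
          }
        }
      }

      dns_odm {
        dependencies = [dns_raw]
        deriver {
          # Reshape/enrich with plain SQL before writing to the ODM table
          type = sql
          query.literal = "SELECT * FROM dns_raw"
        }
        planner {
          type = append
        }
        output {
          type = hive
          table = "spot.dns"            # placeholder ODM table
        }
      }
    }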

From the perspective of the Spot project, my thoughts are that it would
enable:
  - faster turnaround time to ingest new source types, while still
allowing for arbitrarily complex ETL pipelines (data enrichment, data
quality checks, etc.)
  - simpler future integration with other storage layers (HBase and
Kudu, for example); see the sketch after this list
  - a framework that is simple to extend (input sources, output storage
layers, translators, derivers, UDFs, ...)
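
On the storage-layer point, retargeting a pipeline should mostly come
down to swapping a step's output section; for instance, a hypothetical
Kudu sink for the step above (the master address and table name are
made up):

    output {
      type = kudu
      connection = "kudu-master:7051"   # placeholder Kudu master
      table.name = "impala::spot.dns"   # placeholder table
    }

A Kudu sink like this would typically pair with an upsert planner keyed
on the table's primary key, rather than the append planner shown above.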

If there is interest, I will continue to refactor the current
implementation: centralize and integrate its configuration with
spot.conf, test Kerberos integration, and run performance tests and tune
where possible.

In the near term, I will also add a PR with Hive views for dns/flow/proxy
under spot-ml/ - this should enable an end-to-end proof-of-concept ODM
implementation using Envelope.

Thanks
Curtis
