Hi all,

We had discussed prototyping Envelope for ingest in the past - I've submitted a PR for this which includes:

- Kafka -> Spark Streaming -> ODM Hive table applications for dns, flow and proxy raw source data
- a simple alternative for source data collection/dissection using tshark/nfdump/unzip + Flume (sinking data to Kafka)

https://github.com/apache/incubator-spot/pull/144

To quote directly from the Envelope site (https://github.com/cloudera-labs/envelope#envelope):

*"Envelope is simply a pre-made Spark application that implements many of the tasks commonly found in ETL pipelines. In many cases, Envelope allows large pipelines to be developed on Spark with no coding required. When custom code is needed, there are pluggable points in Envelope for core functionality to be extended. Envelope works in batch and streaming modes."*

For example, the complete Kafka/Spark Streaming/ODM ingest application definition for DNS:
https://github.com/curtishoward/incubator-spot/blob/SPOT-181_envelope_ingest/spot-ingest/odm/workers/spot_proxy.conf

From the perspective of the Spot project, I think it would provide:

- faster turnaround time to ingest new source types, while still allowing for arbitrarily complex ETL pipelines (data enrichment, data quality checks, etc.)
- simpler future integration with other storage layers (HBase or Kudu, for example)
- a framework that is simple to extend (input sources, output storage layers, translators, derivers, UDFs, ...)

If there is interest, I will continue to refactor the current implementation: centralize/integrate configuration with spot.conf, test Kerberos integration, run performance tests, and tune where possible.

In the near term, I will also add a PR with Hive views for dns/flow/proxy under spot-ml/. This should enable an end-to-end proof-of-concept ODM implementation using Envelope.

Thanks,
Curtis