Curtis, this is very cool. Thanks for putting so much time into this. I will check out the PR and comment.
On Tue, May 1, 2018 at 3:37 PM Curtis Howard <[email protected]> wrote:

> Hi Nathanael,
>
> So far only https://github.com/Open-Network-Insight/spot-nfdump.git
>
> The PR code is a proof-of-concept at this point - look forward to your
> thoughts on next steps though!
>
> Thanks again
> Curtis
>
> On Tue, May 1, 2018 at 6:28 PM, Nate Smith <[email protected]> wrote:
>
> > Curtis,
> >
> > Have you tested this with a standard version of nfdump? Or only
> > spot-nfdump?
> >
> > - Nathanael
> >
> > > On May 1, 2018, at 1:12 PM, Curtis Howard <[email protected]> wrote:
> > >
> > > Hi all,
> > >
> > > We had discussed prototyping Envelope for ingest in the past - I've
> > > submitted a PR for this which includes:
> > > - Kafka -> Spark streaming -> ODM Hive table applications for dns, flow
> > >   and proxy raw source data
> > > - a simple alternative for source data collection/dissection using
> > >   tshark/nfdump/unzip + Flume (sinking data to Kafka)
> > > - https://github.com/apache/incubator-spot/pull/144
> > >
> > > To quote directly from the Envelope site
> > > (https://github.com/cloudera-labs/envelope#envelope):
> > > *"Envelope is simply a pre-made Spark application that implements many
> > > of the tasks commonly found in ETL pipelines. In many cases, Envelope
> > > allows large pipelines to be developed on Spark with no coding
> > > required. When custom code is needed, there are pluggable points in
> > > Envelope for core functionality to be extended. Envelope works in
> > > batch and streaming modes."*
> > >
> > > For example, the complete Kafka/SparkStreaming/ODM ingest application
> > > definition for DNS:
> > > https://github.com/curtishoward/incubator-spot/blob/SPOT-181_envelope_ingest/spot-ingest/odm/workers/spot_proxy.conf
> > >
> > > From the perspective of the Spot project, my thoughts are that it would
> > > enable:
> > > - faster turnaround time to ingest new source types while still
> > >   allowing for arbitrarily complex ETL pipelines (data enrichment,
> > >   data quality checks, etc..)
> > > - simplify future integration with other storage layers (HBase, Kudu,
> > >   for example)
> > > - a framework that is simple to extend (input sources, output storage
> > >   layers, translators, derivers, UDFs, ...)
> > >
> > > If there is interest, I will continue to refactor the current
> > > implementation - centralize/integration configuration with spot.conf,
> > > test Kerberos integration, run performance tests and tune as possible.
> > >
> > > In the near term, I will also add a PR with Hive views for
> > > dns/flow/proxy under spot-ml/ - this should enable an end-to-end
> > > proof-of-concept ODM implementation using Envelope.
> > >
> > > Thanks
> > > Curtis
