ktzoulas opened a new pull request #141: Ingestion using Spark Streaming
URL: https://github.com/apache/incubator-spot/pull/141
 
 
   
   ## Ingestion using Spark Streaming
   A new branch of the Spot Ingest framework, in which ingestion via Spark Streaming 
[**Spark2**] is available for flow, DNS and proxy. The code was developed 
without modifying the existing one, so that this feature is provided as extra 
functionality. This means that the current functionality of the Spot Ingest 
framework is preserved, and the **Streaming Ingestion** can be used **as an 
alternative**.
   
   ### Implementation
   A new collector class [**Distributed Collector**] has been added, which is 
the same for all pipelines. The role of the Distributed Collector is similar to 
that of the existing collectors, as it processes the data before transmission. 
The Distributed Collector monitors a directory for newly created files. When a 
file is detected, it converts it into CSV format and stores the output in the 
local staging area. It then reads the CSV file line-by-line and creates smaller 
chunks of bytes. The size of each chunk depends on the maximum request size 
allowed by Kafka. Finally, it serializes each chunk into an Avro-encoded format 
and publishes it to the Kafka cluster.
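   The chunking step described above can be sketched as follows. This is a minimal illustration, not the actual incubator-spot code: the 1 MB limit (Kafka's default `max.request.size`) and the function name `chunk_csv_lines` are assumptions made for the example.

```python
# Sketch of the Distributed Collector's chunking step: read CSV lines and
# group them into chunks whose encoded size never exceeds the maximum
# request size allowed by Kafka. Each emitted chunk would then be
# Avro-serialized and published to the Kafka cluster.

MAX_REQUEST_SIZE = 1024 * 1024  # Kafka's default max.request.size (1 MB)

def chunk_csv_lines(lines, max_bytes=MAX_REQUEST_SIZE):
    """Yield lists of CSV lines whose combined UTF-8 size stays within max_bytes."""
    chunk, size = [], 0
    for line in lines:
        encoded = len(line.encode("utf-8"))
        if chunk and size + encoded > max_bytes:
            yield chunk          # current chunk is full; hand it off
            chunk, size = [], 0
        chunk.append(line)
        size += encoded
    if chunk:
        yield chunk              # flush the final, partially filled chunk
```

   Bounding chunks by encoded byte size (rather than line count) is what keeps each produced Kafka message under the broker's request-size limit regardless of record length.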
   Due to its architecture, the Distributed Collector can run **on an edge 
node** of the Big Data infrastructure as well as **on a remote host** (proxy 
server, vNSF, etc.).
   The Distributed Collector publishes only the CSV-converted file to Apache 
Kafka, not the original one. The binary file remains on the local filesystem of 
the current host.
   
   By contrast, the **Streaming Listener** can only run on the central 
infrastructure. It listens to a specific Kafka topic and consumes incoming 
messages. The streaming data is divided into batches (according to a time 
interval). These batches are deserialized by the Listener according to the 
supported Avro schema, parsed, and registered in the corresponding Hive table.
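   The per-batch work of the Listener can be sketched as below. For brevity this uses plain UTF-8 CSV payloads in place of the Avro codec, and the column names are illustrative rather than the real Spot flow schema; `process_batch` is a hypothetical helper, not a function from the incubator-spot source.

```python
import csv
import io

# Illustrative subset of a flow table schema; the real Spot tables
# carry many more columns.
FLOW_COLUMNS = ["treceived", "sip", "dip", "sport", "dport", "proto"]

def process_batch(payloads, columns=FLOW_COLUMNS):
    """Turn one micro-batch of Kafka payloads into rows ready for the
    corresponding Hive table. Each payload is one CSV chunk a collector
    published; here it is plain UTF-8 text rather than the Avro-encoded
    chunk the real Listener decodes against its schema."""
    rows = []
    for payload in payloads:
        reader = csv.reader(io.StringIO(payload.decode("utf-8")))
        for record in reader:
            if len(record) == len(columns):      # drop malformed records
                rows.append(dict(zip(columns, record)))
    return rows
```

   In the real pipeline this logic would run inside each Spark Streaming batch interval, with the resulting rows inserted into the pipeline's Hive table.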
   
   For each pipeline, the following modules have been implemented:
   * **processing** - contains methods used to convert and prepare ingested 
data before it is sent to the Kafka cluster.
   * **streaming** - contains methods used during the streaming process (such 
as the table schema).
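   As a rough illustration of the per-pipeline **streaming** module, a class like the following could expose the table name and schema the Listener needs when registering a batch. The class name, table name, columns, and `insert_statement` helper are all hypothetical, invented for this sketch rather than taken from the incubator-spot source.

```python
# Hypothetical sketch of what a pipeline's "streaming" module might expose:
# the Hive table and column list used when a deserialized batch is registered.

class DnsStreaming(object):
    TABLE = "dns"
    COLUMNS = ["frame_time", "frame_len", "ip_src", "dns_qry_name", "dns_qry_type"]

    def insert_statement(self, db):
        """Build a parameterized INSERT for this pipeline's Hive table."""
        placeholders = ", ".join(["%s"] * len(self.COLUMNS))
        return "INSERT INTO {0}.{1} VALUES ({2})".format(db, self.TABLE, placeholders)
```

   Keeping schema details in one module per pipeline is what lets a single Listener implementation serve flow, DNS and proxy alike.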
   
   ### Advantages
   A preliminary traffic-ingestion check for all pipelines (flow, DNS, proxy) 
of the distributed collector-worker modules has been run on an existing Spot 
v1.0 installation. In summary, the module performed flawlessly on netflow files 
(both .nfcapd and .csv), an order of **magnitude quicker** than the standard 
version's ingestion. DNS and proxy ingestion were tested as well, with similar 
results.
