[GitHub] [accumulo-website] keith-turner commented on a change in pull request #221: MASC blog post

GitBox Tue, 25 Feb 2020 15:03:53 -0800

keith-turner commented on a change in pull request #221: MASC blog post
URL: https://github.com/apache/accumulo-website/pull/221#discussion_r384169263


 ##########
 File path: _posts/blog/2020-02-11-accumulo-spark-connector.md
 ##########
 @@ -0,0 +1,192 @@
+---
+title: Microsoft MASC, an Apache Spark connector for Apache Accumulo
+author: Markus Cozowicz, Scott Graham
+---
+
+# Overview
+MASC provides an Apache Spark native connector for Apache Accumulo to 
integrate the rich Spark machine learning eco-system with the scalable and 
secure data storage capabilities of Accumulo. 
+
+## Major Features
+- Simplified Spark DataFrame read/write to Accumulo using DataSource v2 API
+- Speedup of 2-5x over existing approaches for pulling key-value data into 
DataFrame format
+- Scala and Python support without overhead for moving between languages
+- Process streaming data from Accumulo without loading it all into Spark memory
+- Push down filtering with a flexible expression language 
([JUEL](http://juel.sourceforge.net/)): user can define logical operators and 
comparisons to reduce the amount of data returned from Accumulo 
+- Column pruning based on selected fields transparently reduces the amount of 
data returned from Accumulo
+- Server side inference: ML model inference can run on the Accumulo nodes 
using MLeap to increase the scalability of AI solutions as well as keeping data 
in Accumulo
+
+## Use-cases
+There are many scenarios where use of this connector provides advantages, 
below we list a few common use-cases.
+
+**Scenario 1**: A data analyst needs to execute model inference on large 
amount of data in Accumulo.<br>
+**Benefit**: Instead of transferring all the data to a large Spark cluster to 
score using a Spark model, the model can be exported and pushed down using the 
connector to run on the Accumulo cluster. This can reduce the need for a large 
Spark cluster as well as the amount of data transferred between systems, and 
can improve inference speeds (>2x speedups observed).
 
 Review comment:
   ```suggestion
   **Benefit**: Instead of transferring all the data to a large Spark cluster 
to score using a Spark model, the connector exports and runs the model on the 
Accumulo cluster. This reduces the need for a large Spark cluster as well as 
the amount of data transferred between systems, and can improve inference 
speeds (>2x speedups observed).
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
