keith-turner commented on a change in pull request #221: MASC blog post
URL: https://github.com/apache/accumulo-website/pull/221#discussion_r384169263
##########
File path: _posts/blog/2020-02-11-accumulo-spark-connector.md
##########
@@ -0,0 +1,192 @@
+---
+title: Microsoft MASC, an Apache Spark connector for Apache Accumulo
+author: Markus Cozowicz, Scott Graham
+---
+
+# Overview
+MASC provides an Apache Spark native connector for Apache Accumulo to integrate the rich Spark machine learning eco-system with the scalable and secure data storage capabilities of Accumulo.
+
+## Major Features
+- Simplified Spark DataFrame read/write to Accumulo using DataSource v2 API
+- Speedup of 2-5x over existing approaches for pulling key-value data into DataFrame format
+- Scala and Python support without overhead for moving between languages
+- Process streaming data from Accumulo without loading it all into Spark memory
+- Push down filtering with a flexible expression language ([JUEL](http://juel.sourceforge.net/)): user can define logical operators and comparisons to reduce the amount of data returned from Accumulo
+- Column pruning based on selected fields transparently reduces the amount of data returned from Accumulo
+- Server side inference: ML model inference can run on the Accumulo nodes using MLeap to increase the scalability of AI solutions as well as keeping data in Accumulo
+
+## Use-cases
+There are many scenarios where use of this connector provides advantages, below we list a few common use-cases.
+
+**Scenario 1**: A data analyst needs to execute model inference on large amount of data in Accumulo.<br>
+**Benefit**: Instead of transferring all the data to a large Spark cluster to score using a Spark model, the model can be exported and pushed down using the connector to run on the Accumulo cluster. This can reduce the need for a large Spark cluster as well as the amount of data transferred between systems, and can improve inference speeds (>2x speedups observed).
Review comment:

```suggestion
**Benefit**: Instead of transferring all the data to a large Spark cluster to score using a Spark model, the connector exports and runs the model on the Accumulo cluster. This reduces the need for a large Spark cluster as well as the amount of data transferred between systems, and can improve inference speeds (>2x speedups observed).
```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
