keith-turner commented on a change in pull request #221: MASC blog post
URL: https://github.com/apache/accumulo-website/pull/221#discussion_r384173992
##########
File path: _posts/blog/2020-02-11-accumulo-spark-connector.md
##########
@@ -0,0 +1,192 @@
+---
+title: Microsoft MASC, an Apache Spark connector for Apache Accumulo
+author: Markus Cozowicz, Scott Graham
+---
+
+# Overview
+MASC provides an Apache Spark native connector for Apache Accumulo to integrate the rich Spark machine learning eco-system with the scalable and secure data storage capabilities of Accumulo.
+
+## Major Features
+- Simplified Spark DataFrame read/write to Accumulo using DataSource v2 API
+- Speedup of 2-5x over existing approaches for pulling key-value data into DataFrame format
+- Scala and Python support without overhead for moving between languages
+- Process streaming data from Accumulo without loading it all into Spark memory
+- Push down filtering with a flexible expression language ([JUEL](http://juel.sourceforge.net/)): user can define logical operators and comparisons to reduce the amount of data returned from Accumulo
+- Column pruning based on selected fields transparently reduces the amount of data returned from Accumulo
+- Server side inference: ML model inference can run on the Accumulo nodes using MLeap to increase the scalability of AI solutions as well as keeping data in Accumulo
+
+## Use-cases
+There are many scenarios where use of this connector provides advantages, below we list a few common use-cases.

Review comment:
```suggestion
MASC is advantageous in many use-cases, below we list a few.
```
