We recently released an object store connector for Spark. 
https://github.com/SparkTC/stocator
Currently the connector contains a driver for Swift based object stores
(such as SoftLayer or any other Swift cluster), but it can easily support
additional object stores.
There is a pending patch to support the Amazon S3 object store.

The major highlight is that this connector doesn't create any temporary
files, so it achieves very fast response times when Spark persists data
in the object store.
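For example, a Spark job can write its output directly to an object store
path served by the connector, with no temporary files created along the way.
A minimal sketch (the swift2d:// scheme, container and service names here are
illustrative only; see the repository for the exact URL format):

  // sc is an existing SparkContext.
  // Persist an RDD straight into the object store through the connector;
  // the output objects are created directly, without temporary files.
  val data = sc.parallelize(1 to 1000).map(i => "record-" + i)
  data.saveAsTextFile("swift2d://my-container.softlayer/results/records")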
The new connector supports Spark's speculative execution mode and covers
various failure scenarios (such as two Spark tasks writing into the same
object, or partially corrupted data due to runtime exceptions in the Spark
master). It also covers https://issues.apache.org/jira/browse/SPARK-10063
and other known issues.
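Speculative execution itself is a standard Spark setting, so a job writing
through the connector can simply run with it enabled, for example:

  // Enable speculative execution; duplicate task attempts may then write
  // the same output, a case the connector is designed to tolerate.
  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("stocator-speculation-example")
    .set("spark.speculation", "true")
  val sc = new SparkContext(conf)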

The detailed fault tolerance algorithm will be published very soon. For
now, those who are interested can review the implementation in the code itself.

https://github.com/SparkTC/stocator contains all the details on how to set
it up and use it with Spark.
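As a rough sketch of what the setup looks like, the connector is wired in
through Spark's Hadoop configuration. The class name, scheme and configuration
keys below are assumptions for illustration only; please follow the README in
the repository for the authoritative steps:

  // Illustrative configuration; key names and the FileSystem class are assumed.
  val hconf = sc.hadoopConfiguration  // sc is an existing SparkContext
  hconf.set("fs.swift2d.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem") // assumed class
  hconf.set("fs.swift2d.service.softlayer.auth.url", "https://<auth-endpoint>/v2.0/tokens")
  hconf.set("fs.swift2d.service.softlayer.username", "<user>")
  hconf.set("fs.swift2d.service.softlayer.password", "<password>")
  hconf.set("fs.swift2d.service.softlayer.tenant", "<tenant>")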

A series of tests showed that the new connector achieves about 70%
improvement for write operations from Spark to Swift and about 30%
improvement for read operations from Swift into Spark (compared to the
existing driver that Spark uses to integrate with objects stored in Swift).

There is ongoing work to add more coverage and fix some known bugs and
limitations.

All the best
Gil


