Hi Liquan,

Sparrow is not currently integrated into the Spark distribution, so if you'd like to use Spark with Sparrow, you need to use a forked version of Spark (https://github.com/kayousterhout/spark/tree/sparrow). The fork is based on an older release, so some work will be needed to bring it up to date with the latest version of Spark; I can help with this.
Unfortunately, there are also a few practical limitations to using Sparrow with Spark that may or may not matter for your target workload. Sparrow distributes scheduling over many Sparrow schedulers, each associated with its own Spark driver (this is where Sparrow's improvements come from -- there is no longer a single driver acting as the bottleneck for your application, and all of the schedulers/drivers share the same slots for scheduling tasks). As a result, data stored in Spark's block manager on one Spark driver (and created as part of a job scheduled by the associated Sparrow scheduler) cannot be accessed by other Spark drivers. If you're storing data in Tachyon or have a workload where different jobs have disjoint working sets, this won't be an issue. (There's a small sketch of this driver isolation below the quoted message.)

-Kay

On Fri, Jun 20, 2014 at 5:47 PM, Liquan Pei <liquan...@gmail.com> wrote:
> Hi
>
> What is the current status of Sparrow integration with Spark? I would like
> to integrate Sparrow with Spark 1.0 on a 100 node cluster. Any suggestions?
>
> Thanks a lot for your help!
> Liquan
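P.S. To make the block-manager isolation concrete, here's a minimal sketch using only the standard Spark API (the object name, app name, and data are placeholders I made up; nothing here is Sparrow-specific):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// One of many drivers, each paired with its own Sparrow scheduler.
object DriverA {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("driver-a"))

    // The cached blocks of this RDD are registered only with *this*
    // driver's block manager. A job submitted through a different
    // Sparrow scheduler runs under a different driver, which has no
    // record of these blocks, so it would recompute the data from
    // scratch rather than reuse the cache.
    val workingSet = sc.parallelize(1 to 1000000).map(_ * 2)
    workingSet.persist(StorageLevel.MEMORY_ONLY)
    println(workingSet.count())

    sc.stop()
  }
}

With a shared store like Tachyon, a second driver could read the same data by path instead of relying on the first driver's cache, which is why that setup sidesteps the problem.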