I am looking into using Crail to store and access data on my compute
cluster that has nodes connected with InfiniBand.  I am trying to
understand how to make use of data locality in Crail and would appreciate
everyone's suggestions.  I'll describe my existing use case to illustrate
what I am trying to achieve.

Today I populate data into RAM disks on 8 nodes, which are mapped into a
single namespace using NFS.  I then have a Spark job that is partitioned
and locality-aware so that it can execute an algorithm on this data
locally.  The Spark RDD actually only contains the paths of the data
files, not the data itself.  I group files stored on the same node into
one partition and use that node as the locality preference of the RDD
partition.  (I follow this odd approach because the algorithm is written
in C++ and reads the data files directly.)
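
For concreteness, here is roughly how I build that RDD today.  This is
only a sketch: the hostnames, paths and the native call are placeholders,
not my real code.

  import org.apache.spark.{SparkConf, SparkContext}

  // Placeholder mapping of hostname -> files stored on that node's ram
  // disk; in my setup this comes from how the data was populated.
  val filesByHost: Seq[(String, Seq[String])] = Seq(
    ("node01", Seq("/ramdisk/part-0001.dat", "/ramdisk/part-0002.dat")),
    ("node02", Seq("/ramdisk/part-0003.dat"))
  )

  val sc = new SparkContext(new SparkConf().setAppName("locality-aware-paths"))

  // makeRDD accepts preferred hostnames per element and puts each element
  // in its own partition, so each group of paths is scheduled
  // (preferably) on the node that holds those files.
  val pathsRdd = sc.makeRDD(filesByHost.map { case (host, paths) => (paths, Seq(host)) })

  pathsRdd.foreach { paths =>
    // Hand the local paths to the C++ algorithm (e.g. via JNI); the
    // native code then reads the files directly from the local ram disk.
    println(s"would process ${paths.size} local file(s): ${paths.mkString(",")}")
  }

The question below is essentially how to obtain an equivalent
host-to-file mapping from Crail so I can keep this scheme.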

I would like to replace the NFS solution with Crail, which seems more
flexible and configurable (I want to support hybrid Ethernet/InfiniBand
clusters).  What I don't understand yet is:
- if I write a file to Crail, what is its locality?
- if I read a file from Crail, how can I find out where it is stored and
feed this information into Spark's locality preferences, i.e. how would I
construct an RDD similar to the one described above?
- is there a better way to use Crail, Spark and C++ together?  (I am trying
to avoid sending the data to C++ via the RDD.pipe() method).

Thanks,
Sumit
