I am looking into using Crail to store and access data on my compute cluster, whose nodes are connected via InfiniBand. I am trying to understand how to make use of data locality in Crail and would appreciate everyone's suggestions. I'll describe my existing use case to illustrate what I am trying to achieve.
Today I populate data into ram disks on 8 nodes, which are mapped into a single namespace using NFS. I then have a Spark job that is partitioned and locality-aware, so that it can execute an algorithm on this data locally. The Spark RDD actually contains only the paths of the data files, not the data itself: I group files stored on the same node into one partition and set that partition's locality preference to the node holding its files (a minimal sketch of this is in the P.S. below). I follow this somewhat odd approach because the algorithm is written in C++ and reads the data directly.

I would like to replace the NFS solution with Crail, which seems more flexible and configurable (I want to support hybrid Ethernet/InfiniBand clusters). What I don't understand yet is:

- If I write a file to Crail, what is its locality?
- If I read a file from Crail, how can I find out where it is stored and use that information to feed Spark's locality preference? I.e., how would I construct an RDD similar to the one I described above?
- Is there a better way to use Crail, Spark and C++ together? (I am trying to avoid sending the data to C++ via the RDD.pipe() method.)

Thanks,
Sumit
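P.S. In case it helps clarify the question, here is a minimal sketch of how I build the locality-aware RDD today. The hostnames, file paths, and app name are made up; the real inventory comes from how I populate the ram disks. It relies on SparkContext.makeRDD, which accepts (item, preferredHosts) pairs and creates one partition per item:

import org.apache.spark.{SparkConf, SparkContext}

object LocalityRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("locality-rdd-sketch"))

    // Hypothetical inventory: which node's ram disk holds which files.
    // Today I know this because I populate the ram disks myself.
    val filesByHost: Map[String, Seq[String]] = Map(
      "node01" -> Seq("/data/part-000", "/data/part-001"),
      "node02" -> Seq("/data/part-002", "/data/part-003")
    )

    // makeRDD creates one partition per (item, hosts) pair, so each
    // host's file list becomes one partition scheduled preferentially
    // on that host.
    val pathRdd = sc.makeRDD(
      filesByHost.toSeq.map { case (host, paths) => (paths, Seq(host)) }
    )

    // Each task then runs on (or near) the node holding its files and
    // hands the local paths to the C++ algorithm; in the real job this
    // println is replaced by a JNI call or a subprocess launch.
    pathRdd.foreach(paths => println(s"processing ${paths.mkString(",")}"))

    sc.stop()
  }
}

With Crail, what I am missing is the equivalent of filesByHost: some way to ask, for a file I wrote, on which node(s) its data actually lives, so I can feed that into the locality preferences above.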