Hey Sumit,

Data locality in Crail simply means that there is a locality API
(getLocations(file, offset, length)) that applications can call to find
out where a particular range of data is physically located, i.e. where
the corresponding blocks are. Crail has such an API, just as HDFS does.
The locality API is typically used by the Spark scheduler (or the Hadoop
scheduler, etc.) when scheduling tasks. For instance, most of these
schedulers try to make sure tasks are executed on the machines that hold
the corresponding data. With Crail this happens naturally if you put the
data into Crail and use Crail as an input source through the HDFS adaptor.
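If you want to query those locations yourself and feed them into Spark's
locality preferences, a rough sketch in Scala through the standard Hadoop
FileSystem API (which the Crail HDFS adaptor implements) could look like
the following. The crail:// URI, host/port and file paths are placeholders
for your deployment, not something Crail prescribes:

  import java.net.URI
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.SparkContext

  // Open the Crail namespace through the HDFS adaptor; the URI is a
  // placeholder for whatever your deployment/core-site.xml uses.
  val fs = FileSystem.get(new URI("crail://namenode:9060"), new Configuration())

  // Where do the blocks of this (hypothetical) file physically live?
  val status = fs.getFileStatus(new Path("/data/part-00000"))
  val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
  blocks.foreach(b => println(b.getHosts.mkString(", ")))

  // Build an RDD of file paths, each carrying a locality preference,
  // so Spark tries to schedule each task next to its data.
  val sc: SparkContext = ??? // wire in your existing SparkContext here
  val pathsWithHosts: Seq[(String, Seq[String])] =
    Seq("/data/part-00000", "/data/part-00001").map { p =>
      val s = fs.getFileStatus(new Path(p))
      val hosts = fs.getFileBlockLocations(s, 0, s.getLen)
                    .flatMap(_.getHosts).distinct.toSeq
      (p, hosts)
    }
  val rdd = sc.makeRDD(pathsWithHosts) // Seq[(T, Seq[String])] overload

The hosts collected this way can go straight into the location preferences
of your RDD partitions, much like the path-to-node mapping you describe
for your NFS setup.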

There is also an fsck tool that lets you query the physical locations of
the blocks of a file from the shell; try ./bin/crail fsck

Let me know if that helps.

-Patrick



On Thu, Jun 14, 2018 at 7:58 PM, Sumit Sen <sumit.1....@gmail.com> wrote:

> I am looking into using Crail to store and access data on my compute
> cluster that has nodes connected with InfiniBand.  I am trying to
> understand how to make use of data locality in Crail and would appreciate
> everyone's suggestions.  I'll describe my existing use case to illustrate
> what I am trying to achieve.
>
> Today I am populating data into a ram disk on 8 nodes, which are mapped
> into a single namespace using NFS.  I then have a Spark job that is
> partitioned and locality-aware so that it can execute an algorithm on
> this data locally.  The Spark RDD actually only contains the paths of
> the data files, not the data itself.  I group files that are stored on
> the same node into one partition and map the paths to the locality
> preference of the RDD partition.  (I follow this odd approach because
> the algorithm is in C++ and reads the data directly.)
>
> I would like to replace the NFS solution with Crail, which seems more
> flexible and configurable (I want to support hybrid Ethernet/InfiniBand
> clusters).  What I don't understand yet is:
> - if I write a file to Crail, what is its locality?
> - if I read a file from Crail, how can I know where it is stored and use
> this information to feed Spark's locality preference?  I.e., how would I
> construct an RDD similar to the one I described above?
> - is there a better way to use Crail, Spark and C++ together?  (I am trying
> to avoid sending the data to C++ via the RDD.pipe() method).
>
> Thanks,
> Sumit
>
