I am not sure if you realize it, but HDFS is not VM-integrated. What you are
asking for is support *inside* the Linux kernel for HDFS file systems. I
don't see that happening for the next few years, and probably not ever.
(HDFS is all Java today, and Java is certainly not going to go inside the
kernel.)

The ways to get there are
a) use the hdfs-fuse proxy
b) do this by hand - copy the file onto each individual machine's local
disk, and then mmap the local path (a rough sketch follows below)
c) more or less do the same as (b), but let Hadoop's "DistributedCache"
localize the file for you, and then mmap the local path (also sketched below)
d) don't use HDFS, and instead use something else for this purpose
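For concreteness, here is a minimal sketch of option (b). The paths and class
name are made up for illustration; the point is just Hadoop's
FileSystem.copyToLocalFile followed by a read-only FileChannel.map, so the OS
page cache holds one physical copy of the model no matter how many JVMs map
the same local file.

```java
// Sketch of (b): copy the model out of HDFS to local disk, then mmap it.
// "/models/model.bin" and "/tmp/model.bin" are placeholder paths.
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalMmapModel {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Pull the model down onto this machine's local disk.
        fs.copyToLocalFile(new Path("/models/model.bin"),
                           new Path("/tmp/model.bin"));

        // Map the local copy read-only; the pages live in the OS page cache
        // and are shared by every process that maps the same file.
        RandomAccessFile raf = new RandomAccessFile("/tmp/model.bin", "r");
        FileChannel ch = raf.getChannel();
        MappedByteBuffer model =
            ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        // Note: a single MappedByteBuffer is limited to 2 GB; a larger model
        // would have to be mapped in slices.

        System.out.println("Mapped " + model.capacity() + " bytes");
        raf.close();
    }
}
```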

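And a sketch of option (c): the same mmap, but with the DistributedCache doing
the per-node copy. The driver registers the HDFS file once; each task then
looks up the localized path in setup() and maps it. The class name and the
mapper's type parameters are purely illustrative.

```java
// Sketch of (c): let the DistributedCache localize the model on each node,
// then mmap the localized path once per task JVM.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ModelMapper extends Mapper<LongWritable, Text, Text, Text> {

    // In the driver, register the HDFS file once, e.g.:
    //   DistributedCache.addCacheFile(new URI("/models/model.bin"),
    //                                 job.getConfiguration());

    private MappedByteBuffer model;

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);

        // The framework has already copied the file to local disk; mmap it.
        RandomAccessFile raf =
            new RandomAccessFile(localFiles[0].toString(), "r");
        FileChannel ch = raf.getChannel();
        model = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // ... look things up in 'model' while processing each record ...
    }
}
```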

On Tue, Apr 12, 2011 at 10:40 AM, Benson Margulies <bimargul...@gmail.com> wrote:

> Here's the OP again.
>
> I want to make it clear that my question here has to do with the
> problem of distributing 'the program' around the cluster, not 'the
> data'. In the case at hand, the issue is a system that needs a large
> data resource to do its work. Every instance of the code needs the
> entire model, not just some blocks or pieces.
>
> Memory mapping is a very attractive tactic for this kind of data
> resource. The data is read-only. Memory-mapping it allows the
> operating system to ensure that only one copy of the thing ends up in
> physical memory.
>
> If we force the model into a conventional file (storable in HDFS) and
> read it into the JVM in a conventional way, then we get as many copies
> in memory as we have JVMs.  On a big machine with a lot of cores, this
> begins to add up.
>
> For people who are running a cluster of relatively conventional
> systems, just putting copies on all the nodes in a conventional place
> is adequate.
>
