Hi,

I have some questions regarding HBase and locality issues - I'd appreciate
some explanations and clarifications.

I understand HBase is built on top of HDFS.
Say an HRegionServer creates an HStoreFile where it puts some column family
content. Does HDFS split the file into multiple HDFS blocks and distribute
them across a bunch of machines? If that's the case, when the region server
needs to actually access the file, does HDFS underneath communicate with
remote machines to read the various blocks? Doesn't that hurt performance,
since there is no locality in data access (the region server actually works
on remote blocks)?
Or is the HStoreFile implemented in some other way that writes it to the
local disks of the region server machine that owns it? If so, how? Does
this code override the HDFS behavior?

Another related question is about MapReduce and HBase. When a MapReduce
job runs on top of HBase, i.e. gets a table as input, how does the
MapReduce framework know how to schedule map tasks near the data? Does it
have any knowledge of the actual location of the data pieces composing the
table to be processed?

I'd also be glad to get pointers to the related source code (classes).

Thanks for any information,
Naama

-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)
