Re: Multiple file join in map/reduce

Eric Sammer Mon, 21 Sep 2009 20:24:11 -0700

Ari Cooperman wrote:
> Sorry if this question is common, I looked through docs, code and mail
> archives and did not find everything that answered these questions......
> 
> Say I have 3 files A, B & C, each file has a set of records I want to
> parse through, and the record location is already indexed the same
> across files, i.e. the second record in A maps to the second record in
> B, which maps to the second record in C. However, the record lengths in
> each file are different and thus the file size and block counts are
> different. I want to be able to sometimes read one, two, or all of the
> files depending on my needs for the job run. What I would like to happen
> is that all the records for each file end up on the same host so that it
> is always local access. So ideally the block sizes would be different
> for each file so that the first block for A has the same record count as
> the first block for B, etc. So my questions are:
> 
> 1) I notice that on creating a file I can give a block size to the file,
> which would, if the records are fixed size, allow me to manually create
> equal record counts, but is this just a hint to the system? Will it be
> honored or could it use a different block size under certain conditions?
> 
> 2) Even if I can get the proper record counts split across the files, is
> there a way to make sure that the corresponding blocks across files are
> located on the same node? If so, is there a way to prevent them from
> being split up if the system rebalances data blocks?


I'm not an HDFS ninja, but I don't believe plain old HDFS will do what
you want in this case. There is (to my knowledge) no way to guarantee
block colocation for unrelated files (or even blocks of the same file) on a given node.
Intuitively, I wouldn't suggest trying either because you're talking
about serious micromanagement. What happens if a node dies? Replica
levels would drop and the name node would do what it could to get them
back to where they should be, but to preserve colocation while observing
things like rack affinity would be tough to get right.

All of that said, I think your best bet, if you really need to
guarantee colocation of records, is to either organize your data a
different way or to use something like HBase to get mostly there (i.e.
sparse, high-dimensional data within a "row"). I still don't think you'll
ever have a "hard" guarantee on locality though.

Either way, I think the result is a sparse high dimensional format
(although HBase doesn't really store it that way - see references
below). This may not really matter in practice if you always "join" the
same way. Even if you didn't use HBase, you could still store all the
data in one large file (again, in a denormalized format) and use fixed
length jumps when you only want the data from file A (record length of A
+ B + C = distance to A record #2 or something like that).
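
To make the fixed-length jump idea concrete, here is a rough sketch in
Python. It is only an illustration, not anything HDFS-specific: the
record lengths (8, 16 and 4 bytes) and the names write_rows and
read_field are made up for the example, and it uses an in-memory or
local seekable stream rather than an HDFS input stream. Each "row" is
just the A, B and C records concatenated, so record N of A starts at
N * (len A + len B + len C).

```python
import io

# Hypothetical fixed record lengths in bytes (assumed for illustration only).
LEN_A, LEN_B, LEN_C = 8, 16, 4

# One denormalized row is A + B + C, so this is the jump between
# consecutive records of the same original file.
ROW_LEN = LEN_A + LEN_B + LEN_C

# Offset and length of each original file's record inside a row.
FIELD_LAYOUT = {
    "A": (0, LEN_A),
    "B": (LEN_A, LEN_B),
    "C": (LEN_A + LEN_B, LEN_C),
}


def write_rows(stream, rows):
    """Write (a, b, c) byte-string triples as fixed-length rows."""
    for a, b, c in rows:
        if (len(a), len(b), len(c)) != (LEN_A, LEN_B, LEN_C):
            raise ValueError("record does not match its fixed length")
        stream.write(a + b + c)


def read_field(stream, record_index, field):
    """Seek straight to one file's record without reading the others."""
    offset, length = FIELD_LAYOUT[field]
    stream.seek(record_index * ROW_LEN + offset)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("record index is past the end of the data")
    return data


if __name__ == "__main__":
    buf = io.BytesIO()
    rows = [
        (b"a%d" % i + b"." * (LEN_A - 2),
         b"b%d" % i + b"." * (LEN_B - 2),
         b"c%d" % i + b"." * (LEN_C - 2))
        for i in range(3)
    ]
    write_rows(buf, rows)
    # Read only the A record and only the C record of row 1.
    print(read_field(buf, 1, "A"))
    print(read_field(buf, 1, "C"))
```

The point is that all three records for a given index sit next to each
other in the same stretch of the file, so they will usually be in the
same block, apart from rows that straddle a block boundary.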

Maybe you could find some HDFS voodoo to do what you want, but it sounds
tough to manage to me.

Other references:

HBase - Data Model - Physical Storage View
http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture

HDFS Design - Data Organization
http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html#Data+Organization

Hadoop: The Definitive Guide - HDFS chapter

Hope this helps!
-- 
Eric Sammer
[email protected]
http://esammer.blogspot.com
