Sorry if this is a common question. I looked through the docs, the code
and the mail archives and did not find anything that fully answered it.
Say I have 3 files, A, B and C. Each file holds a set of records I want
to parse, and the records are indexed the same way across files, i.e.
the second record in A maps to the second record in B, which maps to
the second record in C. However, the record lengths differ from file to
file, so the file sizes and block counts differ as well. Depending on
the job run, I want to be able to read one, two, or all three of the
files. What I would like is for the corresponding records from each
file to end up on the same host, so that access is always local.
Ideally the block size would differ per file, so that the first block
of A holds the same number of records as the first block of B, and so
on. So my questions are:
1) I notice that when creating a file I can specify a block size. If
the records are fixed size, that would let me manually arrange equal
record counts per block, but is the block size just a hint to the
system? Will it always be honored, or could the system use a different
block size under certain conditions?
2) Even if I can get the record counts per block to match across the
files, is there a way to make sure that the corresponding blocks across
files are located on the same node? If so, is there a way to prevent
them from being separated if the system rebalances data blocks?
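To make the arithmetic behind question 1 concrete, here is a rough
sketch in Python of how I would pick the per-file block sizes. It
assumes fixed-size records, and it also assumes that a block size has
to be a multiple of a 512-byte checksum chunk; that constraint and the
value 512 are my assumptions, not something I have confirmed. The
record lengths (100, 24 and 8 bytes) and the function names are made up
for illustration.

```python
from functools import reduce
from math import gcd

# Assumed checksum chunk size in bytes (my assumption, not confirmed).
CHECKSUM_CHUNK = 512


def _lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def aligned_block_sizes(record_lengths, min_records_per_block,
                        chunk=CHECKSUM_CHUNK):
    """Pick one record count per block that works for every file.

    record_lengths maps a file name to its fixed record length in bytes.
    Returns (records_per_block, {file name: block size in bytes}), where
    every block size is a whole multiple of `chunk`.
    """
    # For records_per_block * length to be a multiple of chunk, the
    # record count must be a multiple of chunk / gcd(chunk, length).
    steps = [chunk // gcd(chunk, length)
             for length in record_lengths.values()]
    step = reduce(_lcm, steps, 1)

    # Round the requested record count up to the next multiple of step.
    records_per_block = -(-min_records_per_block // step) * step

    sizes = {name: records_per_block * length
             for name, length in record_lengths.items()}
    return records_per_block, sizes


def block_of_record(record_index, records_per_block):
    """Zero-based block index holding a zero-based record index.

    The result is the same for every file, because each block holds the
    same number of records regardless of the record length.
    """
    return record_index // records_per_block


if __name__ == "__main__":
    rpb, sizes = aligned_block_sizes({"A": 100, "B": 24, "C": 8},
                                     1_000_000)
    print(rpb, sizes)
```

With these sizes, record i would fall in block i // records_per_block
in all three files, which is the alignment I am after. It says nothing
about which node each block lands on, which is what question 2 is
about.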
Thanks for any help.
- Multiple file join in map/reduce Ari Cooperman