Multiple file join in map/reduce

Ari Cooperman Mon, 21 Sep 2009 18:13:10 -0700

Sorry if this question is common. I looked through the docs, code and mail archives and did not find anything that answered these questions.

Say I have 3 files A, B & C. Each file has a set of records I want to parse through, and the records are already indexed the same way across files, i.e. the second record in A maps to the second record in B, which maps to the second record in C. However, the record lengths in each file are different, and thus the file sizes and block counts are different. I want to be able to read one, two, or all of the files depending on my needs for the job run. What I would like to happen is that the corresponding records from each file end up on the same host, so that access is always local. So ideally the block sizes would be different for each file, so that the first block for A has the same record count as the first block for B, etc. So my questions are:

1) I notice that on creating a file I can give a block size to the file, which would, if the records are fixed size, allow me to manually create equal record counts, but is this just a hint to the system? Will it be honored, or could it use a different block size under certain conditions?

2) Even if I can get the proper record counts split across the files, is there a way to make sure that the corresponding blocks across files are located on the same node? If so, is there a way to prevent them from being split up if the system rebalances data blocks?
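
To make the arithmetic behind question 1 concrete, here is a minimal sketch of how I would pick the per-file block sizes, assuming fixed-size records. The record sizes (100, 256 and 40 bytes), the 64 MB target and the function names are made up for illustration. The sketch also assumes that a block size has to be a multiple of the checksum chunk size (io.bytes.per.checksum, 512 bytes by default); that is my understanding of the client-side check, not something I have confirmed for every version.

```python
from math import gcd


def records_per_block(record_sizes, target_block_bytes, chunk_bytes=512):
    """Return one record count per block that works for every file.

    The count is chosen so that count * record_size is a multiple of
    chunk_bytes for every file (assumed HDFS constraint), and so that the
    file with the largest records gets a block no bigger than the target.
    """
    step = 1
    for size in record_sizes:
        # Smallest count that makes count * size a multiple of chunk_bytes.
        need = chunk_bytes // gcd(size, chunk_bytes)
        # Least common multiple of the counts needed so far.
        step = step * need // gcd(step, need)
    largest = max(record_sizes)
    # Use as many whole steps as fit in the target; at least one.
    multiples = max(1, target_block_bytes // (largest * step))
    return step * multiples


def block_sizes(record_sizes_by_file, target_block_bytes, chunk_bytes=512):
    """Map each file name to the block size holding the shared record count."""
    count = records_per_block(
        list(record_sizes_by_file.values()), target_block_bytes, chunk_bytes
    )
    return {name: count * size for name, size in record_sizes_by_file.items()}


if __name__ == "__main__":
    # Hypothetical record sizes in bytes for files A, B and C.
    sizes = block_sizes({"A": 100, "B": 256, "C": 40}, 64 * 1024 * 1024)
    for name, block_bytes in sizes.items():
        print(name, block_bytes)
```

With these block sizes, block k of A, B and C would cover the same range of record numbers and no record would straddle a block boundary. This only covers the sizing; it does nothing about where the blocks are placed, which is question 2.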


Thanks for any help....
