Re: [jira] Commented: (MAPREDUCE-989) Allow segregation of DistributedCache for maps and reduces

Milind A Bhandarkar Wed, 16 Sep 2009 21:47:15 -0700

Vinod

How do maps and reduces manage the caches "themselves" now?


If I understand this feature right, user specifies distributed cache  
per job, and the tasktracker makes sure that those cache files are  
present on the local disks before *any* task in that job executes. By  
"any task," I mean setup, map, reduce, or cleanup task. Setup and  
cleanup tasks are a particular concern because these are the gating  
tasks, which according to amdahl's law, restrict the scalability of  
the parallel application.

The usage scenario is this. A star-join, where the map task filters,  
projects, or and transforms, and a reduce task does the join of a  
large (1 TB) fact table and several smaller (total 4 GB) dimension  
tables. The dimention tables evolve slowly, say, once a month they are  
updated. Whereas, the fact tables are updated hourly.

So,  fact tables are partitioned before the reduce stage, and the  
reducers need dimension tables locally. Therefore, these dimension  
tables are fetched by reducers to do the join. If individual reducers  
fetch dimention tables explicitly, it would be an extra overhead if  
more than one reducers execute on a single tasktracker. If these  
tables are specified as job-level cache, then map tasks, which do not  
need these tables, get blocked unnecessarily when the tasktrackers  
fetch these into cache.

I hope this detailed explanation really explains the use case.

Re: [jira] Commented: (MAPREDUCE-989) Allow segregation of DistributedCache for maps and reduces

Reply via email to