Vinod How do maps and reduces manage the caches "themselves" now?
If I understand this feature right, user specifies distributed cache per job, and the tasktracker makes sure that those cache files are present on the local disks before *any* task in that job executes. By "any task," I mean setup, map, reduce, or cleanup task. Setup and cleanup tasks are a particular concern because these are the gating tasks, which according to amdahl's law, restrict the scalability of the parallel application. The usage scenario is this. A star-join, where the map task filters, projects, or and transforms, and a reduce task does the join of a large (1 TB) fact table and several smaller (total 4 GB) dimension tables. The dimention tables evolve slowly, say, once a month they are updated. Whereas, the fact tables are updated hourly. So, fact tables are partitioned before the reduce stage, and the reducers need dimension tables locally. Therefore, these dimension tables are fetched by reducers to do the join. If individual reducers fetch dimention tables explicitly, it would be an extra overhead if more than one reducers execute on a single tasktracker. If these tables are specified as job-level cache, then map tasks, which do not need these tables, get blocked unnecessarily when the tasktrackers fetch these into cache. I hope this detailed explanation really explains the use case.
