Dennis Kubes wrote:
Ok, I have completed the testing and it is working well. One problem, though: I noticed that we are using a distributed cache for the job files. If I am creating new job jar files on the fly, but still copying them to the job.jar location, how is this affected by distributed caching?
The cache will still be effective. Typically, in the course of a job, multiple map tasks and multiple reduce tasks run on each host. The cache retrieves just a single copy of the job's jar for all of these tasks.
However, with a new jar per job, the cache will not be effective across jobs. But this is not nearly as critical as caching across tasks in a job, since there are typically thousands of tasks per job. One could attempt to optimize across jobs, but I think that would be overkill, especially for the first version of this feature.
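To make the distinction concrete, here is a rough sketch of the behavior described above. It is not the actual cache implementation: the class name JobJarCache, the localize method, the job-ID-based key, and the local path are all hypothetical, chosen only to model "one fetch per job per host, no reuse across jobs with a new jar".

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-host cache; an illustration only, not Hadoop's real code.
class JobJarCache {
    // Maps a cache key (assumed here: job ID + jar path) to the local copy's path.
    private final Map<String, String> localCopies = new ConcurrentHashMap<>();
    // Counts simulated fetches from the shared filesystem.
    private final AtomicInteger fetches = new AtomicInteger();

    // Returns the local path of the job's jar, fetching it only on the
    // first request for this key; later tasks of the same job reuse the copy.
    String localize(String jobId, String jarPath) {
        String key = jobId + ":" + jarPath;
        return localCopies.computeIfAbsent(key, k -> {
            fetches.incrementAndGet(); // stands in for the real copy to local disk
            return "/tmp/cache/" + jobId + "/job.jar"; // made-up local path
        });
    }

    // Number of times a jar was actually fetched on this host.
    int fetchCount() {
        return fetches.get();
    }
}
```

With this sketch, a thousand tasks of one job calling localize on the same host trigger a single fetch, while a second job with its own freshly built jar triggers one more fetch. That extra fetch per job is the cross-job cost, and it is small next to the per-task savings.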
Doug
