On Wed, 27 Jul 2011 10:58:17 -0400, David Rosenstrauch <dar...@darose.net> wrote:
> There is another, easier approach: if your app inherits from the Tool
> class / runs via ToolRunner, then your app can inherit the -libjars
> command line functionality itself.

This is true; the problem with this approach is that we're also trying to use cache persistence to our advantage. Every time I submit the job from the command line, -libjars makes a new HDFS copy of the library (somewhere under /tmp), and that copy has a new timestamp. When the nodes check their local cache, they see that their old copy of the library has an older timestamp than the new one (despite the fact that the two are actually the same file) and ask for a new copy of the library for the local cache. The upshot is: using -libjars makes at most one local copy per node per job run, but it will always make a new local copy on each job run. Adding the library programmatically lets the nodes reuse older copies that are still cached from previous job runs, leading to even less network overhead.
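
To make the timestamp argument concrete, here is a toy model in plain Java. It is only a sketch of the behaviour described above, not Hadoop code: NodeCache, localize, the jar name, and the timestamp values are all hypothetical, and the freshness rule (reuse the local copy unless it is older than the HDFS copy) is an assumption taken from the description in this message.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of one node's local library cache; not Hadoop's implementation. */
class CacheTimestampSketch {

    /** Hypothetical stand-in for a node's local cache: library name -> timestamp of cached copy. */
    static final class NodeCache {
        private final Map<String, Long> cachedTimestamps = new HashMap<>();
        private int downloads = 0;

        /** Returns true if the node had to fetch a fresh copy from HDFS. */
        boolean localize(String library, long hdfsTimestamp) {
            Long cached = cachedTimestamps.get(library);
            if (cached != null && cached >= hdfsTimestamp) {
                return false; // local copy is not older than the HDFS copy: reuse it
            }
            cachedTimestamps.put(library, hdfsTimestamp);
            downloads++;
            return true;
        }

        int downloads() {
            return downloads;
        }
    }

    /** Each job run re-uploads the jar, so the HDFS timestamp changes on every run. */
    static int simulateLibjars(int runs) {
        NodeCache node = new NodeCache();
        long timestamp = 1000L;
        for (int run = 0; run < runs; run++) {
            timestamp += 1; // fresh upload for this submission
            node.localize("mylib.jar", timestamp); // first task of the job on this node
            node.localize("mylib.jar", timestamp); // second task reuses the same copy
        }
        return node.downloads();
    }

    /** The jar is uploaded once and every job run refers to that same HDFS copy. */
    static int simulateFixedCopy(int runs) {
        NodeCache node = new NodeCache();
        long timestamp = 1000L; // uploaded once, never touched again
        for (int run = 0; run < runs; run++) {
            node.localize("mylib.jar", timestamp);
            node.localize("mylib.jar", timestamp);
        }
        return node.downloads();
    }

    public static void main(String[] args) {
        System.out.println("re-uploaded each run, downloads over 3 runs: " + simulateLibjars(3));
        System.out.println("fixed HDFS copy, downloads over 3 runs: " + simulateFixedCopy(3));
    }
}
```

In this model the re-uploaded jar costs one download per node per job run, while the fixed copy costs one download per node in total, for as long as the cached copy survives.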