On Wed, 27 Jul 2011 10:58:17 -0400, David Rosenstrauch <dar...@darose.net>
wrote:
> There is another, easier approach:  if your app inherits from the Tool 
> class / runs via ToolRunner, then your app can inherit the -libjars 
> command line functionality itself.

This is true; the problem with this approach is that we're also trying to
use cache persistence to our advantage.  Using libjars makes a new HDFS
copy (somewhere under /tmp) of the library every time I submit the job from
the command line, and each new copy has a new timestamp.  When the nodes
check the cache they see that their old copy of the library has an older
timestamp than the new one (despite the fact that they're actually the same
file) and ask for a new copy of the library for the local cache.

The upshot is: using libjars makes at most one local copy per node per job
run, but it will always make a new local copy each job run.  Adding the
library programmatically can use older copies that are still cached from
previous job runs, leading to even less network overhead.
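
To make the difference concrete, here is a rough sketch of the timestamp
check as I understand it.  This is a toy model in plain Java, not Hadoop's
actual cache code: the NodeCache class, the method names and the timestamp
values are all made up for illustration, and it models a single node only.

```java
import java.util.HashMap;
import java.util.Map;

public class CacheTimestampSketch {

    /** Hypothetical stand-in for one node's local cache (not a Hadoop class). */
    static class NodeCache {
        // Library name -> timestamp of the HDFS copy the local copy came from.
        private final Map<String, Long> localTimestamps = new HashMap<>();
        private int fetches = 0;

        /** Fetch only if there is no local copy or the local copy looks older. */
        void localize(String name, long hdfsTimestamp) {
            Long cached = localTimestamps.get(name);
            if (cached == null || cached < hdfsTimestamp) {
                localTimestamps.put(name, hdfsTimestamp);
                fetches++;
            }
        }

        int fetches() {
            return fetches;
        }
    }

    /** libjars-style: every submission re-uploads, so the timestamp changes per run. */
    public static int runWithFreshUploads(int jobRuns, int tasksPerRun) {
        NodeCache cache = new NodeCache();
        for (int run = 0; run < jobRuns; run++) {
            long timestamp = 1000L + run; // assumed: a new HDFS copy each run
            for (int task = 0; task < tasksPerRun; task++) {
                cache.localize("mylib.jar", timestamp);
            }
        }
        return cache.fetches();
    }

    /** Programmatic style: one long-lived HDFS copy whose timestamp never changes. */
    public static int runWithFixedHdfsCopy(int jobRuns, int tasksPerRun) {
        NodeCache cache = new NodeCache();
        long timestamp = 1000L; // assumed: uploaded once, reused by every run
        for (int run = 0; run < jobRuns; run++) {
            for (int task = 0; task < tasksPerRun; task++) {
                cache.localize("mylib.jar", timestamp);
            }
        }
        return cache.fetches();
    }

    public static void main(String[] args) {
        System.out.println("fresh uploads:   " + runWithFreshUploads(3, 4));  // 3
        System.out.println("fixed HDFS copy: " + runWithFixedHdfsCopy(3, 4)); // 1
    }
}
```

With 3 job runs of 4 tasks each, the fresh-upload case fetches once per run
(3 fetches), while the fixed-copy case fetches once in total.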
