Re: Hadoop streaming cacheArchive

2008-03-20 Thread Norbert Burger
Amareshwari, thanks for your help.  This turned out to be user error: when
packaging my JAR, I inadvertently included a lib directory, so the libraries
actually existed in HDFS as ./lib/lib/perl..., when I was only expecting
./lib/perl...

Thanks again,
Norbert
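
For anyone hitting the same problem, the nested-directory mistake can be
caught before uploading the archive. The following is only a minimal sketch:
the helper names and the demo file names are hypothetical and not part of
Hadoop. It lists the top-level names inside a JAR and flags the case where
one of them matches the link name given after '#', which is what produces
./lib/lib/... once the archive is unpacked.

```python
import io
import zipfile


def top_level_entries(jar):
    """Return the sorted top-level names inside a JAR/ZIP (path or file object)."""
    with zipfile.ZipFile(jar) as archive:
        return sorted({name.split("/", 1)[0] for name in archive.namelist() if name})


def has_same_named_top_level(jar, link_name):
    """True if the archive contains a top-level entry named like the link.

    With a link called 'lib', such an archive unpacks to lib/lib/... in the task.
    """
    return link_name in top_level_entries(jar)


def build_demo_jar(names):
    """Build a small in-memory archive holding empty files (demo only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        for name in names:
            archive.writestr(name, "")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    # Hypothetical layouts: the first repeats the mistake described above.
    nested = build_demo_jar(["lib/perl/Util.pm"])
    flat = build_demo_jar(["perl/Util.pm"])
    print("nested layout flagged:", has_same_named_top_level(nested, "lib"))
    print("flat layout flagged:", has_same_named_top_level(flat, "lib"))
```

Checking for a matching top-level name, rather than requiring it to be the
only entry, keeps the check useful for archives built with the jar tool,
which also adds a top-level META-INF directory.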

On Thu, Mar 20, 2008 at 3:03 AM, Amareshwari Sriramadasu <
[EMAIL PROTECTED]> wrote:

> Norbert Burger wrote:
> > I'm trying to use the cacheArchive command-line options with the
> > hadoop-0.15.3-streaming.jar.  I'm using the option as follows:
> >
> > -cacheArchive hdfs://host:50001/user/root/lib.jar#lib
> >
> > Unfortunately, my PERL scripts fail with an error consistent with not
> > being able to find the 'lib' directory (which, as I understand, should
> > point back to an extracted version of the lib.jar).
> >
> >
> Here, lib is created as a symlink in task's working directory. It will
> have the jar file and extracted version of jar file.
> Where are your PERL scripts searching for the lib? Is '.' included in
> your classpath.
> Otherwise you can use "mapred.job.classpath.archives" config item, this
> adds the files to the classpath and also to the distributed cache
> you can use
>   -jobconf
> "mapred.job.classpath.archives=hdfs://host:50001/user/root/lib.jar#lib"
> > I know that the original JAR exists in HDFS, but I don't see any
> > evidence of lib.jar or a link called 'lib' inside my job.jar.
> link 'lib' will not be part of job.jar, but it will be distributed on
> all the nodes during task launch and task's current working directory
> will have the link 'lib' to the jar on cache.
> > How can I troubleshoot
> > cacheArchive further?  Should the files/dirs specified via cacheArchive
> > be contained inside the job.jar?  If not, where should they be in HDFS?
> >
> >
> They can be anywhere on HDFS. You need give the complete path to add it
> to the cache.
> > Thanks for any help.
> >
> > Norbert
> >
> >
>
>


Re: Hadoop streaming cacheArchive

2008-03-20 Thread Amareshwari Sriramadasu

Norbert Burger wrote:

> I'm trying to use the cacheArchive command-line options with the
> hadoop-0.15.3-streaming.jar.  I'm using the option as follows:
>
> -cacheArchive hdfs://host:50001/user/root/lib.jar#lib
>
> Unfortunately, my PERL scripts fail with an error consistent with not being
> able to find the 'lib' directory (which, as I understand, should point back
> to an extracted version of the lib.jar).

Here, 'lib' is created as a symlink in the task's working directory. It will
contain the jar file and the extracted version of the jar file.
Where are your PERL scripts searching for the lib? Is '.' included in
your classpath?
Otherwise you can use the "mapred.job.classpath.archives" config item, which
adds the files to the classpath and also to the distributed cache.

You can use
  -jobconf "mapred.job.classpath.archives=hdfs://host:50001/user/root/lib.jar#lib"

> I know that the original JAR exists in HDFS, but I don't see any evidence of
> lib.jar or a link called 'lib' inside my job.jar.

The link 'lib' will not be part of job.jar, but it will be distributed to
all the nodes during task launch, and the task's current working directory
will have the link 'lib' pointing to the jar in the cache.
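
To make the role of the '#lib' suffix concrete, here is a small sketch that
splits the URI from this thread into the archive location and the link name.
It only illustrates the convention described above; the function name is
hypothetical and this is not how Hadoop itself parses the option.

```python
from urllib.parse import urlsplit


def split_cache_uri(uri):
    """Split a cacheArchive-style URI into (archive location, link name)."""
    parts = urlsplit(uri)
    if not parts.fragment:
        raise ValueError("no '#name' fragment, so no link name was given")
    # Drop the fragment to recover the location of the archive itself.
    archive = parts._replace(fragment="").geturl()
    return archive, parts.fragment


if __name__ == "__main__":
    archive, link = split_cache_uri("hdfs://host:50001/user/root/lib.jar#lib")
    print("archive on HDFS:", archive)
    print("link in task working directory:", link)
```

Scripts should therefore resolve their libraries relative to the link name in
the current working directory, not relative to anything inside job.jar.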

> How can I troubleshoot
> cacheArchive further?  Should the files/dirs specified via cacheArchive be
> contained inside the job.jar?  If not, where should they be in HDFS?

They can be anywhere on HDFS. You need to give the complete path to add it
to the cache.

> Thanks for any help.
>
> Norbert

  




Hadoop streaming cacheArchive

2008-03-19 Thread Norbert Burger
I'm trying to use the cacheArchive command-line options with the
hadoop-0.15.3-streaming.jar.  I'm using the option as follows:

-cacheArchive hdfs://host:50001/user/root/lib.jar#lib

Unfortunately, my PERL scripts fail with an error consistent with not being
able to find the 'lib' directory (which, as I understand, should point back
to an extracted version of the lib.jar).

I know that the original JAR exists in HDFS, but I don't see any evidence of
lib.jar or a link called 'lib' inside my job.jar.  How can I troubleshoot
cacheArchive further?  Should the files/dirs specified via cacheArchive be
contained inside the job.jar?  If not, where should they be in HDFS?

Thanks for any help.

Norbert
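
On the troubleshooting question, one low-tech approach is to have the task
itself report what it sees at the link. The sketch below is an assumption
about how one might do that, with a hypothetical helper name; it checks
whether a name in the working directory is a symlink and lists what sits
directly underneath it.

```python
import os


def describe_cache_link(link_name, base_dir="."):
    """Report whether base_dir/link_name is a symlink and what sits directly under it."""
    path = os.path.join(base_dir, link_name)
    info = {
        "path": path,
        "is_symlink": os.path.islink(path),
        # Resolve the link target only if something exists at that name.
        "target": os.path.realpath(path) if os.path.lexists(path) else None,
        "entries": [],
    }
    if os.path.isdir(path):  # isdir follows the symlink to the unpacked archive
        info["entries"] = sorted(os.listdir(path))
    return info


if __name__ == "__main__":
    import sys

    # In a streaming task, stdout carries the job's data, so report on stderr.
    print(describe_cache_link("lib"), file=sys.stderr)
```

If the reported entries contain another directory with the same name as the
link, the archive was most likely packaged with an extra wrapper directory.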