Hi Chris!

Sorry for the late reply!

Pushing the file into HDFS is clear to me; it can also be done with the
"hadoop fs -put" command (prior to executing the job), which is what I generally use.

The way I access a file in HDFS from a Mapper/Reducer is the following:
FileSystem fs = FileSystem.get(conf);
FSDataInputStream din = fs.open(new Path("/home/akhil1988/sample.txt"));
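
For example (just a sketch; the two fields below are purely illustrative),
FSDataInputStream extends java.io.DataInputStream, so Java primitives can be
read from the stream directly:

// Continuing from the two lines above; the fields are only illustrative.
int numRecords = din.readInt();
double firstValue = din.readDouble();
din.close();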

The method (below) that you gave does not work:
Path cachePath = new Path("hdfs:///home/akhil1988/sample.txt");
BufferedReader wordReader = new BufferedReader(new FileReader(cachePath.toString()));

A file in HDFS cannot be accessed through these standard Java I/O classes; it
has to be accessed via the FileSystem method I described above. And the
FileSystem API is quite limited: it only gives us a stream for reading a data
file (containing Java primitives), not a way to open arbitrary binary files
the way ordinary code expects.
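
For completeness: if all I needed were the plain text contents, I believe
wrapping the HDFS stream like this would work (again just a sketch, using the
same example file), but it still does not give my API the local path that it
expects:

// Sketch only: read a text file from HDFS by wrapping the FSDataInputStream,
// instead of using FileReader (which only understands local-filesystem paths).
FileSystem fs = FileSystem.get(conf);
Path cachePath = new Path("hdfs:///home/akhil1988/sample.txt");
BufferedReader wordReader =
    new BufferedReader(new InputStreamReader(fs.open(cachePath)));
String line;
while ((line = wordReader.readLine()) != null) {
    // process each line here
}
wordReader.close();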

In my specific problem, I am using an API (specific to my research domain)
that takes a path (String) as input and reads data from that path (which
points to a binary file). So I just need a way to access files from the
tasktrackers the way we do with standard Java I/O, and for that the files have
to be present in the local filesystem of the tasktrackers. That is why I am
using DistributedCache.
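
Concretely, this is the kind of usage I am attempting (a sketch; my
understanding is that the "#Config" fragment controls the name of the symlink
created in the task's local working directory):

// Sketch of the DistributedCache usage (needs java.net.URI and
// org.apache.hadoop.filecache.DistributedCache).
// Config.zip is the archive I have already put into HDFS; with createSymlink
// enabled, the unpacked archive should appear as ./Config in the task's
// local working directory.
DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip#Config"), conf);
DistributedCache.createSymlink(conf);

// Inside the map task the files can then be opened with ordinary java.io:
FileInputStream fin = new FileInputStream("Config/file1.config");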

I hope I am clear? If I am wrong anywhere, please let me know.

Thanks,
Akhil





Chris Curtin-2 wrote:
> 
> To push the file to HDFS (put it in the 'a_hdfsDirectory' directory)
> 
> Configuration config = new Configuration();
> FileSystem hdfs = FileSystem.get(config);
> Path srcPath = new Path(a_directory + "/" + outputName);
> Path dstPath = new Path(a_hdfsDirectory + "/" + outputName);
> hdfs.copyFromLocalFile(srcPath, dstPath);
> 
> 
> to read it from HDFS in your mapper or reducer:
> 
> Configuration config = new Configuration();
> FileSystem hdfs = FileSystem.get(config);
> Path cachePath= new Path(a_hdfsDirectory + "/" + outputName);
> BufferedReader wordReader = new BufferedReader(
>         new FileReader(cachePath.toString()));
> 
> 
> 
> On Fri, Jun 26, 2009 at 8:55 PM, akhil1988 <[email protected]> wrote:
> 
>>
>> Thanks Chris for your reply!
>>
>> Well, I could not understand much of what has been discussed on that
>> forum.
>> I am unaware of Cascading.
>>
>> My problem is simple - I want a directory to be present in the local working
>> directory of tasks so that I can access it from my map task in the
>> following
>> manner :
>>
>> FileInputStream fin = new FileInputStream("Config/file1.config");
>>
>> where,
>> Config is a directory which contains many files/directories, one of which
>> is
>> file1.config
>>
>> It would be helpful to me if you can tell me what statements to use to
>> distribute a directory to the tasktrackers.
>> The API doc http://hadoop.apache.org/core/docs/r0.20.0/api/index.html
>> says
>> that archives are unzipped on the tasktrackers but I want an example of
>> how
>> to use this in case of a directory.
>>
>> Thanks,
>> Akhil
>>
>>
>>
>> Chris Curtin-2 wrote:
>> >
>> > Hi,
>> >
>> > I've found it much easier to write the file to HDFS using the API, then
>> pass
>> > the 'path' to the file in HDFS as a property. You'll need to remember
>> to
>> > clean up the file after you're done with it.
>> >
>> > Example details are in this thread:
>> >
>> http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6#
>> >
>> > Hope this helps,
>> >
>> > Chris
>> >
>> > On Thu, Jun 25, 2009 at 4:50 PM, akhil1988 <[email protected]>
>> wrote:
>> >
>> >>
>> >> Please ask any questions if I am not clear above about the problem I
>> am
>> >> facing.
>> >>
>> >> Thanks,
>> >> Akhil
>> >>
>> >> akhil1988 wrote:
>> >> >
>> >> > Hi All!
>> >> >
>> >> > I want a directory to be present in the local working directory of
>> the
>> >> > task for which I am using the following statements:
>> >> >
>> >> > DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"), conf);
>> >> > DistributedCache.createSymlink(conf);
>> >> >
>> >> >>> Here Config is a directory which I have zipped and put at the
>> given
>> >> >>> location in HDFS
>> >> >
>> >> > I have zipped the directory because the API doc of DistributedCache
>> >> > (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says
>> that
>> >> the
>> >> > archive files are unzipped in the local cache directory :
>> >> >
>> >> > DistributedCache can be used to distribute simple, read-only
>> data/text
>> >> > files and/or more complex types such as archives, jars etc. Archives
>> >> (zip,
>> >> > tar and tgz/tar.gz files) are un-archived at the slave nodes.
>> >> >
>> >> > So, from my understanding of the API docs I expect that the
>> Config.zip
>> >> > file will be unzipped to Config directory and since I have SymLinked
>> >> them
>> >> > I can access the directory in the following manner from my map
>> >> function:
>> >> >
>> >> > FileInputStream fin = new FileInputStream("Config/file1.config");
>> >> >
>> >> > But I get the FileNotFoundException on the execution of this
>> statement.
>> >> > Please let me know where I am going wrong.
>> >> >
>> >> > Thanks,
>> >> > Akhil
>> >> >
>> >>
>> >>
>> >>
>> >
>> >
>>
>>
>>
> 
> 

