You can write to any location you like on the mapper machines, but you have
no guarantee that the data will be there the next time you run your program
(Hadoop can move your tasks around as it pleases, and machines may fail).
You also have the problem that multiple copies of a mapper may be run; you
then have to decide which copy of the data you want to keep.
Generally it is *really* bad practice to do this. If you want data to exist
after your job finishes, you should put it into HDFS. One exception to this
rule would be log files (but you should use the logging framework for that);
another might be caching frequently used elements (but you should use side
data for that).

Take a look at this FAQ: http://wiki.apache.org/hadoop/FAQ#9

In particular, look at the part about *${mapred.output.dir}/_${taskid}*.
Even if you insist on trying to do this, make sure you deal with all of the
issues that are covered in the FAQ entry I am suggesting that you read.
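As a rough, untested sketch of the side-effect-file approach that FAQ entry
describes, using the old mapred API that appears elsewhere in this thread
(the class name and the file name "model.ser" are invented for illustration):

    import java.io.IOException;
    import java.io.ObjectOutputStream;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class SideEffectWriter {
      // Writes one serialized object as a per-task side-effect file.
      public static void writeSideEffect(JobConf job, Object someObject)
          throws IOException {
        // getWorkOutputPath returns the task attempt's work output
        // directory (the _${taskid} side directory mentioned above).
        // Files written here are promoted into the job output directory
        // only if this attempt commits, so failed or duplicate
        // speculative attempts leave nothing behind.
        Path workDir = FileOutputFormat.getWorkOutputPath(job);
        Path outFile = new Path(workDir, "model.ser"); // hypothetical name
        FileSystem fs = outFile.getFileSystem(job);
        FSDataOutputStream out = fs.create(outFile);
        ObjectOutputStream obj = new ObjectOutputStream(out);
        try {
          obj.writeObject(someObject);
        } finally {
          obj.close(); // also closes the underlying stream
        }
      }
    }

Called from the mapper's close() with the JobConf saved in configure(), this
puts the file into the job output directory on HDFS exactly once per
successful task, which sidesteps the duplicate-attempt problem above.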
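Jason's alternative in the quoted thread below (writing scratch data into
the task-local working directory through the local filesystem API, with a
path relative to '.') would look roughly like this; again an untested sketch
with invented class and file names. Remember that the task tracker deletes
the attempt's directory when the task finishes, so this is scratch space
only, not persistent storage:

    import java.io.IOException;
    import java.io.ObjectOutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocalFileSystem;
    import org.apache.hadoop.fs.Path;

    public class LocalScratchWriter {
      // Serializes an object to a file in the task's working directory.
      public static void writeScratch(Configuration conf, Object someObject)
          throws IOException {
        // Each task runs with its working directory set inside one of the
        // mapred.local.dir directories, so a relative path lands on the
        // local disk of whatever node happens to run this task.
        LocalFileSystem localFs = FileSystem.getLocal(conf);
        Path scratch = new Path("scratch.ser"); // relative to '.'
        FSDataOutputStream out = localFs.create(scratch);
        ObjectOutputStream obj = new ObjectOutputStream(out);
        try {
          obj.writeObject(someObject);
        } finally {
          obj.close();
        }
      }
    }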
On Wed, Jul 1, 2009 at 12:36 PM, bonito perdo <[email protected]> wrote:

> I tried to store data in the local directory of each node inside the
> close() function of the mapper. In particular, I want to serialize an
> object and store it in a file (permanently) on the local disk of each
> node that currently executes the map phase.
>
> I use this code:
>
> FileSystem fs = null;
> FSDataOutputStream out;
> ObjectOutputStream obj;
> Path localOutPath;
>
> In the configure() function of the mapper:
>
> // a file inside the local dir, not the dir itself
> localOutPath = new Path(conf.get("mapred.local.dir"), "someObject.ser");
> fs = localOutPath.getFileSystem(conf);
> out = fs.create(localOutPath);
> obj = new ObjectOutputStream(out);
>
> And in the close() function of the mapper:
>
> obj.writeObject(someObject);
> obj.close();
>
> However, after checking mapred.local.dir, nothing is stored there. Having
> read that this directory is deleted after each successful task, I think
> that this might be the reason.
> Nonetheless, I really want to find a way to make each task able to write
> local data to the local filesystem rather than to HDFS.
>
> Thank you.
>
> On Wed, Jul 1, 2009 at 5:30 PM, bonito perdo <[email protected]> wrote:
>
> > Thank you Jason!
> >
> > On Wed, Jul 1, 2009 at 5:26 PM, jason hadoop <[email protected]>
> > wrote:
> >
> >> The directory returned by getWorkOutputPath is a task-specific
> >> directory, to be used for files that should be part of the final
> >> output of the job.
> >>
> >> If you want to write to the task-local directory, use the local file
> >> system API and paths relative to '.'.
> >> The parameter mapred.local.dir will contain the name of the local
> >> directory.
> >>
> >> On Wed, Jul 1, 2009 at 9:19 AM, bonito perdo <[email protected]>
> >> wrote:
> >>
> >> > Thank you for your immediate response.
> >> > In this case, what is the difference with the path obtained from
> >> > FileOutputFormat.getWorkOutputPath(job)? That path refers to HDFS...
> >> >
> >> > Thank you.
> >> >
> >> > On Wed, Jul 1, 2009 at 5:13 PM, jason hadoop
> >> > <[email protected]> wrote:
> >> >
> >> > > The parameter mapred.local.dir controls the directory used by the
> >> > > task tracker for map/reduce jobs' local files.
> >> > >
> >> > > The dfs.data.dir parameter is for the datanode.
> >> > >
> >> > > On Wed, Jul 1, 2009 at 8:56 AM, bonito <[email protected]> wrote:
> >> > >
> >> > > > Hello,
> >> > > > I am a bit confused about the local directories where each
> >> > > > map/reduce task can store data.
> >> > > > According to what I have read, dfs.data.dir is the path on the
> >> > > > local file system in which the DataNode instance should store
> >> > > > its data. That is, since we have a number of individual nodes,
> >> > > > this is the place where each node can store its own data. Right?
> >> > > > This data may be part of a, let's say, file stored under the
> >> > > > HDFS namespace?
> >> > > > The value of this property for my configuration is:
> >> > > > /home/bon/my_hdfiles/temp_0.19.1/dfs/data.
> >> > > > As far as I can understand, this path refers to the local "disk"
> >> > > > of each node.
> >> > > >
> >> > > > Moreover, calling FileOutputFormat.getWorkOutputPath(job) we
> >> > > > obtain the Path to the task's temporary output directory for
> >> > > > the map-reduce job. This path is totally different from the
> >> > > > previous one, which confuses me, since the temporary output of
> >> > > > each task should be written locally to the node's disk. The
> >> > > > path I retrieve is:
> >> > > > hdfs://localhost:9000/user/bon/keys_fil.txt/_temporary/_attempt_200907011515_0009_m_000000_0
> >> > > > Does this path refer to the local disk (node)? Or is it
> >> > > > possible that it may refer to another node in the cluster?
> >> > > >
> >> > > > Any clarification would be of great help.
> >> > > >
> >> > > > Thank you.
> >> > >
> >> > > --
> >> > > Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> >> > > http://www.amazon.com/dp/1430219424?tag=jewlerymall
> >> > > www.prohadoopbook.com a community for Hadoop Professionals
> >>
> >> --
> >> Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> >> http://www.amazon.com/dp/1430219424?tag=jewlerymall
> >> www.prohadoopbook.com a community for Hadoop Professionals

--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
