Hello! I'm also interested in this subject. I have read the following in the "Staging" section of the HDFS Architecture Guide (http://hadoop.apache.org/common/docs/current/hdfs_design.html):
"A client request to create a file does not reach the NameNode immediately. In fact, initially the HDFS client caches the file data into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it. The NameNode responds to the client request with the identity of the DataNode and the destination data block. Then the client flushes the block of data from the local temporary file to the specified DataNode. When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode. The client then tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost.

The above approach has been adopted after careful consideration of target applications that run on HDFS. These applications need streaming writes to files. If a client writes to a remote file directly without any client side buffering, the network speed and the congestion in the network impacts throughput considerably. This approach is not without precedent. Earlier distributed file systems, e.g. AFS, have used client side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads."

Also, on a forum (http://web.archiveorange.com/archive/v/JJ4pfwI4PD9PHqvjfMxk), Allen Wittenauer states:

"When you write on a machine running a datanode process, the data is *always* written locally first. This is to provide an optimization to the MapReduce framework. The lesson here is that you should *never* use a datanode machine to load your data. Always do it outside the grid."

Therefore, I'm puzzled by your answers.
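The staged-write flow that the guide describes can be sketched as a toy simulation. Everything here is invented for illustration (the class name, the method, and the 64-byte "block size" — the real default block size at the time was 64 MB); it only models the flush pattern the paragraph describes: buffer writes locally until a full block accumulates, flush whole blocks to a DataNode, and flush the remainder on close.

```java
import java.util.ArrayList;
import java.util.List;

public class StagingSketch {
    // Toy block size for illustration; the real HDFS default was 64 MB.
    static final int BLOCK_SIZE = 64;

    /**
     * Simulates the staged write: each element of `writes` is the size of one
     * application write. Returns the sizes of the flushes sent to the DataNode.
     */
    static List<Integer> stagedWrite(int[] writes) {
        List<Integer> flushes = new ArrayList<>();
        int buffered = 0;
        for (int w : writes) {
            buffered += w;                    // application writes land in the local temp file
            while (buffered >= BLOCK_SIZE) {  // one block accumulated: contact the NameNode,
                flushes.add(BLOCK_SIZE);      // then flush a full block to the chosen DataNode
                buffered -= BLOCK_SIZE;
            }
        }
        if (buffered > 0) {
            flushes.add(buffered);            // on close(), remaining un-flushed data is sent
        }
        return flushes;
    }

    public static void main(String[] args) {
        // Three application writes of 50, 30, and 100 bytes.
        System.out.println(stagedWrite(new int[]{50, 30, 100}));
    }
}
```

Note how nothing reaches a DataNode until a full block has accumulated locally; that is exactly the behavior the "Staging" paragraph claims.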
I would like to hear more opinions that will finally clarify this subject. Thank you.

Florin

--- On Tue, 5/31/11, Joey Echeverria <j...@cloudera.com> wrote:

> From: Joey Echeverria <j...@cloudera.com>
> Subject: Re: Query regarding internal/working of hadoop fs -copyFromLocal and fs.write()
> To: mapreduce-u...@hadoop.apache.org
> Cc: hdfs-user@hadoop.apache.org, cdh-u...@cloudera.org
> Date: Tuesday, May 31, 2011, 8:05 PM
>
> They write directly to HDFS, there's no additional buffering on the
> local file system of the client.
>
> -Joey
>
> On Tue, May 31, 2011 at 7:56 PM, Mapred Learn <mapred.le...@gmail.com> wrote:
> > Hi guys,
> > I asked this question earlier but did not get any response. So, posting
> > again. Hope somebody can point to the right description:
> >
> > When you do hadoop fs -copyFromLocal, or use the API to call fs.write()
> > (when Filesystem fs is HDFS), does it write to the local filesystem first
> > before writing to HDFS?
> >
> > I read and found out that it writes to the local file-system until the
> > block size is reached and then writes to HDFS.
> > Wouldn't the HDFS client choke if it writes to the local filesystem while
> > multiple such fs -copyFromLocal commands are running? I thought at least
> > with fs.write(), if you provide a byte array, it should not write to the
> > local file-system?
> >
> > Some places I found out that the HDFS client and datanode communicate
> > through rpc/sockets. Do they write to local file-systems in this case too,
> > or is it just a buffer in memory, and they write directly to HDFS?
> > Could somebody point me to some doc/code where I could find out how fs
> > -copyFromLocal and fs.write() work? Do they write to the local filesystem
> > until the block size is reached and then write to HDFS, or write directly
> > to HDFS?
> >
> > Thanks in advance,
> > -JJ
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
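For what it's worth, Joey's answer matches my reading of the client code in recent Hadoop versions: DFSOutputStream buffers application writes in memory and streams them to the DataNode pipeline in small packets (64 KB by default, dfs.client-write-packet-size) as each packet fills — it does not stage a whole block in a local temporary file first, so the "Staging" paragraph appears to describe older behavior. A toy sketch of that packet granularity, with all names and the 16-byte packet size invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class PacketSketch {
    // Toy packet size; the real client default (dfs.client-write-packet-size) is 64 KB.
    static final int PACKET_SIZE = 16;

    /**
     * Simulates streaming `total` bytes: each packet is shipped to the DataNode
     * pipeline as soon as it fills. Returns the sizes of the packets sent.
     */
    static List<Integer> streamedWrite(int total) {
        List<Integer> sent = new ArrayList<>();
        int buffered = 0;
        for (int i = 0; i < total; i++) {
            buffered++;                    // each written byte lands in an in-memory buffer
            if (buffered == PACKET_SIZE) { // packet full: send it to the pipeline immediately
                sent.add(buffered);
                buffered = 0;
            }
        }
        if (buffered > 0) {
            sent.add(buffered);            // close() flushes the final partial packet
        }
        return sent;
    }

    public static void main(String[] args) {
        System.out.println(streamedWrite(180));
    }
}
```

The contrast with the staged model is granularity and location: data leaves the client as soon as a small in-memory packet fills, long before a full block's worth has accumulated, and the client's local disk is never involved.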