Re: questions about HDFS file access synchronization

Aaron Kimball Wed, 05 Aug 2009 10:38:26 -0700

On Wed, Aug 5, 2009 at 6:09 AM, Zhang Bingjun (Eddy) <[email protected]> wrote:


> Hi All,
>
> I am quite new to Hadoop. May I ask a simple question about HDFS file
> access synchronization?
>
> For some very typical scenarios below, how does HDFS respond? Is there a
> way to synchronize file access in HDFS?
>
> A tries to read a file currently being written by B.


There is no sync() call in HDFS. A will read whatever portion of B's data
has already been committed to disk by the datanode. How much data that is
remains unspecified, and it may vary depending on which replica of the file
A is reading. After B calls close() on the file, all the data will be
available to A.


>
> A tries to write a file currently being written by B.


This will fail. HDFS does not allow multiple writers to a file. The
FileSystem.create() call used by A to open the file for write access will
throw an IOException.


>
> A tries to write a file currently being read by B.


This will fail. HDFS does not allow file updates, so if the file already
exists and B is reading it, the FileSystem.create() call used by A will fail
with an IOException.
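
To make the three answers above concrete, here is a toy in-memory model of the
rules as described in this thread. It is only a sketch, not HDFS code: the
class SingleWriterModel and its methods are made up for illustration, and it
assumes a create that does not overwrite an existing file.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Toy model of the access rules described in this thread. NOT the HDFS API;
// all names here are hypothetical and exist only for illustration.
final class SingleWriterModel {
    private enum State { OPEN_FOR_WRITE, CLOSED }

    private final Map<String, State> files = new HashMap<>();

    // Assumed behavior: create fails if the path already exists, whether it
    // is still being written by someone else or has already been closed.
    void create(String path) throws IOException {
        State state = files.get(path);
        if (state == State.OPEN_FOR_WRITE) {
            throw new IOException(path + " is already being written");
        }
        if (state == State.CLOSED) {
            throw new IOException(path + " already exists and cannot be updated");
        }
        files.put(path, State.OPEN_FOR_WRITE);
    }

    // The writer closes the file; only one close per create is allowed.
    void close(String path) throws IOException {
        if (files.get(path) != State.OPEN_FOR_WRITE) {
            throw new IOException(path + " is not open for write");
        }
        files.put(path, State.CLOSED);
    }

    // True only after close(): the point at which a reader is guaranteed to
    // see all of the data. Before that, the visible amount is unspecified.
    boolean fullyVisible(String path) {
        return files.get(path) == State.CLOSED;
    }

    public static void main(String[] args) throws IOException {
        SingleWriterModel fs = new SingleWriterModel();
        fs.create("/shared/data");            // B starts writing
        try {
            fs.create("/shared/data");        // A tries to write the same file
        } catch (IOException e) {
            System.out.println("A failed: " + e.getMessage());
        }
        fs.close("/shared/data");             // B finishes
        System.out.println("fully visible: " + fs.fullyVisible("/shared/data"));
    }
}
```

The point of the model is that close() is the only event a reader can rely on;
anything observed before it is best treated as partial.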


>
>
> We plan to put some shared data in HDFS so that multiple applications can
> share the data between them. The ideal case is that the underlying
> distributed file system (HDFS here) will provide file access synchronization
> so that applications know when they can or cannot operate on a certain file.
> Is this way of thinking correct? What is the typical design for this kind of
> application scenario?


You'll have to think carefully. You can't update files. There is also no
equivalent of flock(), so you can't use files as locks for exclusive access
to some part of a workflow. If that's what you need, you may want to look
at the ZooKeeper project and see if you can't integrate ZK into your system.
ZK is specifically designed to handle locking, mutual exclusion, and other
distributed synchronization problems.
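
For a rough idea of how a ZK-style lock works, here is an in-process sketch of
the usual recipe: every client takes a sequence number, and whoever holds the
lowest live number owns the lock. This is a toy model under that assumption,
not the ZooKeeper client API; SequentialLockModel and its methods are
hypothetical names.

```java
import java.util.TreeMap;

// Toy in-process model of a sequence-number lock. NOT the ZooKeeper API;
// all names here are hypothetical and exist only for illustration.
final class SequentialLockModel {
    // Sequence number -> client name, kept sorted so the lowest is first.
    private final TreeMap<Long, String> waiters = new TreeMap<>();
    private long nextSequence = 0;

    // Stands in for registering with the lock service: each client receives
    // a unique, strictly increasing sequence number.
    synchronized long enqueue(String client) {
        long seq = nextSequence++;
        waiters.put(seq, client);
        return seq;
    }

    // The client with the lowest live sequence number holds the lock.
    synchronized boolean holdsLock(long seq) {
        return !waiters.isEmpty() && waiters.firstKey() == seq;
    }

    // Stands in for giving up the lock, or for the client going away.
    synchronized void release(long seq) {
        waiters.remove(seq);
    }

    public static void main(String[] args) {
        SequentialLockModel lock = new SequentialLockModel();
        long a = lock.enqueue("A");
        long b = lock.enqueue("B");
        System.out.println("A holds lock: " + lock.holdsLock(a));
        lock.release(a);
        System.out.println("B holds lock after A releases: " + lock.holdsLock(b));
    }
}
```

In a real deployment the bookkeeping lives in the coordination service rather
than in one process, so a crashed client's entry can be dropped for it.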



>
>
> I am quite confused. Definitely need to read more about HDFS and other
> distributed file systems. But before that, I would appreciate very much the
> input from experts in the mailing list.


http://hadoop.apache.org/common/docs/r0.20.0/hdfs_user_guide.html and
http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html are good
places to start.


>
>
> Thanks a lot!
>
> Best regards,
> Zhang Bingjun (Eddy)
>
> E-mail: [email protected], [email protected], [email protected]
> Tel No: +65-96188110 (M)
>

Cheers,
- Aaron
