On Wed, Aug 5, 2009 at 6:09 AM, Zhang Bingjun (Eddy) <[email protected]> wrote:
> Hi All,
>
> I am quite new to Hadoop. May I ask a simple question about HDFS file
> access synchronization?
>
> For some very typical scenarios below, how does HDFS respond? Is there a
> way to synchronize file access in HDFS?
>
> A tries to read a file currently being written by B.

There is no sync() call in HDFS. A will read whatever portion of B's data has already been committed to disk by the datanode. It is unspecified how much data this will contain. It may be variable depending on which replica of the file A is reading. After B close()'s the file, all the data will be available to A.

> A tries to write a file currently being written by B.

This will fail. HDFS does not allow multiple writers to a file. The FileSystem.create() call used by A to open the file for write access will throw IOException.

> A tries to write a file currently being read by B.

This will fail. HDFS does not allow file updates, so if the file already exists and B is reading it, the FileSystem.create() call used by A will fail with IOException.

> We plan to put some shared data in HDFS so that multiple applications can
> share the data between them. The ideal case is that the underlying
> distributed file system (HDFS here) will provide file access
> synchronization so that applications know when they can or cannot operate
> on a certain file.
> Is this way of thinking correct? What is the typical design for this kind
> of application scenario?

You'll have to think carefully. You can't update files. There is also no equivalent of flock(), so you can't use files as locks for exclusive access to some part of a work flow. If that's what you need, you may want to look at the ZooKeeper project and see if you can't integrate ZK into your system. ZK is specifically designed to handle locking, mutual exclusion, and other distributed synchronization problems.

> I am quite confused. Definitely need to read more about HDFS and other
> distributed file systems. But before that, I would appreciate very much the
> input from experts in the mailing list.

http://hadoop.apache.org/common/docs/r0.20.0/hdfs_user_guide.html and
http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html are good places to start.

> Thanks a lot!
>
> Best regards,
> Zhang Bingjun (Eddy)
>
> E-mail: [email protected], [email protected], [email protected]
> Tel No: +65-96188110 (M)

Cheers,
- Aaron
