A few of us went to a Hadoop Committers Meeting kindly hosted by Yahoo! yesterday. HBase was represented by Chad Walters, Jim Kellerman, Ryan Rawson, and myself. The rest of the attendees were a bunch of the Y! HDFS team, plus meeting leader MapReducer Owen O'Malley, along with Facebookers (Dhruba, Ashish, etc.) and Luke Liu of HyperTable/Zvents.
The meeting topic was append/flush/sync in HDFS. After some back and forth over a set of slides presented by Sanjay on work being done by Hairong as part of HADOOP-5744, "Revising append", the room settled on API3 from the list of options below as the priority feature needed in Hadoop 0.21.0: readers must be able to read up to the writer's last successful flush, and an inexact file length is acceptable. Hairong's revisit builds on the work done in HADOOP-4379, etc., but is a different effort. It was presented that the latest HADOOP-4379 patch works pretty well and is a million times better than nothing, though there is some lag while the lease is recovered (Hairong and Dhruba, chatting, think that the cycle of waiting on a successful append so we can then close, and then open to read, may not actually be necessary; will update HADOOP-4379 after trying it out). Dhruba notes that HADOOP-4379 is not enough; HADOOP-4663 is also needed. We need to test, but from the discussion, a patched Hadoop 0.20.0 with a working flush may be possible.

Before the above meeting, a few of us met with the Y! HDFS team to chat. On DFSClient recovery, while in the room, Raghu may have fingered our problem: HADOOP-5903. On xceiver count, because TRUNK uses pread in HDFS, the number of occupied threads in datanodes may actually be much lower, since pread opens a socket, reads, and then closes the socket. We need to test. On occasional slow writes into HDFS, we need to check what the datanode is doing at the time.

St.Ack

Below are the options presented by Sanjay:

> Below is a list of APIs/semantics variations we are considering.
> Which ones do you absolutely need for HBase in the short term, and
> which ones may be useful to HBase in the longer term?
>
> API1: flushes out from the address space of the client into the socket
> to the datanodes.
>
> On the return of the call there is no guarantee that the data is out of
> the underlying node and no guarantee of having reached a DN.
> Readers will see this data soon if there are no failures.
>
> For example, I suspect Scribe and Chukwa will like the lower latency of
> this API and are prepared to lose some records occasionally in case of
> failures. Clearly a journal will not find this API acceptable.
>
> API2: flushes out to at least one datanode and receives an ack.
>
> New readers will eventually see the data.
>
> API3: flushes out to all replicas of the block. The data is in the
> buffers of the DNs but not in the DNs' OS buffers.
>
> New readers will see the data after the call has returned. (HADOOP-5744
> calls API3 "hflush" for now.)
>
> API4: flushes out to all replicas, and all the replicas' DNs have done a
> posix fflush equivalent - i.e., the data is out to the underlying OS
> file system of the DNs.
>
> API5: flushes out to all replicas, and all replicas have done a posix
> fsync equivalent - i.e., the OS has flushed it to the disk device (but
> the disk may have it in its cache).
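The fflush-versus-fsync distinction behind API4 and API5 also exists on a single node, and a minimal JDK-only sketch may make it concrete (no Hadoop involved; the class name `FlushVsSync` and the temp file are made up for illustration). `flush()` moves bytes from a user-space buffer into the OS page cache, where other readers of the file can see them, while `FileDescriptor.sync()` is the fsync that pushes the page cache out to the disk device:

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FlushVsSync {

    // Returns {bytes visible before flush, bytes visible after flush}.
    static long[] run() throws IOException {
        Path p = Files.createTempFile("flush-demo", ".log");
        FileOutputStream fos = new FileOutputStream(p.toFile());
        BufferedOutputStream out = new BufferedOutputStream(fos);
        try {
            out.write("record-1\n".getBytes(StandardCharsets.UTF_8)); // 9 bytes

            // Still in the user-space buffer: a reader of the file sees
            // nothing yet.
            long before = Files.size(p);

            // flush() pushes the buffered bytes into the OS page cache,
            // the single-node analogue of API4's "posix fflush equivalent".
            // Readers now see the data, but a crash could still lose it.
            out.flush();
            long after = Files.size(p);

            // sync() asks the OS to push the page cache out to the disk
            // device, the single-node analogue of API5's "posix fsync
            // equivalent". As the note says, the disk's own write cache
            // may still hold the data.
            fos.getFD().sync();

            return new long[] { before, after };
        } finally {
            out.close();
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        long[] sizes = run();
        System.out.println("before flush: " + sizes[0]
                + ", after flush: " + sizes[1]);
    }
}
```

APIs 4 and 5 are this same pair of operations performed on every replica's DN, which is why they cost so much more than API3's flush into DN process buffers.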