On Thu, Nov 11, 2010 at 1:26 PM, Thanh Do <than...@cs.wisc.edu> wrote:
> Got it!
>
> Currently, the model is single writer/multiple reader.
> In the GFS paper, I see they have *record append*
> semantics, that is, allowing multiple clients to write to the
> same file. Do you guys have any plan to implement
> this...

Not that I'm aware of - as a community project I can't speak for everyone
else, though :) It's interesting to note that the GFS designers are on
record in an ACM Queue interview [1] saying that this feature was a mistake.
It was too hard to implement correctly, and it has some really strange
semantics that users found difficult to understand (e.g., different replicas
of a block could contain records in different orders!)

[1] http://queue.acm.org/detail.cfm?id=1594206

Todd

> On Thu, Nov 11, 2010 at 3:10 PM, Todd Lipcon <t...@cloudera.com> wrote:
> >
> > On Thu, Nov 11, 2010 at 12:43 PM, Thanh Do <than...@cs.wisc.edu> wrote:
> > >
> > > Thank you all for the clarification, guys.
> > > I also looked at the 0.20-append trunk and see that the order is
> > > totally different.
> > >
> > > One more thing: do you guys plan to implement hsync(), i.e. API3,
> > > in the near future? Are there any classes of application that
> > > require such a strong guarantee?
> >
> > I don't personally have any plans - everyone I've talked to who cares
> > about data durability is OK with potential file truncation if power is
> > lost across all DNs simultaneously.
> >
> > I'm sure there are some applications where this isn't acceptable, but
> > people aren't using HBase for those applications yet :)
> >
> > -Todd
> >
> > > On Thu, Nov 11, 2010 at 2:27 PM, Todd Lipcon <t...@cloudera.com> wrote:
> > > >
> > > > On Thu, Nov 11, 2010 at 11:55 AM, Hairong Kuang <kuang.hair...@gmail.com> wrote:
> > > > >
> > > > > A few clarifications on API2 semantics:
> > > > >
> > > > > 1. The ack gets sent back to the client before a packet gets
> > > > > written to local files.
> > > >
> > > > Ah, I see in trunk this is the case.
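To make the two per-packet orderings concrete, here is a minimal sketch: in trunk the ack is enqueued before the local write, while in 0.20-append it is enqueued only after the write and flush. The class and method names here are hypothetical illustrations, not actual BlockReceiver code.

```java
import java.util.ArrayList;
import java.util.List;

public class PacketOrdering {

    // Trunk-style handling: the ack is enqueued first, so the ack
    // responder may reply to the client before the local write happens.
    static List<String> trunkOrder() {
        List<String> events = new ArrayList<>();
        events.add("enqueue ack");
        events.add("write data+crc");
        events.add("flush to OS buffer");
        return events;
    }

    // 0.20-append-style handling: the ack is enqueued only after the
    // data and checksum have been written and flushed locally.
    static List<String> appendBranchOrder() {
        List<String> events = new ArrayList<>();
        events.add("write data+crc");
        events.add("flush to OS buffer");
        events.add("enqueue ack");
        return events;
    }

    public static void main(String[] args) {
        System.out.println("trunk:       " + trunkOrder());
        System.out.println("0.20-append: " + appendBranchOrder());
    }
}
```

Under the trunk ordering, a DataNode crash between the ack and the write is exactly the window Thanh asks about below; under the 0.20-append ordering, an acked packet is already in the OS buffer cache.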
> > > > In 0.20-append, it's the other
> > > > way around - we only enqueue after the flush.
> > > >
> > > > > 2. Data becomes visible to new readers on the condition that at
> > > > > least one DataNode does not have an error.
> > > > > 3. The reason that the flush is done after a write is more for
> > > > > the purpose of implementation simplification. Currently, readers
> > > > > do not read from the DataNode buffer. They only read from the
> > > > > system buffer. A flush makes the data visible to readers sooner.
> > > > >
> > > > > Hairong
> > > > >
> > > > > On 11/11/10 7:31 AM, "Thanh Do" <than...@cs.wisc.edu> wrote:
> > > > > >
> > > > > > Thanks Todd,
> > > > > >
> > > > > > In HDFS-6313, I see three APIs (sync, hflush, hsync),
> > > > > > and I assume hflush corresponds to:
> > > > > >
> > > > > > *"API2: flushes out to all replicas of the block.
> > > > > > The data is in the buffers of the DNs but not in the DNs' OS
> > > > > > buffers. New readers will see the data after the call has
> > > > > > returned."*
> > > > > >
> > > > > > I am still confused: once the client calls hflush, the client
> > > > > > will wait for all outstanding packets to be acked before
> > > > > > sending any subsequent packet.
> > > > > > But at the DataNode, it is possible that the ack to the client
> > > > > > is sent before the data and checksum are written to the
> > > > > > replica. So if the DataNode crashes just after sending the ack
> > > > > > and before writing to the replica, will the semantics be
> > > > > > violated here?
> > > > > >
> > > > > > Thanks
> > > > > > Thanh
> > > > > >
> > > > > > On Wed, Nov 10, 2010 at 11:11 PM, Todd Lipcon <t...@cloudera.com> wrote:
> > > > > >>
> > > > > >> Nope, flush just flushes the Java-side buffer to the Linux
> > > > > >> buffer cache -- not all the way to the media.
> > > > > >>
> > > > > >> Hsync is the API that will eventually go all the way to disk,
> > > > > >> but it has not yet been implemented.
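The flush-versus-sync distinction Todd describes can be sketched with plain java.io calls: flush() only drains the Java-side buffer into the kernel's buffer cache (a write(2)), while FileDescriptor.sync() is the fsync()-like call that forces the cached pages to the storage media. This is an illustrative sketch, not HDFS code; the method name writeDurably is made up for the example.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;

public class FlushVsSync {

    // Write bytes to a file, first flushing (hflush-like: visible to new
    // readers, but lost on power failure) and then syncing (hsync-like:
    // forced to the device, like fsync()). Returns the file length.
    static long writeDurably(File f, byte[] data) throws Exception {
        try (FileOutputStream fos = new FileOutputStream(f);
             BufferedOutputStream out = new BufferedOutputStream(fos)) {

            out.write(data);

            // Drain the Java buffer into the OS buffer cache only.
            out.flush();

            // Ask the kernel to push the cached pages to the media.
            fos.getFD().sync();
        }
        return f.length();
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("hflush-demo", ".dat");
        f.deleteOnExit();
        System.out.println(writeDurably(f, "packet payload".getBytes("UTF-8")));
    }
}
```

So data.flush and checksumOut.flush on their own give no fsync()-style durability guarantee; that is exactly the gap the unimplemented hsync API is meant to close.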
> > > > > >>
> > > > > >> -Todd
> > > > > >>
> > > > > >> On Wednesday, November 10, 2010, Thanh Do <than...@cs.wisc.edu> wrote:
> > > > > >>>
> > > > > >>> Or, another way to rephrase my question:
> > > > > >>> do data.flush and checksumOut.flush guarantee that the data
> > > > > >>> is synchronized with the underlying disk,
> > > > > >>> just like fsync()?
> > > > > >>>
> > > > > >>> Thanks
> > > > > >>> Thanh
> > > > > >>>
> > > > > >>> On Wed, Nov 10, 2010 at 10:26 PM, Thanh Do <than...@cs.wisc.edu> wrote:
> > > > > >>>>
> > > > > >>>> Hi all,
> > > > > >>>>
> > > > > >>>> After reading appenddesign3.pdf in HDFS-256,
> > > > > >>>> and looking at the BlockReceiver.java code in 0.21.0,
> > > > > >>>> I am confused by the following.
> > > > > >>>>
> > > > > >>>> The document says:
> > > > > >>>> *For each packet, a DataNode in the pipeline has to do 3 things:
> > > > > >>>> 1. Stream data
> > > > > >>>>    a. Receive data from the upstream DataNode or the client
> > > > > >>>>    b. Push the data to the downstream DataNode if there is any
> > > > > >>>> 2. Write the data/crc to its block file/meta file.
> > > > > >>>> 3. Stream ack
> > > > > >>>>    a. Receive an ack from the downstream DataNode if there is any
> > > > > >>>>    b. Send an ack to the upstream DataNode or the client*
> > > > > >>>>
> > > > > >>>> And: *"...there is no guarantee on the order of (2) and (3)."*
> > > > > >>>>
> > > > > >>>> In BlockReceiver.receivePacket(), after reading the packet
> > > > > >>>> buffer, the DataNode does:
> > > > > >>>> 1) put the packet seqno in the ack queue
> > > > > >>>> 2) write the data and checksum to disk
> > > > > >>>> 3) flush the data and checksum (to disk)
> > > > > >>>>
> > > > > >>>> The thing that confuses me is this: the streaming of the
> > > > > >>>> ack does not necessarily depend on whether the data has
> > > > > >>>> been flushed to disk or not.
> > > > > >>>> Then, my question is:
> > > > > >>>> Why does the DataNode need to flush the data and checksum
> > > > > >>>> every time it receives a packet? This flush may be costly.
> > > > > >>>> Why can't the DataNode just batch several writes (after
> > > > > >>>> receiving several packets) and flush them all at once?
> > > > > >>>> Is there any particular reason for doing it this way?
> > > > > >>>>
> > > > > >>>> Can somebody clarify this for me?
> > > > > >>>>
> > > > > >>>> Thanks so much.
> > > > > >>>> Thanh
> > > > > >>>
> > > > > >>
> > > > > >> --
> > > > > >> Todd Lipcon
> > > > > >> Software Engineer, Cloudera
> > > >
> > > > --
> > > > Todd Lipcon
> > > > Software Engineer, Cloudera
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera

--
Todd Lipcon
Software Engineer, Cloudera
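The batching Thanh asks about can be sketched as follows: write every packet as it arrives, but only flush once per batch instead of once per packet. Per Hairong's point above, the trade-off is that unflushed data sits in the DataNode process buffer and is not yet visible to readers, who only read from the system buffer. The names here are hypothetical, not HDFS code.

```java
public class BatchedFlush {

    // Count how many flushes a receiver performs for `packets` packets
    // when it flushes once every `batchSize` packets (and once at the
    // end so no acked data is left unflushed).
    static int flushesFor(int packets, int batchSize) {
        int flushes = 0;
        for (int i = 1; i <= packets; i++) {
            // write data + checksum for packet i (omitted)
            if (i % batchSize == 0 || i == packets) {
                flushes++;
            }
        }
        return flushes;
    }

    public static void main(String[] args) {
        System.out.println(flushesFor(10, 1)); // per-packet flushing: 10
        System.out.println(flushesFor(10, 4)); // batches of 4: 3
    }
}
```

The cost saving is clear, but between flushes the window during which new readers cannot see already-received data grows, which is the visibility trade-off the thread discusses.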