Thanks, Ryan! Now I've got the whole picture. I'm going to go back to the
Japanese Hadoop community and tell them what you and Todd have told me
here.
- Tatsuya
On May 18, 2010, at 7:19 AM, Ryan Rawson <ryano...@gmail.com> wrote:
2010/5/17 Tatsuya Kawano <tatsuya6...@gmail.com>:
Hi,
On 05/17/2010, at 11:50 PM, Todd Lipcon wrote:
2010/5/16 Tatsuya Kawano <tatsuya6...@gmail.com>
2. On Hadoop trunk, I'd prefer not to hflush() every single put, but rely
on un-flushed replicas on HDFS nodes, so I can avoid the performance
penalty. Will this still be durable? Will HMaster see un-flushed appends
right after a region server failure?
If you don't call hflush(), you can still lose edits up to the last block
boundary, since hflush is required to persist block locations to the
namenode.

hflush() does *not* sync to disk - it just makes sure that the edits are
in memory on all of the replicas.

I have some patches staged for CDH3 that will also make the performance of
this quite competitive by pipelining hflushes - basically it has little to
no effect on throughput, but only a few ms penalty on each write.
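For reference, a minimal sketch of what that client-side call looks like
against trunk-era HDFS (on 0.20-append the equivalent method is sync()
rather than hflush(); the path and payload here are made up):

// Minimal sketch: push edits to the memory of every replica without an fsync.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path("/tmp/hflush-example"));

    byte[] edit = "some WAL edit".getBytes("UTF-8");
    out.write(edit);

    // Sends the buffered bytes down the pipeline and waits for ACKs, so the
    // edit is in memory on every replica -- it is NOT a sync to disk.
    out.hflush();

    out.close();
  }
}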
Thanks Todd. I thought hflush() synced to disk, so I was wrong.
It seems the stuff you put in CDH3 is just what I wanted!
Is your stuff already in the current CDH3 beta?
On 05/17/2010, at 2:22 PM, Ryan Rawson wrote:
2010/5/16 Tatsuya Kawano <tatsuya6...@gmail.com>:
1. On Hadoop 0.20.x (without the HDFS-200 patch), I must close the HLog to
make its entries durable, right? While rolling the HLog does this, what
about a region server failure?
The problem is: during a failure, how do you execute user code? If the JVM
segfaults hard, we have no opportunity to execute Java code.
Thanks Ryan. That's right. And the OS can crash from a hardware failure
(memory, CPU), and the network can be disconnected at any time. In those
cases, we don't have any opportunity to execute Java code. Is there
anything the data node can do after detecting a client timeout?
Things aren't quite that simple. The core bug that HDFS-200/265 attempts
to fix is the discarding of "blocks under construction". Right now in
stock HDFS, if a file isn't closed, the half-completed blocks are just
tossed. Retaining and managing them is much of what HDFS-200/265 is about.

Besides just managing under-construction blocks, there is a client-side
flush which pushes data to the server and waits for ACKs before returning.
There is also other client code which, when reading from a half-written
file, figures out what to do - normally in a completed file all 3 replicas
are the same length, but in a crash scenario this isn't assured, and you
don't want to just give up; you want to recover as much data as possible.
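To make that last point concrete, here is a purely illustrative sketch
(not the actual HDFS recovery code) of the length-reconciliation idea:
after a crash, the surviving replicas of a half-written block can report
different lengths, and a conservative reader would trust only the bytes
every replica has, rather than giving up on the block entirely.

// Illustrative only -- not HDFS's real recovery protocol. Given the byte
// counts reported by surviving replicas of a half-written block, pick a
// conservative length that every replica can serve.
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ReplicaLengths {
  static long safeReadableLength(List<Long> reportedLengths) {
    if (reportedLengths.isEmpty()) {
      return 0L; // no replica survived; this block's data is gone
    }
    // Every replica has at least this many bytes, so a reader can recover
    // this much instead of discarding the whole block.
    return Collections.min(reportedLengths);
  }

  public static void main(String[] args) {
    // After a crash the three replicas may disagree, e.g. 4 MB, 4 MB, 3.5 MB.
    System.out.println(safeReadableLength(Arrays.asList(4194304L, 4194304L, 3670016L)));
  }
}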
And how many edits could I lose? If a log roll never happens, is it going
to be up to dfs.block.size (64MB by default)?
Right now? You lose the most recent HLog. By default they are rotated at
64MB, but you can set it lower to achieve a smaller failure window. Also
be warned that a smaller log means more HLog files, which translates into
more memstore flushes and generally puts more strain on the DFS, so for
really high-update (in terms of bytes per unit of time) setups this can
cause issues.
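As a back-of-the-envelope illustration (the 1 MB/s WAL write rate is an
assumption, not a measured number), the worst-case window is roughly the
HLog roll size divided by the rate at which the region server writes WAL
data:

// Rough worst-case window of edits at risk in the most recent HLog,
// assuming a made-up WAL write rate of 1 MB/s.
public class HlogWindow {
  public static void main(String[] args) {
    double walBytesPerSec = 1.0 * 1024 * 1024;  // assumed write rate
    double defaultLog = 64.0 * 1024 * 1024;     // default 64MB roll size
    double smallLog = 2.0 * 1024 * 1024;        // the 2097152 value below

    System.out.printf("64MB log: up to ~%.0f s of edits at risk%n", defaultLog / walBytesPerSec);
    System.out.printf("2MB log:  up to ~%.0f s of edits at risk%n", smallLog / walBytesPerSec);
  }
}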
We have our prod set up like so:
<property>
  <name>hbase.regionserver.hlog.blocksize</name>
  <value>2097152</value>
</property>
-ryan