2010/5/17 Tatsuya Kawano <tatsuya6...@gmail.com>:
>
> Hi,
>
> On 05/17/2010, at 11:50 PM, Todd Lipcon wrote:
>
>> 2010/5/16 Tatsuya Kawano <tatsuya6...@gmail.com>
>>
>>> 2. On Hadoop trunk, I'd prefer not to hflush() every single put, but rely
>>> on un-flushed replicas on HDFS nodes, so I can avoid the performance
>>> penalty. Will this still be durable? Will HMaster see un-flushed appends
>>> right after a region server failure?
>>>
>>
>> If you don't call hflush(), you can still lose edits up to the last block
>> boundary, since hflush is required to persist block locations to the
>> namenode.
>>
>> hflush() does *not* sync to disk - it just makes sure that the edits are in
>> memory on all of the replicas.
>>
>> I have some patches staged for CDH3 that will also make the performance of
>> this quite competitive by pipelining hflushes - basically it has little to
>> no effect on throughput, only a few ms penalty on each write.
>
> Thanks Todd. I thought hflush() does sync to disk, and I was wrong. It seems
> the stuff you put on CDH3 is just what I wanted!
>
> Is your stuff already in the current CDH3 beta?
>
>
> On 05/17/2010, at 2:22 PM, Ryan Rawson wrote:
>
>> 2010/5/16 Tatsuya Kawano <tatsuya6...@gmail.com>:
>>> 1. On Hadoop 0.20.x (without the HDFS-200 patch), I must close the HLog
>>> to make its entries durable, right? While rolling the HLog does this,
>>> what about a region server failure?
>>
>> The problem is: during a failure, how do you execute user code? If
>> the JVM segfaults hard, we have no opportunity to execute Java code.
>
> Thanks Ryan. That's right. And the OS can crash from a hardware failure
> (memory, CPU), and the network can be disconnected at any time. In those
> cases, we don't have any opportunity to execute Java code.
>
> Is there anything the data node can do after detecting a client timeout?
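Todd's distinction above (hflush() makes edits visible in replica memory but does *not* force them to disk) has a rough local-filesystem analogue in plain java.io, sketched below purely as an illustration. The actual HDFS calls are FSDataOutputStream.hflush()/hsync() on trunk (sync() on 0.20); the edit string and file here are made up:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class FlushVsSync {
    // Writes one fake WAL edit, flushes, then syncs; returns bytes on disk.
    static long writeAndSync(File f) throws IOException {
        FileOutputStream out = new FileOutputStream(f);
        try {
            out.write("put row1/cf:col/ts=1 value=x\n".getBytes("UTF-8"));

            // Roughly analogous to hflush(): the data leaves our buffer and
            // becomes visible to other readers, but may still live only in
            // memory (here the OS page cache; in HDFS, the datanodes' RAM).
            out.flush();

            // Roughly analogous to hsync() on trunk: force the bytes onto the
            // physical disk, so they survive a machine-wide crash.
            out.getFD().sync();
        } finally {
            out.close();
        }
        return f.length();
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("hlog-analogy", ".log");
        f.deleteOnExit();
        System.out.println("bytes durable on disk: " + writeAndSync(f));
    }
}
```

The analogy is loose - in HDFS the crash modes differ (losing all datanode replicas at once vs. losing one machine's page cache) - but the flush-vs-sync layering is the same idea.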
Things aren't quite that simple. The core bug that HDFS-200/265 attempts to
fix is the discarding of "blocks under construction". Right now in stock
HDFS, if a file isn't closed, the half-completed blocks are just tossed.
Retaining and managing them is part of what HDFS-200/265 is about. Besides
just managing under-construction blocks, there is a client-side flush which
pushes data to the datanodes and waits for ACKs before returning. There is
also other client code which, when reading from a half-written file, figures
out what to do - normally in a completed file all 3 replicas are the same
length, but in a crash scenario this isn't assured, and you don't want to
just give up; you want to recover as much data as possible.

> And how many edits could I lose? If a log-roll never happens, is it going to
> be up to dfs.block.size (64MB by default)?

Right now? You lose the most recent HLog. By default they are rotated at
64MB, but you can set the size lower to achieve a smaller failure window.
Be warned, though, that a smaller log means more HLog files, which translates
into more memstore flushes and generally puts more strain on the DFS, so for
really high-update (in terms of bytes per unit time) setups this can cause
issues. We have our prod set up like so:

  <property>
    <name>hbase.regionserver.hlog.blocksize</name>
    <value>2097152</value>
  </property>

-ryan
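The trade-off Ryan describes can be put in rough numbers. The sketch below is illustrative only: it assumes the region server keeps up to a fixed number of HLog files before forcing memstore flushes (32 here, standing in for hbase.regionserver.maxlogs, an assumed value), and treats the worst-case loss as roughly one full HLog:

```java
public class HlogSizing {
    // Total WAL data retained before flushes are forced, in MB:
    // roughly maxLogs * per-log size. Smaller logs hit this cap sooner,
    // i.e. more frequent memstore flushes.
    static long walCapMb(long logSizeBytes, int maxLogs) {
        return maxLogs * logSizeBytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        long defaultSize = 64L * 1024 * 1024; // default roll size, 64MB
        long prodSize    = 2L * 1024 * 1024;  // hbase.regionserver.hlog.blocksize = 2097152
        int  maxLogs     = 32;                // assumed stand-in for hbase.regionserver.maxlogs

        // Worst case on a crash: roughly the most recent (unrolled) HLog.
        System.out.println("max loss, default : " + defaultSize / (1024 * 1024) + " MB");
        System.out.println("max loss, 2MB logs: " + prodSize / (1024 * 1024) + " MB");

        System.out.println("WAL cap, default : " + walCapMb(defaultSize, maxLogs) + " MB");
        System.out.println("WAL cap, 2MB logs: " + walCapMb(prodSize, maxLogs) + " MB");
    }
}
```

So dropping the log size from 64MB to 2MB shrinks the failure window by the same factor, but also shrinks the total WAL budget (2048MB down to 64MB under these assumed numbers), which is where the extra flush pressure comes from.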