Thanks, Ryan! Now I've got the whole picture. I'm going to go back to the
Japanese Hadoop community and tell them what you and Todd have told me
here.
- Tatsuya
On May 18, 2010, at 7:19 AM, Ryan Rawson <ryano...@gmail.com> wrote:
2010/5/17 Tatsuya Kawano <tatsuya6...@gmail.com>:
Hi,
On 05/17/2010, at 11:50 PM, Todd Lipcon wrote:
2010/5/16 Tatsuya Kawano <tatsuya6...@gmail.com>
2. On Hadoop trunk, I'd prefer not to hflush() every single put, but rely
on un-flushed replicas on HDFS nodes, so I can avoid the performance
penalty. Will this still be durable? Will HMaster see un-flushed appends
right after a region server failure?
If you don't call hflush(), you can still lose edits up to the last block
boundary, since hflush is required to persist block locations to the
namenode.

hflush() does *not* sync to disk - it just makes sure that the edits are
in memory on all of the replicas.

I have some patches staged for CDH3 that will also make the performance of
this quite competitive by pipelining hflushes - basically it has little to
no effect on throughput, but only a few ms penalty on each write.
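For reference, a minimal sketch of what that client-side call looks like
against trunk-era HDFS (on 0.20-append the equivalent method is sync()
rather than hflush(); the path and payload here are made up):

// Minimal sketch: push edits to the memory of every replica without an fsync.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path("/tmp/hflush-example"));

    byte[] edit = "some WAL edit".getBytes("UTF-8");
    out.write(edit);

    // Sends the buffered bytes down the pipeline and waits for ACKs, so the
    // edit is in memory on every replica -- it is NOT a sync to disk.
    out.hflush();

    out.close();
  }
}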
Thanks Todd. I thought hflush() synced to disk, so I was wrong.
It seems the stuff you put in CDH3 is just what I wanted!
Is your stuff already in the current CDH3 beta?
On 05/17/2010, at 2:22 PM, Ryan Rawson wrote:
2010/5/16 Tatsuya Kawano <tatsuya6...@gmail.com>:
1. On Hadoop 0.20.x (without the HDFS-200 patch), I must close the HLog to
make its entries durable, right? While rolling the HLog does this, what
about a region server failure?
The problem is: during a failure, how do you execute user code? If the JVM
segfaults hard, we have no opportunity to execute Java code.
Thanks Ryan. That's right. And the OS can crash from a hardware failure
(memory, CPU), and the network can be disconnected at any time. In those
cases, we don't have any opportunity to execute Java code. Is there
anything the data node can do after detecting a client timeout?
Things aren't quite that simple. The core bug that HDFS-200/265 attempts
to fix is the discarding of "blocks under construction". Right now in
stock HDFS, if a file isn't closed, the half-completed blocks are just
tossed. Retaining and managing them is much of what HDFS-200/265 is about.

Besides just managing under-construction blocks, there is a client-side
flush which pushes data to the server and waits for ACKs before returning.
There is also other client code which, when reading from a half-written
file, figures out what to do - normally in a completed file all 3 replicas
are the same length, but in a crash scenario this isn't assured, and you
don't want to just give up; you want to recover as much data as possible.
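To make that last point concrete, here is a purely illustrative sketch
(not the actual HDFS recovery code) of the length-reconciliation idea:
after a crash, the surviving replicas of a half-written block can report
different lengths, and a conservative reader would trust only the bytes
every replica has, rather than giving up on the block entirely.

// Illustrative only -- not HDFS's real recovery protocol. Given the byte
// counts reported by surviving replicas of a half-written block, pick a
// conservative length that every replica can serve.
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ReplicaLengths {
  static long safeReadableLength(List<Long> reportedLengths) {
    if (reportedLengths.isEmpty()) {
      return 0L; // no replica survived; this block's data is gone
    }
    // Every replica has at least this many bytes, so a reader can recover
    // this much instead of discarding the whole block.
    return Collections.min(reportedLengths);
  }

  public static void main(String[] args) {
    // After a crash the three replicas may disagree, e.g. 4 MB, 4 MB, 3.5 MB.
    System.out.println(safeReadableLength(Arrays.asList(4194304L, 4194304L, 3670016L)));
  }
}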
And how many edits could I lose? If a log roll never happens, is it going
to be up to dfs.block.size (64MB by default)?
Right now? You lose the most recent HLog. By default they are rotated at
64MB, but you can set it lower to achieve a smaller failure window. Also
be warned that a smaller log means more HLog files, which translates into
more memstore flushes and generally puts more strain on the DFS, so for
really high-update (in terms of bytes per unit of time) setups this can
cause issues.
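As a back-of-the-envelope illustration (the 1 MB/s WAL write rate is an
assumption, not a measured number), the worst-case window is roughly the
HLog roll size divided by the rate at which the region server writes WAL
data:

// Rough worst-case window of edits at risk in the most recent HLog,
// assuming a made-up WAL write rate of 1 MB/s.
public class HlogWindow {
  public static void main(String[] args) {
    double walBytesPerSec = 1.0 * 1024 * 1024;  // assumed write rate
    double defaultLog = 64.0 * 1024 * 1024;     // default 64MB roll size
    double smallLog = 2.0 * 1024 * 1024;        // the 2097152 value below

    System.out.printf("64MB log: up to ~%.0f s of edits at risk%n", defaultLog / walBytesPerSec);
    System.out.printf("2MB log:  up to ~%.0f s of edits at risk%n", smallLog / walBytesPerSec);
  }
}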
We have our prod set up like so:
<property>
  <name>hbase.regionserver.hlog.blocksize</name>
  <value>2097152</value>
</property>
-ryan