2010/5/17 Tatsuya Kawano <tatsuya6...@gmail.com>:
>
> Hi,
>
> On 05/17/2010, at 11:50 PM, Todd Lipcon wrote:
>
>> 2010/5/16 Tatsuya Kawano <tatsuya6...@gmail.com>
>>
>>> 2. On Hadoop trunk, I'd prefer not to hflush() every single put, but rely
>>> on un-flushed replicas on HDFS nodes, so I can avoid the performance
>>> penalty. Will this still be durable? Will HMaster see un-flushed appends
>>> right after a region server failure?
>>>
>>
>> If you don't call hflush(), you can still lose edits up to the last block
>> boundary, since hflush is required to persist block locations to the
>> namenode.
>>
>> hflush() does *not* sync to disk - it just makes sure that the edits are in
>> memory on all of the replicas.
>>
>> I have some patches staged for CDH3 that will also make the performance of
>> this quite competitive by pipelining hflushes - basically it has little to
>> no effect on throughput, only a few ms penalty on each write.
>
> Thanks Todd. I thought hflush() does sync to disk, and I was wrong. It seems
> the stuff you put on CDH3 is just what I wanted!
>
> Is your stuff already in the current CDH3 beta?
>
>
> On 05/17/2010, at 2:22 PM, Ryan Rawson wrote:
>
>> 2010/5/16 Tatsuya Kawano <tatsuya6...@gmail.com>:
>>> 1. On Hadoop 0.20.x (without the HDFS-200 patch), I must close the HLog
>>> to make its entries durable, right? While rolling the HLog does this,
>>> what about a region server failure?
>>
>> The problem is: during a failure, how do you execute user code? If
>> the JVM segfaults hard, we have no opportunity to execute Java code.
>
> Thanks Ryan. That's right. And the OS can crash from a hardware failure
> (memory, CPU), and the network can be disconnected at any time. In those
> cases, we don't have any opportunity to execute Java code.
>
> Is there anything the data node can do after detecting a client timeout?
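Todd's distinction above (hflush() makes edits visible in replica memory but does *not* force them to disk) has a rough local-filesystem analogue in plain java.io, sketched below purely as an illustration. The actual HDFS calls are FSDataOutputStream.hflush()/hsync() on trunk (sync() on 0.20); the edit string and file here are made up:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class FlushVsSync {
    // Writes one fake WAL edit, flushes, then syncs; returns bytes on disk.
    static long writeAndSync(File f) throws IOException {
        FileOutputStream out = new FileOutputStream(f);
        try {
            out.write("put row1/cf:col/ts=1 value=x\n".getBytes("UTF-8"));

            // Roughly analogous to hflush(): the data leaves our buffer and
            // becomes visible to other readers, but may still live only in
            // memory (here the OS page cache; in HDFS, the datanodes' RAM).
            out.flush();

            // Roughly analogous to hsync() on trunk: force the bytes onto the
            // physical disk, so they survive a machine-wide crash.
            out.getFD().sync();
        } finally {
            out.close();
        }
        return f.length();
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("hlog-analogy", ".log");
        f.deleteOnExit();
        System.out.println("bytes durable on disk: " + writeAndSync(f));
    }
}
```

The analogy is loose - in HDFS the crash modes differ (losing all datanode replicas at once vs. losing one machine's page cache) - but the flush-vs-sync layering is the same idea.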
Things aren't quite that simple. The core bug that HDFS-200/265 attempts to
fix is the discarding of "blocks under construction". Right now in stock
HDFS, if a file isn't closed, the half-completed blocks are just tossed.
Retaining and managing them is part of what HDFS-200/265 is about. Besides
just managing under-construction blocks, there is a client-side flush which
pushes data to the datanodes and waits for ACKs before returning. There is
also other client code which, when reading from a half-written file, figures
out what to do - normally in a completed file all 3 replicas are the same
length, but in a crash scenario this isn't assured, and you don't want to
just give up; you want to recover as much data as possible.

> And how many edits could I lose? If a log-roll never happens, is it going to
> be up to dfs.block.size (64MB by default)?

Right now? You lose the most recent HLog. By default they are rotated at
64MB, but you can set the size lower to achieve a smaller failure window.
Be warned, though, that a smaller log means more HLog files, which translates
into more memstore flushes and generally puts more strain on the DFS, so for
really high-update (in terms of bytes per unit time) setups this can cause
issues. We have our prod set up like so:

  <property>
    <name>hbase.regionserver.hlog.blocksize</name>
    <value>2097152</value>
  </property>

-ryan
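The trade-off Ryan describes can be put in rough numbers. The sketch below is illustrative only: it assumes the region server keeps up to a fixed number of HLog files before forcing memstore flushes (32 here, standing in for hbase.regionserver.maxlogs, an assumed value), and treats the worst-case loss as roughly one full HLog:

```java
public class HlogSizing {
    // Total WAL data retained before flushes are forced, in MB:
    // roughly maxLogs * per-log size. Smaller logs hit this cap sooner,
    // i.e. more frequent memstore flushes.
    static long walCapMb(long logSizeBytes, int maxLogs) {
        return maxLogs * logSizeBytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        long defaultSize = 64L * 1024 * 1024; // default roll size, 64MB
        long prodSize    = 2L * 1024 * 1024;  // hbase.regionserver.hlog.blocksize = 2097152
        int  maxLogs     = 32;                // assumed stand-in for hbase.regionserver.maxlogs

        // Worst case on a crash: roughly the most recent (unrolled) HLog.
        System.out.println("max loss, default : " + defaultSize / (1024 * 1024) + " MB");
        System.out.println("max loss, 2MB logs: " + prodSize / (1024 * 1024) + " MB");

        System.out.println("WAL cap, default : " + walCapMb(defaultSize, maxLogs) + " MB");
        System.out.println("WAL cap, 2MB logs: " + walCapMb(prodSize, maxLogs) + " MB");
    }
}
```

So dropping the log size from 64MB to 2MB shrinks the failure window by the same factor, but also shrinks the total WAL budget (2048MB down to 64MB under these assumed numbers), which is where the extra flush pressure comes from.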