[ https://issues.apache.org/jira/browse/HBASE-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576469#action_12576469 ]

Bryan Duxbury commented on HBASE-497:
-------------------------------------

At least in 0.1, where we don't have HDFS append support, there's no real way to 
recover the lost log data once an append to the log fails. This is because HDFS 
files don't exist until they're closed (see HADOOP-1700).
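Just to make the failure mode concrete, here's a rough sketch (made-up path and 
variable names, not code from HBase; fs, conf and logEntryBytes assumed in scope) 
of how a write failure in pre-append HDFS costs us everything since the last log 
roll:

{code}
// Hypothetical illustration only: in pre-append HDFS, a file's contents only
// become readable once close() succeeds, so anything written since the file
// was created (i.e. since the last log roll) is gone if the stream dies.
FileSystem fs = FileSystem.get(conf);
FSDataOutputStream out = fs.create(new Path("/hbase/log/hlog.dat.000"));
out.write(logEntryBytes);   // buffered/in flight; not yet visible in HDFS
// ... the datanode serving this block dies here ...
out.close();                // throws IOException; the edits above cannot be recovered
{code}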

Our options are:
 * Bail the regionserver. There's been an exception we shouldn't really ever 
get, and it's bad. Let it get worked out by restarting.
 * Bail the regionserver, but try to flush the caches first. This has the 
advantage of saving the data already written to the caches, if possible, though 
it might take a convoluted flow to make it happen.
 * Open a new log like nothing ever happened. We'll have lost the updates since 
the last log roll, but who cares, since there's nothing we can do to recover 
it, period.
 * Change logging to log to a local file as well as the HDFS file. Then, if 
there's an exception at any point writing to the HDFS log, we can copy the 
local version of the log up to HDFS and keep appending. This gives us some 
resilience to datanode failures, but doesn't really make our logs any more 
useful in the case of dying machines or network partitions. It's also a lot of 
new functionality, which doesn't exactly fit with the goals of 0.1 (bugfixes 
only). 

Of these options, I think the best one is to just open a new log. It keeps the 
regionserver online and lets us carry on with a minimum of difficulty. Does this 
seem like enough of a fix to satisfy the 0.1 release blocker?
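
For what it's worth, this is roughly the shape of the fix I'm picturing (method 
and field names are hypothetical, not a patch against HLog): catch the 
IOException on append, close the dead writer as best we can, open a fresh log 
file, and retry the edit there.

{code}
// Rough sketch only -- hypothetical names, not the real HLog code.
private void append(HLogKey key, HLogEdit edit) throws IOException {
  try {
    this.writer.append(key, edit);       // normal path: SequenceFile.Writer.append
  } catch (IOException e) {
    LOG.warn("Log append failed; edits since the last roll are lost", e);
    try {
      this.writer.close();               // best effort; the stream is probably already dead
    } catch (IOException ignored) {
      // nothing useful we can do with a broken stream
    }
    openNewWriter();                     // hypothetical helper: create a fresh log file in HDFS
    this.writer.append(key, edit);       // retry the current edit against the new log
  }
}
{code}

If the retry fails too we're no worse off than today, and bailing the 
regionserver is probably the only option left at that point anyway.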

> RegionServer needs to recover if datanode goes down
> ---------------------------------------------------
>
>                 Key: HBASE-497
>                 URL: https://issues.apache.org/jira/browse/HBASE-497
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.16.0
>            Reporter: Michael Bieniosek
>            Priority: Blocker
>             Fix For: 0.1.0, 0.2.0
>
>
> If I take down a datanode, the regionserver will repeatedly return this error:
> java.io.IOException: Stream closed.
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.isClosed(DFSClient.java:1875)
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:2096)
>         at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:141)
>         at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:124)
>         at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:112)
>         at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:86)
>         at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:41)
>         at java.io.DataOutputStream.write(Unknown Source)
>         at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:977)
>         at org.apache.hadoop.hbase.HLog.append(HLog.java:377)
>         at org.apache.hadoop.hbase.HRegion.update(HRegion.java:1455)
>         at org.apache.hadoop.hbase.HRegion.batchUpdate(HRegion.java:1259)
>         at org.apache.hadoop.hbase.HRegionServer.batchUpdate(HRegionServer.java:1433)
>         at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>         at java.lang.reflect.Method.invoke(Unknown Source)
>         at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:413)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:910)
> It appears that hbase/dfsclient does not attempt to reopen the stream.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
