We just found ourselves in an interesting pickle.

We were upgrading one of our clusters from HBase 0.94.0 on Hadoop 1.0.4 to 
HBase 0.94.4 on top of Hadoop 2.
The cluster had been set up a while ago, and the old shutdown script had a bug 
that shut down HBase and HDFS uncleanly.

Assuming that the logs would be replayed, we upgraded Hadoop to 2.0.x and 
verified that, from a file system view, everything was OK.
The new HDFS runs with an HA NameNode, so the FS changed from hdfs://<old host 
name> to hdfs://<ha cluster name>.


Then we brought up HBase and found it stuck in splitting logs forever.
In the log we see messages like these:
2013-02-05 06:22:31,045 ERROR org.apache.hadoop.hbase.regionserver.SplitLogWorker: unexpected error
java.lang.IllegalArgumentException: Wrong FS: hdfs://<old NN host>/.logs/<rs host>,60020,1358540589323-splitting/<rs host>%2C60020%2C1358540589323.1359962644861, expected: hdfs://<ha cluster name>
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:547)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:169)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:783)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:264)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:195)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:163)
        at java.lang.Thread.run(Thread.java:662)

So it looks like distributed log splitting stores the full HDFS path name 
including the host, which seems unnecessary.
This path is stored in ZK.
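
This is easy to see in ZK directly. A minimal sketch, assuming the default 
zookeeper.znode.parent of /hbase and that the split tasks sit under a 
"splitlog" child znode (as they appear to in 0.94):

    ./zkCli.sh -server <zk quorum host>:2181
    ls /hbase/splitlog

The task znode names look like (URL-encoded) fully qualified WAL paths, which 
is why the old hdfs://<old NN host> authority survives the move to the HA 
NameNode.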

So, all in all, it seems this can only happen if all of the following are true: 
unclean shutdown, keeping the same ZK ensemble, and a changed FS.


The data is not important; we could just blow it away, but we want to prove 
that we could recover the data if we had to.
It seems we have three options:

1. Blow away the data in ZK under "splitlog" and restart HBase. It should 
restart the split process with the correct pathnames (see the zkCli sketch 
after this list).

2. Temporarily change the config for the region server to set the root dir to 
hdfs://<old NN host>, bounce HBase. The log splitting should now be able to 
succeed.
3. Downgrade back to the old Hadoop (we kept a copy of the image).
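
For #1, something along these lines should do it (just a sketch, assuming the 
default /hbase parent znode and a zkCli that supports rmr; HBase stopped first):

    ./zkCli.sh -server <zk quorum host>:2181
    rmr /hbase/splitlog

On the next startup the master should re-scan the *-splitting directories under 
the new hdfs://<ha cluster name> root and queue fresh split tasks with the 
correct pathnames.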

We're trying option #2 to see whether that fixes it; #1 should work too.
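
For reference, the temporary change for #2 is just hbase.rootdir in 
hbase-site.xml on the region servers, something like this (placeholders as 
above; keep whatever path the old rootdir used):

    <property>
      <name>hbase.rootdir</name>
      <value>hdfs://<old NN host>/</value>
    </property>

Once the splitting finishes we'd point it back at hdfs://<ha cluster name> and 
bounce HBase again.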


Has anybody else experienced this?
It seems this would also limit our ability to take a snapshot of a filesystem 
and move it somewhere else, since the hostnames are hardcoded, at least in ZK 
for log splitting.


-- Lars
