[
https://issues.apache.org/jira/browse/HBASE-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423535#comment-13423535
]
Elliott Clark commented on HBASE-6461:
--------------------------------------
bq. Any chance to repeat this with distributed log splitting disabled?
Sure, gimme a little bit; I'm still reading the logs from my latest run. Once JD
and I are done (assuming we don't solve it) I'll try with distributed log
splitting off.
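For reference, here's the knob I'd flip to turn it off (assuming the 0.94-era
property name is still hbase.master.distributed.log.splitting; the master needs
a restart to pick it up):
{noformat}
<!-- hbase-site.xml on the master: fall back to master-local log splitting.
     Property name assumed from the 0.94 codebase; defaults to true. -->
<property>
  <name>hbase.master.distributed.log.splitting</name>
  <value>false</value>
</property>
{noformat}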
On the latest run JD noticed this in the logs of the server that was doing the
splitting (there was only one hlog; nothing had rolled yet). 10.4.3.44 is the
server that had all of its java processes killed.
{noformat}
453-2012-07-26 22:13:18,079 WARN
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: File
hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132
might be still open, length is 0
454-2012-07-26 22:13:18,081 INFO org.apache.hadoop.hbase.util.FSHDFSUtils:
Recovering file
hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132
455-2012-07-26 22:13:19,083 INFO org.apache.hadoop.hbase.util.FSHDFSUtils:
Finished lease recover attempt for
hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132
456-2012-07-26 22:13:20,091 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: /10.4.3.44:9910. Already tried 0 time(s).
457-2012-07-26 22:13:21,092 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: /10.4.3.44:9910. Already tried 1 time(s).
458-2012-07-26 22:13:22,093 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: /10.4.3.44:9910. Already tried 2 time(s).
459-2012-07-26 22:13:23,093 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: /10.4.3.44:9910. Already tried 3 time(s).
460-2012-07-26 22:13:24,094 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: /10.4.3.44:9910. Already tried 4 time(s).
461-2012-07-26 22:13:25,095 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: /10.4.3.44:9910. Already tried 5 time(s).
462-2012-07-26 22:13:26,096 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: /10.4.3.44:9910. Already tried 6 time(s).
463-2012-07-26 22:13:27,096 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: /10.4.3.44:9910. Already tried 7 time(s).
464-2012-07-26 22:13:27,936 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=50.28 MB,
free=5.94 GB, max=5.99 GB, blocks=0, accesses=0, hits=0,
hitRatio=0cachingAccesses=0, cachingHits=0, cachingHitsRatio=0evictions=0,
evicted=0, evictedPerRun=NaN
465-2012-07-26 22:13:28,097 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: /10.4.3.44:9910. Already tried 8 time(s).
466-2012-07-26 22:13:29,098 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: /10.4.3.44:9910. Already tried 9 time(s).
467:2012-07-26 22:13:29,100 WARN
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Could not open
hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132
for reading. File is empty
468-java.io.EOFException
469- at java.io.DataInputStream.readFully(DataInputStream.java:180)
470- at java.io.DataInputStream.readFully(DataInputStream.java:152)
471- at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1521)
472- at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1493)
473- at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1480)
474- at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
475- at
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
476- at
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
477- at
org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:716)
478- at
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:821)
479- at
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:734)
480- at
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:381)
481- at
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:348)
482- at
org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
483- at
org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:264)
484- at
org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:195)
485- at
org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:163)
486- at java.lang.Thread.run(Thread.java:662)
{noformat}
So the log splitter keeps retrying the dead server over and over, then fails to
read anything from the hlog, so -ROOT- ends up empty.
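For anyone following along, here's a minimal sketch (not HBase code, just
Hadoop's SequenceFile against a local FS) of why an unrecovered, zero-length
hlog produces exactly that EOFException: the SequenceFile.Reader constructor
tries to readFully() the version header during init(), and on an empty file it
hits EOF immediately, which is the "File is empty" path HLogSplitter logs above.
{noformat}
import java.io.EOFException;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class EmptyWalReadDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    // A zero-length file, like the hlog whose length never made it out of the DN.
    Path empty = new Path("/tmp/empty-wal-demo");
    fs.create(empty).close();
    try {
      // Same constructor the stack trace goes through via SequenceFileLogReader.
      new SequenceFile.Reader(fs, empty, conf);
    } catch (EOFException e) {
      // readFully() on the missing version header fails right away.
      System.out.println("EOFException reading empty file: " + e);
    }
  }
}
{noformat}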
> Killing the HRegionServer and DataNode hosting ROOT can result in a malformed
> root table.
> -----------------------------------------------------------------------------------------
>
> Key: HBASE-6461
> URL: https://issues.apache.org/jira/browse/HBASE-6461
> Project: HBase
> Issue Type: Bug
> Environment: hadoop-0.20.2-cdh3u3
> HBase 0.94.1 RC1
> Reporter: Elliott Clark
> Priority: Critical
> Fix For: 0.94.2
>
>
> Spun up a new dfs on hadoop-0.20.2-cdh3u3
> Started HBase.
> Started running the loadtest tool.
> Killed the RS and DN holding -ROOT- with killall -9 java on server sv4r27s44 at
> about 2012-07-25 22:40:00.
> After things stabilized, -ROOT- was in a bad state. Ran hbck and got:
> Exception in thread "main"
> org.apache.hadoop.hbase.client.NoServerForRegionException: No server address
> listed in -ROOT- for region .META.,,1.1028785192 containing row
> at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1016)
> at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:841)
> at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:810)
> at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:232)
> at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:172)
> at org.apache.hadoop.hbase.util.HBaseFsck.connect(HBaseFsck.java:241)
> at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:3236)
> hbase(main):001:0> scan '-ROOT-'
> ROW COLUMN+CELL
>
>
> 12/07/25 22:43:18 INFO security.UserGroupInformation: JAAS Configuration
> already set up for Hadoop, not re-installing.
> .META.,,1 column=info:regioninfo,
> timestamp=1343255838525, value={NAME => '.META.,,1', STARTKEY => '', ENDKEY
> => '', ENCODED => 1028785192,}
> .META.,,1 column=info:v,
> timestamp=1343255838525, value=\x00\x00
>
> 1 row(s) in 0.5930 seconds
> Here's the master log: https://gist.github.com/3179194
> I tried the same thing with 0.92.1 and I was able to get into a similar
> situation, so I don't think this is anything new.