[ https://issues.apache.org/jira/browse/HBASE-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835979#action_12835979 ]
Kannan Muthukkaruppan commented on HBASE-2235:
----------------------------------------------

I managed to get the .META. table inconsistent again in my small test cluster under load. The region server went down due to errors from the HDFS layer, which we are following up on separately (probably just too much compaction and other activity going on at the same time). I know I can run the add_table script to restore its sanity. But we have now managed to get .META. inconsistent often enough that it might make sense to do something about it in the 0.20.x timeframe (either make .META. updates atomic, or have the meta scanner fix broken children; a rough sketch of the latter idea is below).

So, roughly, here is what happened today.

(i) An RS got a lot of org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException errors, followed by:

{code}
2010-02-19 08:49:07,102 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_9144926768183088527_186431 bad datanode[0] nodes == null
2010-02-19 08:49:07,102 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase-kannan1/test1/580635726/actions/1339212970969249937" - Aborting...
2010-02-19 08:49:07,117 FATAL org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog required. Forcing server shutdown
{code}

(ii) During shutdown, there were other errors like:

{code}
2010-02-19 08:51:07,557 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region test1,1761194,1266576717079
java.io.IOException: Filesystem closed
2010-02-19 08:51:07,660 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Shutting down HRegionServer: file system not available
java.io.IOException: File system is not available
...
2010-02-19 08:50:07,321 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Tried to hold up flushing for compactions of region test1,1761194,1266576717079 but have waited longer than 90000ms, continuing
2010-02-19 08:50:07,322 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region test1,1761194,1266576717079, flushing=false, writesEnabled=false
2010-02-19 08:50:07,348 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call put([...@39804c99, [Lorg.apache.hadoop.hbase.client.Put;@1624ee4d) from 10.131.1.186:36796: output error
2010-02-19 08:50:07,354 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call put([...@5f3034b2, [Lorg.apache.hadoop.hbase.client.Put;@55d3c2f0) from 10.131.1.186:36796: output error
2010-02-19 08:50:07,354 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 82 on 60020 caught: java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
{code}

After all this, I restarted the RS. But several regions now seem to be in an odd state in .META.
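To make the "have the meta scanner fix broken children" idea concrete, here is a minimal, detection-only sketch against the 0.20-era client API. It is a sketch under assumptions, not an implementation: the class and method names are made up; only the .META. layout (info:regioninfo, info:splitA, info:splitB) is taken from the scan output below; and a real fixup would live in the master's meta scanner and would re-insert or re-assign the daughters rather than just print them.

{code}
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Writables;

// Hypothetical diagnostic: scan .META. for offline split parents whose
// daughter regions never got proper rows of their own.
public class FindBrokenSplitParents {
  static final byte[] INFO = Bytes.toBytes("info");

  public static void main(String[] args) throws Exception {
    HTable meta = new HTable(new HBaseConfiguration(), ".META.");
    Scan scan = new Scan();
    scan.addFamily(INFO);
    ResultScanner scanner = meta.getScanner(scan);
    for (Result r : scanner) {
      byte[] b = r.getValue(INFO, Bytes.toBytes("regioninfo"));
      if (b == null) continue;
      HRegionInfo parent = Writables.getHRegionInfo(b);
      if (!parent.isOffline() || !parent.isSplit()) continue;
      // Parent is offline and split: both daughters should have live rows.
      checkDaughter(meta, r, "splitA");
      checkDaughter(meta, r, "splitB");
    }
    scanner.close();
  }

  static void checkDaughter(HTable meta, Result parentRow, String col)
      throws Exception {
    byte[] b = parentRow.getValue(INFO, Bytes.toBytes(col));
    if (b == null) {
      System.out.println(Bytes.toString(parentRow.getRow()) + ": no " + col);
      return;
    }
    HRegionInfo daughter = Writables.getHRegionInfo(b);
    // Does the daughter have its own regioninfo row in .META.?
    Result dr = meta.get(new Get(daughter.getRegionName()));
    if (dr.isEmpty() || dr.getValue(INFO, Bytes.toBytes("regioninfo")) == null) {
      System.out.println("Broken child: parent "
          + Bytes.toString(parentRow.getRow()) + " references "
          + daughter.getRegionNameAsString() + " which has no .META. row");
    }
  }
}
{code}

The actual repair step (re-inserting daughter rows, much as the add_table script does for a whole table) is the part that needs care around atomicity. The entries below are what such a pass would have to untangle.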
For example, for a particular startkey, I see all these entries:

{code}
test1,1204765,1266569946560  column=info:regioninfo, timestamp=1266581302018, value=REGION => {NAME => 'test1,1204765,1266569946560', STARTKEY => '1204765', ENDKEY => '1441091', ENCODED => 1819368969, OFFLINE => true, SPLIT => true, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'actions', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
test1,1204765,1266569946560  column=info:server, timestamp=1266570029133, value=10.129.68.212:60020
test1,1204765,1266569946560  column=info:serverstartcode, timestamp=1266570029133, value=1266562597546
test1,1204765,1266569946560  column=info:splitB, timestamp=1266581302018, value=\x00\x071441091\x00\x00\x00\x01\x26\xE6\x1F\xDF\x27\x1Btest1,1290703,1266581233447\x00\x071290703\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00\x00\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04trueh\x0FQ\xCF
test1,1204765,1266581233447  column=info:regioninfo, timestamp=1266609172177, value=REGION => {NAME => 'test1,1204765,1266581233447', STARTKEY => '1204765', ENDKEY => '1290703', ENCODED => 1373493090, OFFLINE => true, SPLIT => true, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'actions', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
test1,1204765,1266581233447  column=info:server, timestamp=1266604768670, value=10.129.68.213:60020
test1,1204765,1266581233447  column=info:serverstartcode, timestamp=1266604768670, value=1266562597511
test1,1204765,1266581233447  column=info:splitA, timestamp=1266609172177, value=\x00\x071226169\x00\x00\x00\x01\x26\xE7\xCA,\x7D\x1Btest1,1204765,1266609171581\x00\x071204765\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00\x00\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\xB9\xBD\xFEO
test1,1204765,1266581233447  column=info:splitB, timestamp=1266609172177, value=\x00\x071290703\x00\x00\x00\x01\x26\xE7\xCA,\x7D\x1Btest1,1226169,1266609171581\x00\x071226169\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00\x00\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\xE1\xDF\xF8p
test1,1204765,1266609171581  column=info:regioninfo, timestamp=1266609172212, value=REGION => {NAME => 'test1,1204765,1266609171581', STARTKEY => '1204765', ENDKEY => '1226169', ENCODED => 2134878372, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'actions', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
{code}

> Mechanism that would not have -ROOT- and .META. on same server caused failed assign of .META.
> ---------------------------------------------------------------------------------------------
>
>                  Key: HBASE-2235
>                  URL: https://issues.apache.org/jira/browse/HBASE-2235
>              Project: Hadoop HBase
>           Issue Type: Bug
>             Reporter: stack
>              Fix For: 0.20.4, 0.21.0
>
>
> Here is the short story:
> Scenario is a cluster of 3 servers. Server 1 crashed. It was carrying .META. We split the logs. .META. is put on the head of the assignment queue. Server 2 happens to be in a state where it wants to report a split. The master fails the report because there is no .META. (it fails ugly, with an NPE). Server 3 checks in and falls into the assignment code (RegionManager#regionsAwaitingAssignment). In here we have this bit of code around line #412:
> {code}
> if (reassigningMetas && isMetaOrRoot && !isSingleServer) {
>   return regionsToAssign; // dont assign anything to this server.
> }
> {code}
> Because we think this is not a single-server cluster -- we think there are two 'live' nodes -- we won't assign meta.
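One possible direction, sketched with heavy caveats: the guard could refuse to defer catalog assignment unless some other server has actually reported in recently. In the fragment below, only reassigningMetas, isMetaOrRoot, isSingleServer, and regionsToAssign come from the snippet above; countOfRecentlyReportingServers() is a hypothetical helper that does not exist in RegionManager, and none of this is a committed fix.

{code}
if (reassigningMetas && isMetaOrRoot && !isSingleServer) {
  // Hypothetical helper: count only servers that have heartbeated inside the
  // lease window, so a dead-but-not-yet-expired server cannot make the master
  // defer .META. assignment indefinitely.
  if (countOfRecentlyReportingServers() > 1) {
    // Some other live server can take -ROOT-/.META.; skip this one.
    return regionsToAssign;
  }
  // Otherwise this is effectively the only live server: fall through and let
  // it take the catalog regions instead of wedging assignment.
}
{code}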