[ https://issues.apache.org/jira/browse/HBASE-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835979#action_12835979 ]

Kannan Muthukkaruppan commented on HBASE-2235:
----------------------------------------------

I managed to get the .META. table inconsistent again in my small test cluster 
under load. The region server went down due to some errors from the HDFS 
layer... which we are following up on separately (probably just too much 
compaction and other activity going on at the same time).

I know I can run the add_table script to restore its sanity. But we have now 
managed to get .META. inconsistent often enough that it might make sense to do 
something about it in the 0.20.x timeframe (either make .META. updates atomic, 
or have the meta scanner fix broken children).
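To make the "fix broken children" idea concrete: a split records the offlined parent row plus references to its two daughters, and if the RS dies part-way through, a daughter row can be missing. Below is a minimal sketch of a scanner-side detection pass over a toy in-memory model of .META. -- all class and field names here are hypothetical, not the real HBase APIs:

```java
import java.util.*;

// Hypothetical sketch (not HBase code): detect a parent that is marked SPLIT
// but whose daughter region rows never made it into .META.
public class MetaSplitCheck {
    // Minimal stand-in for one .META. row: region name plus split bookkeeping.
    static class MetaRow {
        final String name;
        boolean offline, split;
        String splitA, splitB;          // daughter region names, if recorded
        MetaRow(String name) { this.name = name; }
    }

    // A parent is "broken" if it is marked SPLIT but a referenced daughter
    // has no row of its own in .META.
    static List<String> missingDaughters(Map<String, MetaRow> meta) {
        List<String> missing = new ArrayList<>();
        for (MetaRow row : meta.values()) {
            if (!row.split) continue;
            for (String d : new String[] { row.splitA, row.splitB }) {
                if (d != null && !meta.containsKey(d)) missing.add(d);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, MetaRow> meta = new HashMap<>();
        MetaRow parent = new MetaRow("test1,1204765,1266569946560");
        parent.offline = true;
        parent.split = true;
        parent.splitA = "test1,1204765,1266581233447";
        parent.splitB = "test1,1290703,1266581233447";
        meta.put(parent.name, parent);
        // Only one daughter row was written before the RS died.
        meta.put(parent.splitA, new MetaRow(parent.splitA));
        System.out.println(missingDaughters(meta)); // prints the splitB daughter
    }
}
```

A repair pass could then re-insert the missing daughter from the serialized HRegionInfo in the parent's splitA/splitB cells; the atomic-update alternative would make the three writes all-or-nothing so this state never arises.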

So, roughly, here is what happened today.

(i) A RS got a lot of 
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException errors, 
followed by:

{code}
2010-02-19 08:49:07,102 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_9144926768183088527_186431 bad datanode[0] nodes == null
2010-02-19 08:49:07,102 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase-kannan1/test1/580635726/actions/1339212970969249937" - Aborting...
2010-02-19 08:49:07,117 FATAL org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog required. Forcing server shutdown
{code}

(ii) During shutdown there were other errors like:

{code}
2010-02-19 08:51:07,557 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region test1,1761194,1266576717079
java.io.IOException: Filesystem closed
2010-02-19 08:51:07,660 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Shutting down HRegionServer: file system not available
java.io.IOException: File system is not available
...

2010-02-19 08:50:07,321 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Tried to hold up flushing for compactions of region test1,1761194,1266576717079 but have waited longer than 90000ms, continuing
2010-02-19 08:50:07,322 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region test1,1761194,1266576717079, flushing=false, writesEnabled=false
2010-02-19 08:50:07,348 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call put([...@39804c99, [Lorg.apache.hadoop.hbase.client.Put;@1624ee4d) from 10.131.1.186:36796: output error
2010-02-19 08:50:07,354 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call put([...@5f3034b2, [Lorg.apache.hadoop.hbase.client.Put;@55d3c2f0) from 10.131.1.186:36796: output error
2010-02-19 08:50:07,354 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 82 on 60020 caught: java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
{code}

After all this, I restarted the RS. But several regions seem to be in an odd 
state in .META. For example, for a particular startkey, I see all these entries:
{code}
test1,1204765,1266569946560 column=info:regioninfo, timestamp=1266581302018, value=REGION => {NAME => 'test1,1204765,1266569946560', STARTKEY => '1204765', ENDKEY => '1441091', ENCODED => 1819368969, OFFLINE => true, SPLIT => true, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'actions', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
test1,1204765,1266569946560 column=info:server, timestamp=1266570029133, value=10.129.68.212:60020
test1,1204765,1266569946560 column=info:serverstartcode, timestamp=1266570029133, value=1266562597546
test1,1204765,1266569946560 column=info:splitB, timestamp=1266581302018, value=\x00\x071441091\x00\x00\x00\x01\x26\xE6\x1F\xDF\x27\x1Btest1,1290703,1266581233447\x00\x071290703\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00\x00\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04trueh\x0FQ\xCF
test1,1204765,1266581233447 column=info:regioninfo, timestamp=1266609172177, value=REGION => {NAME => 'test1,1204765,1266581233447', STARTKEY => '1204765', ENDKEY => '1290703', ENCODED => 1373493090, OFFLINE => true, SPLIT => true, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'actions', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
test1,1204765,1266581233447 column=info:server, timestamp=1266604768670, value=10.129.68.213:60020
test1,1204765,1266581233447 column=info:serverstartcode, timestamp=1266604768670, value=1266562597511
test1,1204765,1266581233447 column=info:splitA, timestamp=1266609172177, value=\x00\x071226169\x00\x00\x00\x01\x26\xE7\xCA,\x7D\x1Btest1,1204765,1266609171581\x00\x071204765\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00\x00\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\xB9\xBD\xFEO
test1,1204765,1266581233447 column=info:splitB, timestamp=1266609172177, value=\x00\x071290703\x00\x00\x00\x01\x26\xE7\xCA,\x7D\x1Btest1,1226169,1266609171581\x00\x071226169\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00\x00\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\xE1\xDF\xF8p
test1,1204765,1266609171581 column=info:regioninfo, timestamp=1266609172212, value=REGION => {NAME => 'test1,1204765,1266609171581', STARTKEY => '1204765', ENDKEY => '1226169', ENCODED => 2134878372, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'actions', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
{code} 
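One way to see the damage above mechanically: for any table, the live (non-offline) region rows should tile the keyspace contiguously, with no overlaps or gaps between one region's endkey and the next region's startkey. A rough sketch of that invariant check follows, on toy types with hand-picked keys (not the HBase client API):

```java
import java.util.*;

// Hypothetical sketch: given (startKey, endKey, offline) triples scraped from
// .META., verify that the live regions chain together with no overlap or gap.
// Lexicographic comparison works here because the keys are fixed-width digits.
public class RegionChainCheck {
    record Region(String start, String end, boolean offline) {}

    static List<String> findProblems(List<Region> regions) {
        List<String> problems = new ArrayList<>();
        List<Region> live = new ArrayList<>();
        for (Region r : regions) if (!r.offline()) live.add(r);
        live.sort(Comparator.comparing(Region::start));
        for (int i = 1; i < live.size(); i++) {
            String prevEnd = live.get(i - 1).end();
            String start = live.get(i).start();
            int cmp = start.compareTo(prevEnd);
            if (cmp < 0) problems.add("overlap at " + start);
            else if (cmp > 0) problems.add("gap before " + start);
        }
        return problems;
    }

    public static void main(String[] args) {
        // Two live regions that both begin at '1204765': a broken chain.
        List<Region> meta = List.of(
            new Region("1204765", "1290703", false),
            new Region("1204765", "1226169", false),
            new Region("1290703", "1441091", false));
        System.out.println(findProblems(meta)); // → [overlap at 1204765, gap before 1290703]
    }
}
```

A meta-scanner repair could run this kind of check per table and, for each flagged spot, decide whether a split parent's daughters need to be re-inserted or stale rows deleted.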

> Mechanism that would not have -ROOT- and .META. on same server caused failed 
> assign of .META.
> ---------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2235
>                 URL: https://issues.apache.org/jira/browse/HBASE-2235
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>             Fix For: 0.20.4, 0.21.0
>
>
> Here is the short story:
> Scenario is a cluster of 3 servers.  Server 1. crashed.  It was carrying the 
> .META.   We split the logs.  .META. is put on the head of the assignment 
> queue.  Server 2. happens to be in a state where it wants to report a split.  
> The master fails the report because there is no .META. (It fails it ugly with 
> a NPE).  Server 3. checks in and falls into the assignment code 
> (RegionManager#regionsAwaitingAssignment).  In here we have this bit of code 
> around line #412:
> {code}
>     if (reassigningMetas && isMetaOrRoot && !isSingleServer) {
>       return regionsToAssign; // dont assign anything to this server.
>     }
> {code}
> Because we think this is not a single-server cluster -- we think there are 
> two 'live' nodes -- we won't assign meta.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
