[ 
https://issues.apache.org/jira/browse/HBASE-10312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated HBASE-10312:
---------------------------------------

    Attachment: HBASE-10312.java

While running {{TestHRegion}}, I saw it failing about 50% of the time on 
{{testgetHDFSBlocksDistribution}} because the mini cluster shuts down while 
it's still being initialized. Digging showed that {{testWritesWhileGetting}} 
flushes like mad and completely swamps {{TaskMonitor}}, so much so that 
{{purgeExpiredTasks}} can block for seconds on sublisting. That blocking 
prevents the region servers from starting their RPC servers fast enough, and in 
the meantime the master gives up on trying to assign meta (WTF!) and then 
just sits there doing nothing until the {{HMaster}} creation times out. That 
is why the cluster shuts down while trying to boot up.

The patch I'm attaching makes {{TestHRegion}} work 100% of the time by using a 
{{CircularFifoBuffer}} ([~stack]'s idea). I'm positive that it also fixes your 
issue, [~apurtell].
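
For readers without the attachment handy, here is a rough sketch of the idea 
behind a circular FIFO: once the buffer is full, each add evicts the oldest 
entry in constant time, so nothing ever has to be purged by sublisting. This 
is not the patch itself. {{BoundedTaskLog}} and its methods are hypothetical 
names, and it uses the JDK's {{ArrayDeque}} in place of Commons Collections' 
{{CircularFifoBuffer}} so that it compiles without extra dependencies.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical bounded FIFO of task descriptions. Illustration only:
 * this is NOT HBase's TaskMonitor and NOT the attached patch.
 */
final class BoundedTaskLog {
    private final int capacity;
    private final ArrayDeque<String> tasks;

    BoundedTaskLog(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.tasks = new ArrayDeque<>(capacity);
    }

    /** Adds a task; if the log is full, the oldest task is dropped first. */
    synchronized void add(String task) {
        if (tasks.size() == capacity) {
            tasks.pollFirst(); // constant-time eviction, no sublist views
        }
        tasks.addLast(task);
    }

    /** Returns a copy of the retained tasks, oldest first. */
    synchronized List<String> snapshot() {
        return new ArrayList<>(tasks);
    }

    synchronized int size() {
        return tasks.size();
    }

    public static void main(String[] args) {
        BoundedTaskLog log = new BoundedTaskLog(3);
        for (int i = 1; i <= 5; i++) {
            log.add("flush-" + i);
        }
        System.out.println(log.snapshot()); // prints [flush-3, flush-4, flush-5]
    }
}
```

The size stays capped no matter how fast flushes arrive, at the cost of 
silently forgetting the oldest tasks.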

> Flooding the cluster with administrative actions leads to collapse
> ------------------------------------------------------------------
>
>                 Key: HBASE-10312
>                 URL: https://issues.apache.org/jira/browse/HBASE-10312
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>             Fix For: 0.99.0
>
>         Attachments: HBASE-10312.java
>
>
> Steps to reproduce:
> 1. Start a cluster.
> 2. Start an ingest process.
> 3. In the HBase shell, do this:
> {noformat}
> while true do
>    flush 'table'
> end
> {noformat}
> We should reject abuse via administrative requests like this.
> What happens on the cluster is the requests back up, leading to lots of these:
> {noformat}
> 2014-01-10 18:55:55,293 WARN  [Priority.RpcServer.handler=2,port=8120] 
> monitoring.TaskMonitor: Too many actions in action monitor! Purging some.
> {noformat}
> At this point we could lower a gate on further requests for actions until the 
> backlog clears.
> Continuing, all of the regionservers will eventually die with a 
> StackOverflowError of unknown origin because, stack overflow:
> {noformat}
> 2014-01-10 19:02:02,783 ERROR [Priority.RpcServer.handler=3,port=8120] 
> ipc.RpcServer: Unexpected throwable object java.lang.StackOverflowError
>         at java.util.ArrayList$SubList.add(ArrayList.java:965)
> [...]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
