[ https://issues.apache.org/jira/browse/HBASE-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657660#action_12657660 ]

Andrew Purtell commented on HBASE-1062:
---------------------------------------

> Is it wise postponing memcache flushes? 

I thought safe mode should be essentially "don't touch DFS". 

> We schedule compactions on open and on flush. This would put off the open
> scheduling for interval of 2 minutes. If cluster went down ugly, and some
> regions had References outstanding, then these regions would not be splittable

Wouldn't the references be cleared when the deferred compactions are finally
allowed to run? Then the split would happen. This is what I observe while
testing.

> Do we ever break out of this loop [...] Looks like we increment count then
> set it to zero after sleep. It never progresses?

The code in question sleeps once during the CompactSplitThread main loop when
count becomes greater than limit; count is then reset and the loop continues.
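
As a rough illustration of that behavior, here is a minimal sketch. It is not the actual CompactSplitThread code or the attached patch: the class ThrottleSketch, its fields, and the method processOne are hypothetical names, and the log list exists only so the ordering can be inspected.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the throttling step; not the HBase source. */
class ThrottleSketch {
    private final int limit;         // assumed: work items allowed before a pause
    private final long pauseMillis;  // assumed: length of the single pause
    private int count = 0;
    final List<String> log = new ArrayList<>();

    ThrottleSketch(int limit, long pauseMillis) {
        this.limit = limit;
        this.pauseMillis = pauseMillis;
    }

    /** One pass of the main loop: handle one region, then throttle if needed. */
    void processOne(String regionName) {
        log.add("compact " + regionName);
        count++;
        if (count > limit) {
            log.add("sleep");
            try {
                Thread.sleep(pauseMillis);  // a single sleep, not a wait loop
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            count = 0;  // reset, so the following passes proceed normally
        }
    }
}
```

Because count is reset right after the one sleep, the next pass through the loop does real work again; the pause only spaces out bursts.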

It looks like I still need to be more aggressive about stretching the
compact/split ramp-up over a longer slope, at least given our cluster and
circumstances. The current patch helps, but we can still overwhelm DFS
sometimes after a restart.
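
To make "a longer slope" concrete, one possible shape is a limit that grows linearly with time since startup. This is only a sketch under stated assumptions, not what the patch does: RampSketch, limitAt, and all four parameters are hypothetical.

```java
/** Hypothetical linear ramp for the per-interval limit; not the HBase source. */
class RampSketch {
    /**
     * Returns the allowed limit at a given time since startup, rising
     * linearly from startLimit to fullLimit over rampMillis.
     */
    static int limitAt(long elapsedMillis, long rampMillis,
                       int startLimit, int fullLimit) {
        if (elapsedMillis <= 0) {
            return startLimit;
        }
        if (elapsedMillis >= rampMillis) {
            return fullLimit;
        }
        long span = fullLimit - startLimit;
        return startLimit + (int) (span * elapsedMillis / rampMillis);
    }
}
```

A larger rampMillis gives a gentler slope, so fewer compactions and splits hit DFS in the first minutes after a restart.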


> Compactions at (re)start on a large table can overwhelm DFS
> -----------------------------------------------------------
>
>                 Key: HBASE-1062
>                 URL: https://issues.apache.org/jira/browse/HBASE-1062
>             Project: Hadoop HBase
>          Issue Type: Bug
>          Components: regionserver
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Critical
>             Fix For: 0.20.0
>
>         Attachments: 1062-1.patch
>
>
> Given a large table, > 1000 regions for example, if a cluster restart is 
> necessary, the compactions undertaken by the regionservers when the master 
> makes initial region assignments can overwhelm DFS, leading to file errors 
> and data loss. This condition is exacerbated if write load was heavy before
> the restart, because many regions then want to split as soon as they are
> opened.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
