[ https://issues.apache.org/jira/browse/HBASE-15436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198866#comment-15198866 ]

Anoop Sam John commented on HBASE-15436:
----------------------------------------

There are a few things that must be fixed:
1. The BufferedMutator flush keeps retrying and takes more and more time. It 
kicked in because the size of all Mutations accumulated so far met the flush 
size (say 2 MB). While the flush is taking its time we keep accepting new 
mutations into the list, which can lead to a client side OOME. It is fine to 
keep accepting mutations after a background flush has started, and normally 
things keep moving fast enough, but this cannot be unbounded. There should be 
a cap size above which we block the writes and accept nothing more, maybe 
something like 1.5 times the flush size (see the sketch after this list).
2. The row lookups into META happen one row at a time. As a result, a single 
row lookup only fails after 36 retries, each with a 1 minute timeout. Isn't 
the 1 minute timeout itself too high? And even after all that, it just fails 
this one Mutation and continues with the remaining ones. What if we did a 
multi Get to the META table to find the region locations for N mutations at a 
time? (The batching idea is illustrated from the caller's side after this list.)
3. When close() is explicitly called on BufferedMutator, we try for a graceful 
shutdown (i.e. wait for any flush already in progress and/or call flush before 
close). In that case, what if the cluster is down and this takes too long? How 
long should we wait? Should we come out faster, possibly losing some 
Mutations, which is in any case a known trade-off? (A caller-side sketch of 
such a bounded close is also given below.)
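
Below is a minimal sketch of the capping idea from point 1, written against the 
public BufferedMutator API. The CappedMutator class, its flushSize/hardCap 
fields, the 1.5x factor and the single-thread flusher are assumptions made up 
for the example, not the real BufferedMutatorImpl internals.

{code:java}
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Mutation;

// Illustrative wrapper only; names and sizes are assumptions, not BufferedMutatorImpl code.
public class CappedMutator {
  private final BufferedMutator delegate;
  private final long flushSize;   // size that triggers a background flush, e.g. 2 MB
  private final long hardCap;     // absolute cap, e.g. 1.5 * flushSize
  private final ExecutorService flusher = Executors.newSingleThreadExecutor();

  private long bufferedBytes = 0;          // guarded by "this"
  private boolean flushInProgress = false; // guarded by "this"

  public CappedMutator(BufferedMutator delegate, long flushSize) {
    this.delegate = delegate;
    this.flushSize = flushSize;
    this.hardCap = (long) (flushSize * 1.5);
  }

  public synchronized void mutate(Mutation m) throws IOException, InterruptedException {
    // Keep accepting mutations while a background flush runs, but only up to the
    // hard cap; above it the writer blocks instead of risking a client side OOME.
    while (bufferedBytes >= hardCap) {
      wait();
    }
    delegate.mutate(m);
    bufferedBytes += m.heapSize();
    if (bufferedBytes >= flushSize && !flushInProgress) {
      flushInProgress = true;
      final long snapshot = bufferedBytes;
      flusher.submit(() -> backgroundFlush(snapshot));
    }
  }

  private void backgroundFlush(long flushedBytes) {
    try {
      delegate.flush();
    } catch (IOException e) {
      // a real implementation would surface this to the caller
    } finally {
      synchronized (this) {
        bufferedBytes -= flushedBytes;
        flushInProgress = false;
        notifyAll(); // unblock any writer parked at the hard cap
      }
    }
  }
}
{code}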
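
For point 2, a real fix needs a multi Get against hbase:meta inside the client 
itself, which is not reachable from the public API. The fragment below only 
illustrates the grouping side of the idea from the caller's perspective: resolve 
locations for a whole batch up front (still one RegionLocator call per row, 
answered from the location cache after the first miss) and bucket the mutations 
per region. Class and method names are made up for the example.

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.RegionLocator;

// Caller-side illustration only; the actual proposal is a batched META lookup in the client.
public final class BatchLocationLookup {
  private BatchLocationLookup() {}

  public static Map<String, List<Mutation>> groupByRegion(
      Connection conn, TableName table, List<? extends Mutation> batch) throws IOException {
    Map<String, List<Mutation>> byRegion = new HashMap<>();
    try (RegionLocator locator = conn.getRegionLocator(table)) {
      for (Mutation m : batch) {
        // Still one lookup per row here; the proposal in point 2 would collapse the
        // cache misses of the whole batch into a single multi Get against META.
        HRegionLocation loc = locator.getRegionLocation(m.getRow());
        byRegion.computeIfAbsent(loc.getRegionInfo().getEncodedName(),
            k -> new ArrayList<>()).add(m);
      }
    }
    return byRegion;
  }
}
{code}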
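
And for point 3, until the client itself bounds a graceful close(), a caller can 
impose its own limit. A rough sketch, assuming the timeout value and the decision 
to drop still-buffered Mutations are acceptable to the application:

{code:java}
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.hadoop.hbase.client.BufferedMutator;

// Caller-side policy sketch; the timeout is an assumption, not current client behaviour.
public final class BoundedClose {
  private BoundedClose() {}

  public static void closeWithTimeout(BufferedMutator mutator, long timeoutSeconds)
      throws IOException, InterruptedException {
    ExecutorService es = Executors.newSingleThreadExecutor();
    Future<?> f = es.submit(() -> {
      mutator.close(); // flushes what is buffered; can block for a long time if the cluster is gone
      return null;
    });
    try {
      f.get(timeoutSeconds, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
      // Come out faster: abandon the graceful flush, knowingly losing the
      // Mutations that were still buffered.
      f.cancel(true);
    } catch (ExecutionException e) {
      throw new IOException(e.getCause());
    } finally {
      es.shutdownNow();
    }
  }
}
{code}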

> BufferedMutatorImpl.flush() appears to get stuck
> ------------------------------------------------
>
>                 Key: HBASE-15436
>                 URL: https://issues.apache.org/jira/browse/HBASE-15436
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 1.0.2
>            Reporter: Sangjin Lee
>         Attachments: hbaseException.log, threaddump.log
>
>
> We noticed an instance where the thread that was executing a flush 
> ({{BufferedMutatorImpl.flush()}}) got stuck when the (local one-node) cluster 
> shut down and was unable to get out of that stuck state.
> The setup is a single node HBase cluster, and apparently the cluster went 
> away when the client was executing flush. The flush eventually logged a 
> failure after 30+ minutes of retrying. That is understandable.
> What is unexpected is that the thread is stuck in this state (i.e. in the 
> {{flush()}} call). I would have expected the {{flush()}} call to return after 
> the complete failure.


