[
https://issues.apache.org/jira/browse/HBASE-15436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198866#comment-15198866
]
Anoop Sam John commented on HBASE-15436:
----------------------------------------
There are some must-fix things:
1. The BufferedMutator flush keeps retrying and taking more time. It kicks
in once the size of all Mutations accumulated so far meets the flush size
(say 2 MB). While the flush is running we keep accepting new mutations into
the list, and this may lead to a client-side OOME! We may need to accept
more mutations after a background flush has started; normally things will
get moving fast enough. But this cannot be unbounded. There should be a cap
size above which we block the writes and take nothing more. Maybe something
like 1.5 times the flush size. (A back-pressure sketch follows the list.)
2. The row lookups into META happen one row at a time. In this case one row
lookup failed after 36 retries, each with a 1 min timeout. Isn't the 1 min
timeout itself too high? And even then it only fails that one Mutation and
continues with the remaining ones. What if we did a multi Get to the META
table to learn the region locations for N mutations at a time? (See the
second sketch below the list.)
3. When close() is explicitly called on BufferedMutator, we try for a
graceful shutdown (ie. wait for any in-progress flush and/or call flush
before close). What if the cluster is down and this takes too long? How long
should we wait? Should we come out faster, possibly losing some Mutations?
(That loss is in any case known.) A bounded-close sketch is the last one
below.
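For point 1, a minimal back-pressure sketch, assuming a hard cap of 1.5x the
flush-trigger size. The class and method names (CappedWriteBuffer, add,
onFlushed, startBackgroundFlush) are illustrative, not the actual
BufferedMutatorImpl internals:
{code:java}
// Hypothetical back-pressure sketch: block writers once buffered bytes exceed
// a hard cap (here 1.5x the flush-trigger size) instead of buffering without
// bound. All names here are illustrative, not the real HBase client API.
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class CappedWriteBuffer {
  private final long flushSize;  // size at which a background flush kicks off
  private final long hardCap;    // absolute limit; writers block above this
  private long buffered;         // bytes accumulated but not yet flushed
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition belowCap = lock.newCondition();

  public CappedWriteBuffer(long flushSize) {
    this.flushSize = flushSize;
    this.hardCap = (long) (flushSize * 1.5);  // e.g. 3 MB for a 2 MB flush size
  }

  /** Accept a mutation of the given heap size, blocking while over the cap. */
  public void add(long mutationSize) throws InterruptedException {
    lock.lock();
    try {
      while (buffered + mutationSize > hardCap) {
        belowCap.await();            // back-pressure instead of client-side OOME
      }
      buffered += mutationSize;
      if (buffered >= flushSize) {
        startBackgroundFlush();      // placeholder for the async flush kick-off
      }
    } finally {
      lock.unlock();
    }
  }

  /** Called by the flusher once a batch has actually been written out. */
  public void onFlushed(long flushedBytes) {
    lock.lock();
    try {
      buffered -= flushedBytes;
      belowCap.signalAll();          // wake writers blocked on the cap
    } finally {
      lock.unlock();
    }
  }

  private void startBackgroundFlush() {
    // elided: hand the buffered mutations to a background flush thread
  }
}
{code}
Whether 1.5x is the right multiplier is a tuning question; the point is only
that a limit exists at all.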
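For point 2, a sketch of the batching idea: resolve region locations for a
whole batch with one multi Get against hbase:meta instead of N sequential
lookups. The metaKeyForRow() helper is hypothetical; the real META lookup
uses a reverse scan with a specially constructed key, which is elided here:
{code:java}
// Sketch only: one batched RPC to hbase:meta for a whole list of mutations.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

public class BatchedMetaLookup {
  /** Resolve region info for N mutations with a single round trip to META. */
  public static Result[] locateAll(Connection conn, TableName userTable,
      List<? extends Mutation> mutations) throws IOException {
    List<Get> gets = new ArrayList<>(mutations.size());
    for (Mutation m : mutations) {
      gets.add(new Get(metaKeyForRow(userTable, m.getRow())));
    }
    try (Table meta = conn.getTable(TableName.META_TABLE_NAME)) {
      return meta.get(gets);  // one multi Get instead of N sequential lookups
    }
  }

  private static byte[] metaKeyForRow(TableName table, byte[] row) {
    // Hypothetical helper: building the real meta row key
    // (table,startKey,regionId) and the closest-row-before semantics are
    // beyond this sketch.
    throw new UnsupportedOperationException("illustrative only");
  }
}
{code}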
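For point 3, a sketch of a bounded graceful close, assuming we are willing
to drop pending Mutations once a deadline passes. The timeout parameter and
the class/method names are illustrative; a real fix would need support
inside BufferedMutatorImpl itself:
{code:java}
// Sketch: run the final flush on a worker thread and give up after a
// deadline rather than waiting forever on a dead cluster.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.apache.hadoop.hbase.client.BufferedMutator;

public class BoundedClose {
  /** Flush with a deadline; on timeout, abandon the pending Mutations. */
  public static void closeWithTimeout(BufferedMutator mutator, long timeoutSec)
      throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    Future<?> flush = pool.submit(() -> {
      mutator.flush();     // may retry for a long time if the cluster is down
      return null;
    });
    try {
      flush.get(timeoutSec, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
      flush.cancel(true);  // known data loss: pending Mutations are dropped
    } finally {
      pool.shutdownNow();
      mutator.close();     // may still block; a real fix needs internal support
    }
  }
}
{code}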
> BufferedMutatorImpl.flush() appears to get stuck
> ------------------------------------------------
>
> Key: HBASE-15436
> URL: https://issues.apache.org/jira/browse/HBASE-15436
> Project: HBase
> Issue Type: Bug
> Components: Client
> Affects Versions: 1.0.2
> Reporter: Sangjin Lee
> Attachments: hbaseException.log, threaddump.log
>
>
> We noticed an instance where the thread that was executing a flush
> ({{BufferedMutatorImpl.flush()}}) got stuck when the (local one-node) cluster
> shut down and was unable to get out of that stuck state.
> The setup is a single node HBase cluster, and apparently the cluster went
> away when the client was executing flush. The flush eventually logged a
> failure after 30+ minutes of retrying. That is understandable.
> What is unexpected is that the thread is stuck in this state (i.e. in the
> {{flush()}} call). I would have expected the {{flush()}} call to return
> after the complete failure.