[ 
https://issues.apache.org/jira/browse/HBASE-19978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368720#comment-16368720
 ] 

Duo Zhang commented on HBASE-19978:
-----------------------------------

In the old implementation, any worker can time out; the only check is that if 
the number of workers is not greater than the core size then we do not time 
out. But this is not stable: all the threads can run the check at the same time 
and then all time out. The default timeout value is Long.MAX_VALUE, so this is 
not a problem for us now...
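
The instability described above is a check-then-act race. Here is a minimal 
sketch of it, not the actual ProcedureExecutor code: the class, the method 
names and the use of an AtomicInteger counter are all assumptions made for 
illustration. It contrasts a plain "am I above the core size?" check with a 
compare-and-set that lets only one worker claim each exit.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; names do not come from the HBase code base. */
public class KeepAliveCheckSketch {

    /** Unsafe: the check is separate from the later decrement. */
    static boolean mayExitUnsafe(AtomicInteger activeWorkers, int coreSize) {
        return activeWorkers.get() > coreSize;
    }

    /** Safer: check and decrement happen in one atomic compare-and-set. */
    static boolean tryExit(AtomicInteger activeWorkers, int coreSize) {
        while (true) {
            int current = activeWorkers.get();
            if (current <= coreSize) {
                return false; // at or below core size, keep running
            }
            if (activeWorkers.compareAndSet(current, current - 1)) {
                return true; // this worker claimed the exit
            }
            // Another worker changed the count; re-read and retry.
        }
    }

    /** Models every worker running the check before any of them exits. */
    static int simulateUnsafe(int workers, int coreSize) {
        AtomicInteger active = new AtomicInteger(workers);
        boolean[] decided = new boolean[workers];
        for (int i = 0; i < workers; i++) {
            decided[i] = mayExitUnsafe(active, coreSize);
        }
        for (boolean exit : decided) {
            if (exit) {
                active.decrementAndGet();
            }
        }
        return active.get();
    }

    /** Same scenario, but each worker uses the atomic tryExit. */
    static int simulateSafe(int workers, int coreSize) {
        AtomicInteger active = new AtomicInteger(workers);
        for (int i = 0; i < workers; i++) {
            tryExit(active, coreSize);
        }
        return active.get();
    }

    public static void main(String[] args) {
        System.out.println("unsafe: " + simulateUnsafe(5, 4));
        System.out.println("safe:   " + simulateSafe(5, 4));
    }
}
```

With 5 workers and a core size of 4, the unsafe variant lets every worker 
decide to exit, while the atomic variant stops once the count reaches the core 
size.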

Also, we add a new thread in the stuck checker only if all the workers are in 
use. I believe the reason is that there is no upper limit on the thread count 
and, as said above, we do not have a timeout by default, so we have to be 
careful when adding a new thread.

In the new implementation, I set an upper limit on the number of workers, and 
the keepalive time defaults to 1 minute, so we can add new threads more 
aggressively: we will stop at the limit, and the thread count will soon be back 
to normal. I think this could speed up our failover. The core workers never 
time out; only new workers added by the stuck checker can time out.
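
The policy described above can be sketched roughly as follows. This is an 
illustration under stated assumptions, not the patch itself: the class name, 
the method names and the pool sizes are hypothetical, and only the 1 minute 
keepalive and the "core workers never time out" rule come from the comment.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the bounded worker policy; not the HBase code. */
public class WorkerPoolPolicySketch {
    private final int maxPoolSize;      // upper limit on workers (value assumed)
    private final long keepAliveMillis; // idle time before an extra worker exits
    private int workerCount;            // starts at the core pool size

    WorkerPoolPolicySketch(int corePoolSize, int maxPoolSize, long keepAliveMillis) {
        this.maxPoolSize = maxPoolSize;
        this.keepAliveMillis = keepAliveMillis;
        this.workerCount = corePoolSize;
    }

    /** Stuck checker path: add a worker unless the upper limit is reached. */
    synchronized boolean tryAddWorker() {
        if (workerCount >= maxPoolSize) {
            return false;
        }
        workerCount++;
        return true;
    }

    /** Idle worker path: core workers never exit, extra workers exit after keepalive. */
    synchronized boolean shouldExit(boolean coreWorker, long idleMillis) {
        if (coreWorker || idleMillis < keepAliveMillis) {
            return false;
        }
        workerCount--;
        return true;
    }

    synchronized int workerCount() {
        return workerCount;
    }

    public static void main(String[] args) {
        long keepAlive = TimeUnit.MINUTES.toMillis(1);
        WorkerPoolPolicySketch pool = new WorkerPoolPolicySketch(2, 4, keepAlive);
        pool.tryAddWorker();
        pool.tryAddWorker();
        System.out.println("third add allowed: " + pool.tryAddWorker());
        System.out.println("extra worker exits: " + pool.shouldExit(false, keepAlive));
        System.out.println("workers now: " + pool.workerCount());
    }
}
```

Because the count is capped, the stuck checker can add workers eagerly without 
risking unbounded thread growth, and the keepalive brings the count back down 
once the extra workers go idle.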

Thanks.

> The keepalive logic is incomplete in ProcedureExecutor
> ------------------------------------------------------
>
>                 Key: HBASE-19978
>                 URL: https://issues.apache.org/jira/browse/HBASE-19978
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>            Priority: Major
>             Fix For: 2.0.0-beta-2
>
>         Attachments: HBASE-19978-v1.patch, HBASE-19978.patch
>
>
> The worker thread will just exit after the keep alive time, and we never add 
> it back. The only way to add it back is through the stuck checker, which is 
> not correct. Here we should start a new worker thread if the count is under 
> the core pool size and there are pending procedures.
> For now the default keep alive time is Long.MAX_VALUE which means no timeout 
> so no problem, but we do allow users to set it so we need to fix it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
