[ https://issues.apache.org/jira/browse/HBASE-8321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633070#comment-13633070 ]
Jeffrey Zhong commented on HBASE-8321:
--------------------------------------
The second patch looks good to me (+1), with one small comment:
{code}
+    report_period = conf.getInt("hbase.splitlog.report.period",
+      conf.getInt("hbase.splitlog.manager.timeout",
+        SplitLogManager.DEFAULT_TIMEOUT) / 2);
....
   public boolean progress() {
-    if (!attemptToOwnTask(false)) {
-      LOG.warn("Failed to heartbeat the task" + currentTask);
-      return false;
+    long t = EnvironmentEdgeManager.currentTimeMillis();
+    if ((t - last_report_at) > report_period) {
+      last_report_at = t;
+      if (!attemptToOwnTask(false)) {
+        LOG.warn("Failed to heartbeat the task" + currentTask);
+        return false;
+      }
{code}
In the latest patch, we heartbeat only once a report_period has elapsed, which
by default is half the SplitLogManager timeout. If SplitLogWorker misses one
heartbeat (e.g. it tries to report right before a report_period has elapsed)
and the next report takes a little longer than one report_period, then the
task will be preempted by SplitLogManager.
Therefore, I'd suggest we change the report_period default value to one fifth
of the SplitLogManager timeout, or something you think is more appropriate.
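To make the timing concern concrete, here is a minimal standalone sketch of the throttled heartbeat, not the actual HBase code. The class ThrottledHeartbeat, the helper maxHeartbeatGapMs, the injected clock and heartbeat callbacks, and the 25000 ms timeout used below are all assumptions made for illustration; only the (t - last_report_at) > report_period check mirrors the patch.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for the throttling in progress(); not HBase code. */
final class ThrottledHeartbeat {
  private final LongSupplier clock;        // stands in for EnvironmentEdgeManager.currentTimeMillis()
  private final BooleanSupplier heartbeat; // stands in for attemptToOwnTask(false)
  private final long reportPeriodMs;
  private long lastReportAt;

  ThrottledHeartbeat(LongSupplier clock, BooleanSupplier heartbeat,
      long timeoutMs, int divisor) {
    this.clock = clock;
    this.heartbeat = heartbeat;
    this.reportPeriodMs = timeoutMs / divisor; // the patch uses a divisor of 2
    this.lastReportAt = clock.getAsLong();
  }

  /** Same check as the patch: heartbeat only after a full report period. */
  boolean progress() {
    long t = clock.getAsLong();
    if ((t - lastReportAt) > reportPeriodMs) {
      lastReportAt = t;
      return heartbeat.getAsBoolean();
    }
    return true; // nothing sent this time; the task is assumed to be still owned
  }

  /**
   * Simulates progress() being called every callIntervalMs and returns the
   * longest gap observed between two heartbeats (or between the start of
   * the task and the first heartbeat).
   */
  static long maxHeartbeatGapMs(long timeoutMs, int divisor,
      long callIntervalMs, int calls) {
    long[] now = {0L};
    long[] lastBeatAt = {0L};
    long[] maxGap = {0L};
    ThrottledHeartbeat hb = new ThrottledHeartbeat(
        () -> now[0],
        () -> {
          maxGap[0] = Math.max(maxGap[0], now[0] - lastBeatAt[0]);
          lastBeatAt[0] = now[0];
          return true;
        },
        timeoutMs, divisor);
    for (int i = 1; i <= calls; i++) {
      now[0] = i * callIntervalMs;
      hb.progress();
    }
    return maxGap[0];
  }
}
```

With an assumed 25000 ms timeout and progress() called every 12000 ms, a divisor of 2 gives a report period of 12500 ms, so the first call just misses the period and heartbeats arrive 24000 ms apart, which is very close to the timeout. With a divisor of 5, every call heartbeats and the gap drops to 12000 ms.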
> Log split worker should heartbeat to avoid timeout when the hlog is under
> recovery
> ----------------------------------------------------------------------------------
>
> Key: HBASE-8321
> URL: https://issues.apache.org/jira/browse/HBASE-8321
> Project: HBase
> Issue Type: Bug
> Components: wal
> Reporter: Jimmy Xiang
> Assignee: Jimmy Xiang
> Attachments: trunk-8321_v1.patch, trunk-8321_v2.patch
>
>
> Currently, the hlog splitter could spend quite some time splitting a log if
> there is an HDFS issue and recoverLease/retry opening is needed. If the
> distributed log split manager times out the log worker, the next log worker
> to take over will run into the same issue.
> Ideally, we should not need a timeout monitor. Since we have a timeout
> monitor for DSL now, the worker should heartbeat to avoid wrong/unneeded
> timeouts.