[ 
https://issues.apache.org/jira/browse/HELIX-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414392#comment-16414392
 ] 

ASF GitHub Bot commented on HELIX-683:
--------------------------------------

GitHub user zhan849 opened a pull request:

    https://github.com/apache/helix/pull/162

    [HELIX-683] clean monitoring cache upon helix controller enable monitoring

    In this PR I added methods to clear the monitoring records in the cache
    when cluster status monitoring is enabled. I also added tests that
    reproduce the following sequence, which causes a metrics reporting
    problem: a resource loses its top state, the controller loses leadership,
    the resource regains its top state, and the controller regains leadership.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhan849/helix harry/controller-monitor-cache-cleanup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/helix/pull/162.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #162
    
----
commit 373da77547fa1ea4a39c760e80da75e9d453d4f5
Author: Harry Zhang <zhan849@...>
Date:   2018-03-26T19:14:07Z

    [HELIX-683] clean monitoring cache upon helix controller enable monitoring

----


> Clean monitoring cache upon helix controller enable monitoring
> --------------------------------------------------------------
>
>                 Key: HELIX-683
>                 URL: https://issues.apache.org/jira/browse/HELIX-683
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Hao Zhang
>            Priority: Major
>
> We found a bug in cluster status reporting, specifically in the partition
> masterless duration metric. The root cause is that the duration is
> calculated from the controller cache, and this cache is currently not
> cleared when leadership changes. As a result, if controller A starts a
> mastership handoff but is interrupted, the start time stays in the cache
> until the next mastership handoff on the same partition. The duration of
> that later handoff is then calculated from the stale start time, so the
> reported value can be far too large.
> To fix it, we could clear the cache when leadership changes.
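
The stale start time described above can be illustrated with a minimal
sketch. This is not the Helix implementation: the class
MissingTopStateTracker, its methods, and the timestamps are all hypothetical
and only model a per-partition start-time cache that is or is not cleared
when the controller regains leadership.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a controller-side cache; not a Helix class.
class MissingTopStateTracker {
    // partition name -> time (ms) at which the top state was first seen missing
    private final Map<String, Long> missingSince = new HashMap<>();

    // Record the start of a handoff; an existing (possibly stale) entry wins.
    void onTopStateMissing(String partition, long nowMs) {
        missingSince.putIfAbsent(partition, nowMs);
    }

    // Return the handoff duration in ms, or -1 if no start time was recorded.
    long onTopStateRecovered(String partition, long nowMs) {
        Long start = missingSince.remove(partition);
        return start == null ? -1L : nowMs - start;
    }

    // The proposed fix: drop all records when monitoring is (re-)enabled.
    void clear() {
        missingSince.clear();
    }

    public static void main(String[] args) {
        // Without cleanup: the start time from the interrupted handoff survives.
        MissingTopStateTracker stale = new MissingTopStateTracker();
        stale.onTopStateMissing("p0", 1_000L);      // handoff starts, then leadership is lost
        stale.onTopStateMissing("p0", 1_000_000L);  // next handoff after regaining leadership
        System.out.println(stale.onTopStateRecovered("p0", 1_000_500L)); // prints 999500

        // With cleanup on regaining leadership: only the new handoff is measured.
        MissingTopStateTracker cleaned = new MissingTopStateTracker();
        cleaned.onTopStateMissing("p0", 1_000L);
        cleaned.clear();                            // leadership regained, cache cleared
        cleaned.onTopStateMissing("p0", 1_000_000L);
        System.out.println(cleaned.onTopStateRecovered("p0", 1_000_500L)); // prints 500
    }
}
```

In this sketch the second handoff actually takes 500 ms, but without the
clear() call it is reported as 999500 ms because the earlier start time is
still cached.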


