[ https://issues.apache.org/jira/browse/GEODE-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523880#comment-16523880 ]
ASF subversion and git services commented on GEODE-5349:
--------------------------------------------------------
Commit e388d8f4b6c861900279779388d7ceb0c004cd9a in geode's branch
refs/heads/feature/GEODE-5349 from [~bschuchardt]
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=e388d8f ]
GEODE-5349 State-flush operation may exit early allowing for cache inconsistency
Removed the ability for DistributionAdvisor.waitForCurrentOperations() to
exit before the operation count falls to zero. Instead of exiting, it issues
a fatal-level log message, which translates into a severe-level alert for
operators. This can help tech support identify which server a customer
should terminate in order to break a distributed deadlock.
I also added an info-level message, issued when the wait completes if a
warning- or fatal-level message was logged during the wait. This parallels
what we do in ReplyProcessor21 when we have warned that a cache-op response
was not received within the ack-wait-threshold period.
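
For readers who do not have the commit at hand, the behavior described above can be sketched roughly as follows. This is an illustration only and not the code in DistributionAdvisor: the class name, the parameters (warnAfterMillis, fatalAfterMillis, pollMillis), the polling loop, and the log wording are all assumptions made for the example, and log lines are collected in a list rather than sent to a real logger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; none of these names or thresholds come from Geode.
class StateFlushWaitSketch {

  /**
   * Blocks until inProgress reaches zero and returns the log lines produced.
   * There is deliberately no timeout-based exit: a long wait only escalates
   * the log level, it never ends the wait.
   */
  static List<String> waitForCurrentOperations(
      AtomicInteger inProgress, long warnAfterMillis, long fatalAfterMillis, long pollMillis) {
    List<String> log = new ArrayList<>();
    long start = System.nanoTime();
    boolean warned = false;
    boolean fatal = false;
    boolean interrupted = false;

    while (inProgress.get() > 0) {
      long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
      if (!warned && elapsedMillis >= warnAfterMillis) {
        log.add("WARN: waited " + elapsedMillis + " ms for current operations");
        warned = true;
      }
      if (!fatal && elapsedMillis >= fatalAfterMillis) {
        log.add("FATAL: still waiting after " + elapsedMillis + " ms, operations="
            + inProgress.get());
        fatal = true;
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        interrupted = true; // keep waiting; the flag is restored on the way out
      }
    }

    // Tell operators the stall is over, but only if they were alerted to one.
    if (warned || fatal) {
      log.add("INFO: wait for current operations completed");
    }
    if (interrupted) {
      Thread.currentThread().interrupt();
    }
    return log;
  }

  public static void main(String[] args) {
    AtomicInteger ops = new AtomicInteger(1);
    Thread slowOp = new Thread(() -> {
      try {
        Thread.sleep(300); // simulate a slow cache operation
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      ops.decrementAndGet();
    });
    slowOp.start();
    waitForCurrentOperations(ops, 50, 100, 10).forEach(System.out::println);
  }
}
```

In the sketch a stuck operation keeps the caller blocked indefinitely, so the fatal-level line is the only signal an operator gets that intervention may be needed.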
> State-flush operation may terminate waiting for current operations, allowing
> for cache inconsistency
> ----------------------------------------------------------------------------------------------------
>
> Key: GEODE-5349
> URL: https://issues.apache.org/jira/browse/GEODE-5349
> Project: Geode
> Issue Type: Bug
> Components: regions
> Reporter: Bruce Schuchardt
> Assignee: Bruce Schuchardt
> Priority: Major
>
> The state-flush operation relies in part on
> DistributionAdvisor.waitForCurrentOperations() to stall until in-process
> replication efforts have written their messages to communication channels.
> This method currently has a self-imposed time limit of
> (2*ack-wait-threshold)-1 seconds, which defaults to 29 seconds. If a cache
> operation, say a transaction commit, happens to take longer than this, the
> waitForCurrentOperations() method will terminate early, possibly allowing a
> new copy of a region to miss the changes contained in that cache operation.
> We should remove the timeout in waitForCurrentOperations and rigorously test
> the change.