[ https://issues.apache.org/jira/browse/IGNITE-14474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rodion Smolnikov updated IGNITE-14474:
--------------------------------------
Reviewer: Vyacheslav Koptilin (was: Denis Chudov)
> Improve error message in case rebalance fails
> ---------------------------------------------
>
> Key: IGNITE-14474
> URL: https://issues.apache.org/jira/browse/IGNITE-14474
> Project: Ignite
> Issue Type: Improvement
> Affects Versions: 2.5
> Reporter: Denis Chudov
> Assignee: Rodion Smolnikov
> Priority: Major
> Fix For: 2.9.2
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> Currently we can get messages like the following when rebalance fails with an
> exception (examples are from Ignite 2.5; in newer versions the log messages
> have changed, but the problem still exists):
> {code:java}
> 2019-11-27 13:41:14,504[WARN ][utility-#79%xxx%][GridDhtPartitionDemander]
> Rebalancing from node cancelled [grp=ignite-sys-cache,
> topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1],
> supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topic=0]. Supply message
> couldn't be unmarshalled: class o.a.i.IgniteCheckedException: Failed to
> unmarshal object with optimized marshaller
> 2019-11-27 13:41:14,504[INFO ][utility-#79%xxx%][GridDhtPartitionDemander]
> Cancelled rebalancing [grp=ignite-sys-cache,
> supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topVer=AffinityTopologyVersion
> [topVer=1932, minorTopVer=1], time=88 ms]
> 2019-11-27 13:41:14,508[WARN ][utility-#76%xxx%][GridDhtPartitionDemander]
> Rebalancing from node cancelled [grp=ignite-sys-cache,
> topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1],
> supplier=dfa5ee06-48c9-4458-ae55-48cc6ceda998, topic=0]. Supply message
> couldn't be unmarshalled: class o.a.i.IgniteCheckedException: Failed to
> unmarshal object with optimized marshaller
> {code}
> In the case above, a marshalling exception leads to a rebalance failure that
> will never be resolved, i.e. the cluster enters an erroneous state.
> We should report issues like this at ERROR level. The message should explain
> that the rebalance has failed, data for the cache was not fully copied to the
> node, the backup factor has not been restored, and the cluster may not work
> correctly.
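
For illustration only, the message requested in the description could be assembled roughly as follows. This is a minimal sketch, not Ignite code: the class RebalanceFailureMessage, its format method and the exact wording are hypothetical, and java.util.logging stands in for Ignite's own logger (Level.SEVERE plays the role of ERROR here).

```java
import java.util.UUID;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper (not part of Ignite) that builds the proposed error text. */
final class RebalanceFailureMessage {
    private RebalanceFailureMessage() {
        // Utility class, no instances.
    }

    /**
     * Builds a message stating that rebalance failed and what that means for the cluster.
     * Parameter names mirror the fields seen in the log excerpt (grp, topVer, minorTopVer, supplier).
     */
    static String format(String grp, long topVer, int minorTopVer, UUID supplier, Throwable cause) {
        return "Rebalancing failed [grp=" + grp
            + ", topVer=" + topVer
            + ", minorTopVer=" + minorTopVer
            + ", supplier=" + supplier + "]. "
            + "Data for the cache group was not fully copied to this node, "
            + "the backup factor has not been restored and the cluster may not work correctly. "
            + "Cause: " + cause;
    }

    public static void main(String[] args) {
        // Stand-in logger; Ignite uses its own logging abstraction.
        Logger log = Logger.getLogger("GridDhtPartitionDemander");

        Exception cause = new Exception("Failed to unmarshal object with optimized marshaller");
        UUID supplier = UUID.fromString("f014f30a-77f2-4459-aa5b-6c12907a7449");

        // SEVERE is the java.util.logging counterpart of ERROR.
        log.log(Level.SEVERE, format("ignite-sys-cache", 1932L, 1, supplier, cause), cause);
    }
}
```

Passing the original exception to the logger as well as embedding it in the text keeps the stack trace available, which the current WARN message drops.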