[jira] [Updated] (IGNITE-14474) Improve error message in case rebalance fails

Rodion Smolnikov (Jira) Wed, 02 Jun 2021 04:03:04 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-14474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rodion Smolnikov updated IGNITE-14474:
--------------------------------------
    Description: 
Currently we can get a message like this when rebalance fails with an exception 
(examples are from Ignite 2.5; in newer versions the log messages were changed, 
but the problem still exists):
{code:java}
2019-11-27 13:41:14,504[WARN ][utility-#79%xxx%][GridDhtPartitionDemander] 
Rebalancing from node cancelled [grp=ignite-sys-cache, 
topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1], 
supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topic=0]. Supply message 
couldn't be unmarshalled: class o.a.i.IgniteCheckedException: Failed to 
unmarshal object with optimized marshaller
2019-11-27 13:41:14,504[INFO ][utility-#79%xxx%][GridDhtPartitionDemander] 
Cancelled rebalancing [grp=ignite-sys-cache, 
supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topVer=AffinityTopologyVersion 
[topVer=1932, minorTopVer=1], time=88 ms]
2019-11-27 13:41:14,508[WARN ][utility-#76%xxx%][GridDhtPartitionDemander] 
Rebalancing from node cancelled [grp=ignite-sys-cache, 
topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1], 
supplier=dfa5ee06-48c9-4458-ae55-48cc6ceda998, topic=0]. Supply message 
couldn't be unmarshalled: class o.a.i.IgniteCheckedException: Failed to 
unmarshal object with optimized marshaller
{code}
In the case above, a marshalling exception leads to a rebalance failure that 
will never be resolved, i.e. the cluster enters an erroneous state.

We should report issues like this as ERROR. The message should explain that the 
rebalance has failed, data for the cache was not fully copied to the node, the 
backup factor is not recovered and the cluster may not work correctly.
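
The reporting rule described above could be sketched roughly as follows. This is only an illustration, not Ignite code: the class RebalanceFailureLevel, the Outcome enum, and both method names are hypothetical, and the sketch assumes a benign cancellation (for example after a topology change) can be told apart from a failure caused by an exception. It uses java.util.logging levels, where SEVERE stands in for ERROR.

```java
import java.util.logging.Level;

// Hypothetical helper, not part of Ignite: picks a log level and builds a
// message for a rebalance that did not complete.
class RebalanceFailureLevel {
    /** Assumed outcome categories; these are not Ignite types. */
    enum Outcome { CANCELLED_BY_TOPOLOGY_CHANGE, FAILED_WITH_EXCEPTION }

    /** A failure that will never resolve on its own is reported as SEVERE (ERROR). */
    static Level levelFor(Outcome outcome) {
        return outcome == Outcome.FAILED_WITH_EXCEPTION ? Level.SEVERE : Level.WARNING;
    }

    /** Spells out the consequences listed in the description above. */
    static String describe(String grp, String supplier, Throwable cause) {
        return "Rebalancing has failed, data for the cache was not fully copied to this node, "
            + "the backup factor is not recovered and the cluster may not work correctly "
            + "[grp=" + grp + ", supplier=" + supplier + ", cause=" + cause + "]";
    }
}
```

The point of the sketch is only that the level and the wording depend on whether the failure can resolve itself; the actual wording is up to the fix.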

 

After the fix:

The new message will look like this:
{code:java}
[2021-06-02 
13:52:33,762][ERROR][rebalance-#110%rebalancing.GridCacheRebalancingUnmarshallingFailedSelfTest1%][root]
 Rebalancing routine has failed, some partitions could be unavailable for 
reading [grp=cache, rebalanceId=1, topVer=AffinityTopologyVersion [topVer=2, 
minorTopVer=0], supplier=bf744bda-ba3d-4f48-8172-26d642000000, 
unavailablePartitions=[1-256, 768-1024]]
{code}

The new message adds rebalanceId and unavailablePartitions.
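
For illustration, a compact value such as unavailablePartitions=[1-256, 768-1024] can be produced by collapsing consecutive partition ids into ranges. The sketch below is an assumption about one way to do this, not the code from the fix: the class PartitionRanges and its methods are hypothetical, and the message omits topVer for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Hypothetical helper, not part of Ignite.
class PartitionRanges {
    /** Collapses sorted partition ids into ranges, e.g. {1, 2, 3, 7} becomes "[1-3, 7]". */
    static String compact(SortedSet<Integer> parts) {
        List<String> ranges = new ArrayList<>();
        Integer start = null;
        Integer prev = null;

        for (int p : parts) {
            if (start == null) {
                start = p;                        // first id opens the first range
            } else if (p != prev + 1) {
                ranges.add(range(start, prev));   // gap found: close the current range
                start = p;
            }
            prev = p;
        }

        if (start != null)
            ranges.add(range(start, prev));       // close the last open range

        return "[" + String.join(", ", ranges) + "]";
    }

    private static String range(int from, int to) {
        return from == to ? String.valueOf(from) : from + "-" + to;
    }

    /** Builds a message shaped like the one above (topVer left out in this sketch). */
    static String failureMessage(String grp, long rebalanceId, String supplier,
                                 SortedSet<Integer> unavailable) {
        return "Rebalancing routine has failed, some partitions could be unavailable for reading "
            + "[grp=" + grp + ", rebalanceId=" + rebalanceId + ", supplier=" + supplier
            + ", unavailablePartitions=" + compact(unavailable) + "]";
    }
}
```

Printing ranges instead of every id keeps the log line short even when hundreds of partitions are affected.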

  was:
Currently we can get a message like this when rebalance fails with an exception 
(examples are from Ignite 2.5; in newer versions the log messages were changed, 
but the problem still exists):
{code:java}
2019-11-27 13:41:14,504[WARN ][utility-#79%xxx%][GridDhtPartitionDemander] 
Rebalancing from node cancelled [grp=ignite-sys-cache, 
topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1], 
supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topic=0]. Supply message 
couldn't be unmarshalled: class o.a.i.IgniteCheckedException: Failed to 
unmarshal object with optimized marshaller
2019-11-27 13:41:14,504[INFO ][utility-#79%xxx%][GridDhtPartitionDemander] 
Cancelled rebalancing [grp=ignite-sys-cache, 
supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topVer=AffinityTopologyVersion 
[topVer=1932, minorTopVer=1], time=88 ms]
2019-11-27 13:41:14,508[WARN ][utility-#76%xxx%][GridDhtPartitionDemander] 
Rebalancing from node cancelled [grp=ignite-sys-cache, 
topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1], 
supplier=dfa5ee06-48c9-4458-ae55-48cc6ceda998, topic=0]. Supply message 
couldn't be unmarshalled: class o.a.i.IgniteCheckedException: Failed to 
unmarshal object with optimized marshaller
{code}
In the case above, a marshalling exception leads to a rebalance failure that 
will never be resolved, i.e. the cluster enters an erroneous state.

We should report issues like this as ERROR. The message should explain that the 
rebalance has failed, data for the cache was not fully copied to the node, the 
backup factor is not recovered and the cluster may not work correctly.


> Improve error message in case rebalance fails
> ---------------------------------------------
>
>                 Key: IGNITE-14474
>                 URL: https://issues.apache.org/jira/browse/IGNITE-14474
>             Project: Ignite
>          Issue Type: Improvement
>    Affects Versions: 2.5
>            Reporter: Denis Chudov
>            Assignee: Rodion Smolnikov
>            Priority: Major
>             Fix For: 2.9.2
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Currently we can get a message like this when rebalance fails with an 
> exception (examples are from Ignite 2.5; in newer versions the log messages 
> were changed, but the problem still exists):
> {code:java}
> 2019-11-27 13:41:14,504[WARN ][utility-#79%xxx%][GridDhtPartitionDemander] 
> Rebalancing from node cancelled [grp=ignite-sys-cache, 
> topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1], 
> supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topic=0]. Supply message 
> couldn't be unmarshalled: class o.a.i.IgniteCheckedException: Failed to 
> unmarshal object with optimized marshaller
> 2019-11-27 13:41:14,504[INFO ][utility-#79%xxx%][GridDhtPartitionDemander] 
> Cancelled rebalancing [grp=ignite-sys-cache, 
> supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topVer=AffinityTopologyVersion 
> [topVer=1932, minorTopVer=1], time=88 ms]
> 2019-11-27 13:41:14,508[WARN ][utility-#76%xxx%][GridDhtPartitionDemander] 
> Rebalancing from node cancelled [grp=ignite-sys-cache, 
> topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1], 
> supplier=dfa5ee06-48c9-4458-ae55-48cc6ceda998, topic=0]. Supply message 
> couldn't be unmarshalled: class o.a.i.IgniteCheckedException: Failed to 
> unmarshal object with optimized marshaller
> {code}
> In the case above, a marshalling exception leads to a rebalance failure that 
> will never be resolved, i.e. the cluster enters an erroneous state.
> We should report issues like this as ERROR. The message should explain that 
> the rebalance has failed, data for the cache was not fully copied to the 
> node, the backup factor is not recovered and the cluster may not work 
> correctly.
>  
> After the fix:
> The new message will look like this:
> {code:java}
> [2021-06-02 
> 13:52:33,762][ERROR][rebalance-#110%rebalancing.GridCacheRebalancingUnmarshallingFailedSelfTest1%][root]
>  Rebalancing routine has failed, some partitions could be unavailable for 
> reading [grp=cache, rebalanceId=1, topVer=AffinityTopologyVersion [topVer=2, 
> minorTopVer=0], supplier=bf744bda-ba3d-4f48-8172-26d642000000, 
> unavailablePartitions=[1-256, 768-1024]]
> {code}
> The new message adds rebalanceId and unavailablePartitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
