[jira] [Comment Edited] (IGNITE-13366) Special mode for maintenance of Ignite node. Employing Maintenance Mode for clearing corrupted PDS files.

Pavel Pereslegin (Jira) Thu, 15 Oct 2020 04:38:29 -0700


    [ https://issues.apache.org/jira/browse/IGNITE-13366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214625#comment-17214625 ]


Pavel Pereslegin edited comment on IGNITE-13366 at 10/15/20, 11:37 AM:
-----------------------------------------------------------------------

[~sergeychugunov],

could you please help me with the test failure that occurred after applying 
this patch?

The test verifies that we can restart the node during rebalancing.
{code:java}
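import org.apache.ignite.cluster.ClusterState;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;
import org.junit.Test;

public class RestartDuringRebalancingTest extends GridCommonAbstractTest {
    /** Enables native persistence for the default data region. */
    @Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
        return super.getConfiguration(igniteInstanceName).setDataStorageConfiguration(new DataStorageConfiguration()
            .setDefaultDataRegionConfiguration(new DataRegionConfiguration().setPersistenceEnabled(true)));
    }

    @Test
    public void testRestartDuringRebalancing() throws Exception {
        cleanPersistenceDir();

        startGrids(2);

        grid(0).cluster().state(ClusterState.ACTIVE);

        // Add a third node to the baseline, which starts rebalancing to it.
        startGrid(2);

        resetBaselineTopology();

        // Stop the cluster without waiting for rebalancing to finish.
        stopAllGrids();

        startGrids(3).cluster().state(ClusterState.ACTIVE);

        awaitPartitionMapExchange();
    }
}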
{code}
This test fails with the following exception:
{noformat}
class org.apache.ignite.IgniteCheckedException: Cache groups with potentially corrupted partition files found. To cleanup them maintenance is needed, node will enter maintenance mode on next restart. Cleanup cache group folders manually or trigger maintenance action to do that and restart the node. Corrupted files are located in subdirectories [cache-ignite-sys-cache] in a work dir /home/xtern/src/java/ignite/source/work/db/node02-a9790e24-4880-4d5b-aede-cb1e96308ad7
        at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1438)
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2096)
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1748)
        at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1143)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:641)
        at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:1229)
        at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:1150)
        at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:1126)
        at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:995)
        at org.apache.ignite.testframework.junits.GridAbstractTest.startGrids(GridAbstractTest.java:837)
        at org.apache.ignite.internal.processors.cache.persistence.RestartDuringRebalancingTest.testRestartDuringRebalancing(RestartDuringRebalancingTest.java:30)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at org.apache.ignite.testframework.junits.GridAbstractTest$7.run(GridAbstractTest.java:2373)
        at java.lang.Thread.run(Thread.java:748)
Caused by: class org.apache.ignite.IgniteException: Cache groups with potentially corrupted partition files found. To cleanup them maintenance is needed, node will enter maintenance mode on next restart. Cleanup cache group folders manually or trigger maintenance action to do that and restart the node. Corrupted files are located in subdirectories [cache-ignite-sys-cache] in a work dir /home/xtern/src/java/ignite/source/work/db/node02-a9790e24-4880-4d5b-aede-cb1e96308ad7
        at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.beginRecover(FilePageStoreManager.java:388)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.performBinaryMemoryRestore(GridCacheDatabaseSharedManager.java:1776)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restoreBinaryMemory(GridCacheDatabaseSharedManager.java:837)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.startMemoryRestore(GridCacheDatabaseSharedManager.java:1608)
        at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1282)
        ... 20 more
{noformat}
AFAIK, we clear all restored cache data on a baseline change (see the usages of 
GridCacheDatabaseSharedManager#cleanupRestoredCaches), so why can't we start the 
node in this case?

I don't know whether this is a bug or a feature. How can I fix this test?

 



> Special mode for maintenance of Ignite node. Employing Maintenance Mode for 
> clearing corrupted PDS files.
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-13366
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13366
>             Project: Ignite
>          Issue Type: New Feature
>          Components: persistence
>    Affects Versions: 2.8.1
>            Reporter: Sergey Chugunov
>            Assignee: Sergey Chugunov
>            Priority: Critical
>              Labels: IEP-53
>             Fix For: 2.10
>
>   Original Estimate: 168h
>          Time Spent: 1h 50m
>  Remaining Estimate: 166h 10m
>
> If a node with persistence is stopped while WAL is disabled for a cache (no 
> matter whether because of rebalancing in progress or by explicit user 
> request), then on the next node start all data files of that cache are 
> removed automatically and unconditionally.
> This behavior may be unexpected for users, as they may not understand all the 
> consequences of disabling WAL locally (for rebalancing) or globally (via an 
> IgniteCluster API call). It is also not smart enough, as there is no point in 
> deleting consistent data files.
> We should change this behavior as follows: no automatic deletions 
> whatsoever. If the data files are consistent (equivalent to: no checkpoint was 
> running when the node was stopped), start up normally. If the data files are 
> corrupted, don't let the node start.
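
The startup rule proposed in the description above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: StartupDecisionSketch, CacheGroupState, Decision and the "cache-a"/"cache-b" directory names are all hypothetical and do not exist in Ignite, and treating "WAL disabled and a checkpoint was running at stop" as the corruption condition is one reading of the description.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; none of these types exist in Ignite. */
public class StartupDecisionSketch {
    /** Possible outcomes of the startup check. Nothing is ever deleted automatically. */
    enum Decision { START_NORMALLY, REFUSE_START }

    /** Hypothetical snapshot of one cache group's state at the moment the node stopped. */
    static final class CacheGroupState {
        final String dirName;
        final boolean walDisabled;
        final boolean checkpointWasRunning;

        CacheGroupState(String dirName, boolean walDisabled, boolean checkpointWasRunning) {
            this.dirName = dirName;
            this.walDisabled = walDisabled;
            this.checkpointWasRunning = checkpointWasRunning;
        }

        /** Assumption: files are suspect only if WAL was off and a checkpoint was interrupted. */
        boolean potentiallyCorrupted() {
            return walDisabled && checkpointWasRunning;
        }
    }

    /** Collects the directories of cache groups whose files may be corrupted. */
    static List<String> corruptedDirs(List<CacheGroupState> groups) {
        List<String> dirs = new ArrayList<>();

        for (CacheGroupState g : groups) {
            if (g.potentiallyCorrupted())
                dirs.add(g.dirName);
        }

        return dirs;
    }

    /** Starts normally when every group is consistent, otherwise refuses to start. */
    static Decision decide(List<CacheGroupState> groups) {
        return corruptedDirs(groups).isEmpty() ? Decision.START_NORMALLY : Decision.REFUSE_START;
    }

    public static void main(String[] args) {
        List<CacheGroupState> groups = new ArrayList<>();
        groups.add(new CacheGroupState("cache-a", true, false)); // WAL off, but files are consistent.
        groups.add(new CacheGroupState("cache-b", true, true));  // WAL off and checkpoint interrupted.

        System.out.println(decide(groups) + " " + corruptedDirs(groups)); // prints "REFUSE_START [cache-b]"
    }
}
```

Under this reading, a cache group that was stopped with WAL disabled but with no checkpoint in flight keeps its files and does not block startup.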



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
