Aleksey Plekhanov created IGNITE-19111:
------------------------------------------

             Summary: Storage corruption if pages changed after last checkpoint during deactivation
                 Key: IGNITE-19111
                 URL: https://issues.apache.org/jira/browse/IGNITE-19111
             Project: Ignite
          Issue Type: Bug
            Reporter: Aleksey Plekhanov
            Assignee: Aleksey Plekhanov


During cluster deactivation we force a checkpoint (with the "caches stop" reason)
and remove the checkpoint listeners before the caches are actually stopped. But
if data pages on the node are modified after that checkpoint but before the
caches stop, and the next checkpoint is started, the storage can be corrupted.
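
To illustrate the window described above, here is a toy model in plain Java. It is not Ignite code: the class, the {{Listener}} interface, and the "root pointer" metadata are all hypothetical. It also assumes that listeners are what persist some metadata at checkpoint time, which is an assumption of this sketch and not a confirmed root cause. Under that assumption, a checkpoint that runs after the listeners are removed flushes changed pages but leaves the metadata stale.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only; none of these types are Ignite classes. */
public class CheckpointRaceSketch {
    /** Hypothetical callback invoked at the start of every checkpoint. */
    interface Listener {
        void onCheckpoint(Map<String, Integer> diskMeta);
    }

    final Map<Integer, String> memPages = new HashMap<>();
    final Map<Integer, String> diskPages = new HashMap<>();
    final Map<String, Integer> diskMeta = new HashMap<>();
    final List<Listener> listeners = new ArrayList<>();
    int rootPageId;

    /** Lets listeners persist metadata, then flushes all in-memory pages. */
    void checkpoint() {
        for (Listener l : listeners)
            l.onCheckpoint(diskMeta);

        diskPages.putAll(memPages);
    }

    /** True if the persisted root pointer refers to a page that is a root on disk. */
    boolean diskIsConsistent() {
        Integer root = diskMeta.get("root");

        return root != null && "root".equals(diskPages.get(root));
    }

    static boolean run(boolean changePagesAfterForcedCheckpoint) {
        CheckpointRaceSketch s = new CheckpointRaceSketch();

        // The listener is the only code that persists the root pointer.
        s.listeners.add(meta -> meta.put("root", s.rootPageId));

        s.memPages.put(1, "root");
        s.rootPageId = 1;

        s.checkpoint();      // Forced checkpoint during deactivation.
        s.listeners.clear(); // Listeners removed before the caches stop.

        if (changePagesAfterForcedCheckpoint) {
            s.memPages.put(2, "root"); // A new root page appears.
            s.memPages.put(1, "leaf"); // The old root is demoted.
            s.rootPageId = 2;

            s.checkpoint(); // Next checkpoint: pages are flushed, metadata is not.
        }

        return s.diskIsConsistent();
    }

    public static void main(String[] args) {
        System.out.println("quiet window:  " + run(false));
        System.out.println("late activity: " + run(true));
    }
}
```

In this model the on-disk state stays consistent when nothing changes after the forced checkpoint, and becomes inconsistent when pages change and another checkpoint runs without listeners.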

Reproducer:
{code:java}
    /** {@inheritDoc} */
    @Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
        return super.getConfiguration(igniteInstanceName)
            .setDataStorageConfiguration(new DataStorageConfiguration()
                .setDefaultDataRegionConfiguration(new DataRegionConfiguration().setPersistenceEnabled(true))
                .setCheckpointFrequency(1_000L))
            .setFailureHandler(new StopNodeFailureHandler());
    }

    /** */
    @Test
    public void testCpAfterClusterDeactivate() throws Exception {
        IgniteEx ignite0 = startGrid(0);
        IgniteEx ignite1 = startGrid(1);

        ignite0.cluster().state(ClusterState.ACTIVE);

        ignite0.getOrCreateCache(new CacheConfiguration<>(DEFAULT_CACHE_NAME).setBackups(1)
            .setAffinity(new RendezvousAffinityFunction(false, 10)));

        try (IgniteDataStreamer<Integer, Integer> streamer = ignite0.dataStreamer(DEFAULT_CACHE_NAME)) {
            for (int i = 0; i < 100_000; i++)
                streamer.addData(i, i);
        }

        stopGrid(0);

        try (IgniteDataStreamer<Integer, Integer> streamer = ignite1.dataStreamer(DEFAULT_CACHE_NAME)) {
            streamer.allowOverwrite(true);
            for (int i = 0; i < 100_000; i++)
                streamer.addData(i, i + 1);
        }

        ignite0 = startGrid(0);
        
        ((GridCacheDatabaseSharedManager)ignite0.context().cache().context().database())
            .addCheckpointListener(new CheckpointListener() {
            @Override public void onMarkCheckpointBegin(Context ctx) {
                // No-op.
            }

            @Override public void onCheckpointBegin(Context ctx) {
                if ("caches stop".equals(ctx.progress().reason()))
                    doSleep(1_000L);
            }

            @Override public void beforeCheckpointBegin(Context ctx) {
                // No-op.
            }
        });

        ignite0.cluster().state(ClusterState.INACTIVE);

        doSleep(2_000L);

        ignite0.cluster().state(ClusterState.ACTIVE);

        IgniteCache<Integer, Integer> cache = ignite0.cache(DEFAULT_CACHE_NAME);

        for (int i = 0; i < 100_000; i++)
            assertEquals((Integer)(i + 1), cache.get(i));
    }
{code}
With some probability (about 1 in 5 on my laptop), this reproducer shuts down
the node with {{CorruptedTreeException}}, either on activation or on the final
check.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
