Ilya Kasnacheev created IGNITE-7540:
---------------------------------------

             Summary: Sequential checkpoints cause overwrite of already cleaned & freed offheap page
                 Key: IGNITE-7540
                 URL: https://issues.apache.org/jira/browse/IGNITE-7540
             Project: Ignite
          Issue Type: Bug
          Components: persistence
    Affects Versions: 2.4
            Reporter: Ilya Kasnacheev
            Assignee: Alexey Goncharuk


The sequence of events is as follows:

In GridCacheProcessor.onExchangeDone(), 
sharedCtx.database().waitForCheckpoint("caches stop") is performed, and then 
the cache is destroyed and all its pages are freed and cleared asynchronously.

However, it is entirely possible that the next checkpoint starts immediately 
after waitForCheckpoint() returns. This is typical when a lot of data is being 
loaded into Ignite, which rapidly depletes the checkpoint buffer, and also when 
the checkpoint frequency is artificially increased, as in the reproducer.

Then the checkpointer saves (overwrites) the metadata page:
{code:java}
    at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlockPage(PageMemoryImpl.java:1330)
    at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:428)
    at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:422)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.saveStoreMetadata(GridCacheOffheapManager.java:375)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.onCheckpointBegin(GridCacheOffheapManager.java:163)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.markCheckpointBegin(GridCacheDatabaseSharedManager.java:2309)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.doCheckpoint(GridCacheDatabaseSharedManager.java:2088)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.body(GridCacheDatabaseSharedManager.java:2013)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
    at java.lang.Thread.run(Thread.java:748){code}
This happens after the cache has already been destroyed, and even after the 
page has already been zeroed by PageMemoryImpl$ClearSegmentRunnable.run().
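
The interleaving above can be replayed deterministically in a small sketch. 
Everything in it is hypothetical: the page size, the type marker value and the 
helper names are assumptions for illustration, not Ignite's real page layout or 
API, and a single thread stands in for both the checkpointer and the 
asynchronous clearing task.

```java
import java.nio.ByteBuffer;

// Single-threaded replay of the interleaving; all names and values are
// hypothetical stand-ins, not Ignite internals.
class StaleCheckpointWriteSketch {
    static final int PAGE_SIZE = 64;                // assumed page size
    static final int HYPOTHETICAL_PART_META = 0x7E; // assumed type marker

    // Stand-in for the checkpointer saving partition metadata into the page.
    static void saveStoreMetadata(ByteBuffer page, int partId) {
        page.putInt(0, HYPOTHETICAL_PART_META);
        page.putInt(4, partId);
    }

    // Stand-in for the asynchronous task that zeroes freed pages.
    static void clearPage(ByteBuffer page) {
        for (int i = 0; i < page.capacity(); i++)
            page.put(i, (byte) 0);
    }

    // True only if every byte of the page is zero.
    static boolean isZeroed(ByteBuffer page) {
        for (int i = 0; i < page.capacity(); i++)
            if (page.get(i) != 0)
                return false;
        return true;
    }

    // Replays: awaited checkpoint, cache destroy + clear, next checkpoint.
    static ByteBuffer replay() {
        ByteBuffer page = ByteBuffer.allocate(PAGE_SIZE);
        saveStoreMetadata(page, 7); // checkpoint awaited by waitForCheckpoint()
        clearPage(page);            // cache destroyed, page freed and zeroed
        saveStoreMetadata(page, 7); // next checkpoint writes via a stale reference
        return page;
    }

    public static void main(String[] args) {
        ByteBuffer page = replay();
        System.out.println("zeroed after replay: " + isZeroed(page));
    }
}
```

The point of the sketch is only the ordering: once the clearing step is allowed 
to run between two metadata writes, the freed page no longer satisfies the 
"zeroed" expectation of whoever acquires it next.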

Then a new cache is created, and in 
GridCacheOffheapManager$GridCacheDataStore.getOrAllocatePartitionMetas(), 
pageMem.acquirePage() returns this page, which is expected to be zeroed but 
actually contains metadata for the old cache's partition. The 
type == PageIO.T_PART_META check then returns true and the following exception 
is thrown, leading to cache state inconsistency and data loss:
{code:java}
Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)
    at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forVersion(IOVersions.java:83)
    at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forPage(IOVersions.java:95)
    at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.init(PagesList.java:175)
    at org.apache.ignite.internal.processors.cache.persistence.freelist.FreeListImpl.<init>(FreeListImpl.java:370)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore$1.<init>(GridCacheOffheapManager.java:932)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.init0(GridCacheOffheapManager.java:929)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:1295)
    at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:344)
    at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.storeValue(GridCacheMapEntry.java:3191)
    at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.initialValue(GridCacheMapEntry.java:2571)
    at org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$IsolatedUpdater.receive(DataStreamerImpl.java:2096)
    at org.apache.ignite.internal.processors.datastreamer.DataStreamerUpdateJob.call(DataStreamerUpdateJob.java:140)
    at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.localUpdate(DataStreamProcessor.java:397)
    at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.processRequest(DataStreamProcessor.java:302)
    at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.access$000(DataStreamProcessor.java:59)
    at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor$1.onMessage(DataStreamProcessor.java:89)
    ... 6 more{code}
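
To make the misfiring type check concrete, here is a minimal sketch of a 
"get or allocate" decision keyed on a page's type field. It is not the actual 
getOrAllocatePartitionMetas() code: the marker value, field offsets and method 
name are hypothetical, chosen only to show why a page that was expected to be 
zeroed gets treated as existing metadata.

```java
import java.nio.ByteBuffer;

// Hypothetical model of a "get or allocate" check keyed on the page type.
class StalePartMetaSketch {
    static final int HYPOTHETICAL_PART_META = 0x7E; // assumed type marker

    // Returns true if the page was initialized as new, false if its
    // existing content was trusted because the type marker matched.
    static boolean getOrAllocate(ByteBuffer page, int partId) {
        if (page.getInt(0) == HYPOTHETICAL_PART_META)
            return false; // looks like existing metadata, reused as-is
        page.putInt(0, HYPOTHETICAL_PART_META);
        page.putInt(4, partId);
        return true;
    }

    public static void main(String[] args) {
        ByteBuffer zeroed = ByteBuffer.allocate(64);
        ByteBuffer stale = ByteBuffer.allocate(64);
        stale.putInt(0, HYPOTHETICAL_PART_META); // left by the late checkpoint
        stale.putInt(4, 7);                       // old cache's partition

        System.out.println("zeroed page allocated: " + getOrAllocate(zeroed, 3));
        System.out.println("stale page allocated: " + getOrAllocate(stale, 3));
        System.out.println("partition seen by new cache: " + stale.getInt(4));
    }
}
```

In the sketch the stale page is never re-initialized, so the new cache keeps 
reading the old partition's fields. In the trace above the failure surfaces 
later, in PagesList.init() via IOVersions.forVersion(), presumably once that 
stale metadata is followed; this link is an inference from the trace and is not 
modeled here.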


