[
https://issues.apache.org/jira/browse/GEODE-654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vincent Ford reassigned GEODE-654:
----------------------------------
Assignee: Vincent Ford
> Interaction of GII and LIFO eviction can cause high CPU and recovery issues
> ---------------------------------------------------------------------------
>
> Key: GEODE-654
> URL: https://issues.apache.org/jira/browse/GEODE-654
> Project: Geode
> Issue Type: Bug
> Components: core
> Reporter: Vincent Ford
> Assignee: Vincent Ford
>
> This was seen in a customer engagement with GemFire but also effects Geode
> and the usage of LIFO eviction algorithm. The GemFire support team led by
> David Wisler identified the following pattern. In the specific case the GII
> was blocked from completing processing and also blocked a put into the
> processing queue of a Gateway ( not part of Geode but would effect any usage
> of LIFO eviction ).
> (** 1 **)
> "Pooled High Priority Message Processor 4" daemon prio=10
> tid=0x00002ba27c008800 nid=0x3fc5 waiting for monitor entry
> [0x00002ba1894f2000]
> java.lang.Thread.State: BLOCKED (on object monitor)
> at
> com.gemstone.gemfire.internal.cache.InitialImageOperation$RequestImageMessage.chunkEntries(InitialImageOperation.java:1555)
> - waiting to lock <0x00000005b81413d0> (a
> com.gemstone.gemfire.internal.cache.VersionedStatsRegionEntry)
> 1554 synchronized(mapEntry) { // bug #46042 must sync to make sure the tag
> goes with the value
> 1555 VersionSource<?> id = stamp.getMemberID();
> 1556 if (id == null) { id = myId; }
> 1557 foundIds.add(id);
> 1558 // if the recipient passed a version vector, use it to
> filter out
> 1559 // entries the recipient already has
> (** 2 **)
> "pool-26-thread-4" prio=10 tid=0x00002ba390042800 nid=0x4073 waiting for
> monitor entry [0x00002ba18b4d0000]
> java.lang.Thread.State: BLOCKED (on object monitor)
> at
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:205)
> - waiting to lock <0x00000005b7624058> (a
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)
> at
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.queuePrimaryEvent(SerialGatewaySenderEventProcessor.java:440)
> at
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.enqueueEvent(SerialGatewaySenderEventProcessor.java:416)
> at
> com.gemstone.gemfire.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:964)
> at
> com.gemstone.gemfire.internal.cache.LocalRegion.notifyGatewayHubs(LocalRegion.java:6295)
> at
> com.gemstone.gemfire.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5858)
> at
> com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3203)
> - locked <0x00000005b81413d0> (a
> com.gemstone.gemfire.internal.cache.VersionedStatsRegionEntry)
> Below it is clear that we are not getting out of the eviction loop, for some
> reason the method "evictEntry" is returning 0 as the amount of evicted bytes
> and that's why following thread keeps swapping between BLOCKED and RUNNABLE
> and making no progress.
> [2015-09-08 12:39:09]
> "pool-26-thread-2" prio=10 tid=0x00002ba390005000 nid=0x406f waiting for
> monitor entry [0x00002ba18b34b000]
> java.lang.Thread.State: BLOCKED (on object monitor)
> at
> com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.evictEntry(AbstractLRURegionMap.java:302)
> - locked <0x00000005b9b7c8c8> (a
> com.gemstone.gemfire.internal.cache.VersionedThinDiskLRURegionEntry)
> at
> com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.lruUpdateCallback(AbstractLRURegionMap.java:522)
> at
> com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3272)
> at
> com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.basicPut(AbstractLRURegionMap.java:45)
> at
> com.gemstone.gemfire.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5669)
> at
> com.gemstone.gemfire.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:375)
> at
> com.gemstone.gemfire.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:118)
> at
> com.gemstone.gemfire.internal.cache.LocalRegion.basicPut(LocalRegion.java:5050)
> at
> com.gemstone.gemfire.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1747)
> at
> com.gemstone.gemfire.internal.cache.LocalRegion.put(LocalRegion.java:1729)
> at
> com.gemstone.gemfire.internal.cache.AbstractRegion.put(AbstractRegion.java:286)
> at
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.putAndGetKey(SerialGatewaySenderQueue.java:220)
> at
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:212)
> - locked <0x00000005b7624058> (a
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)
> [2015-09-08 12:40:14]
> "pool-26-thread-2" prio=10 tid=0x00002ba390005000 nid=0x406f runnable
> [0x00002ba18b34b000]
> java.lang.Thread.State: RUNNABLE
> at
> com.gemstone.gemfire.internal.cache.lru.NewLIFOClockHand.getLRUEntry(NewLIFOClockHand.java:40)
> - locked <0x00000005b8c55c90> (a
> com.gemstone.gemfire.internal.cache.lru.NewLRUClockHand$HeadLock)
> at
> com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.lruUpdateCallback(AbstractLRURegionMap.java:520)
> at
> com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3272)
> at
> com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.basicPut(AbstractLRURegionMap.java:45)
> at
> com.gemstone.gemfire.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5669)
> at
> com.gemstone.gemfire.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:375)
> at
> com.gemstone.gemfire.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:118)
> at
> com.gemstone.gemfire.internal.cache.LocalRegion.basicPut(LocalRegion.java:5050)
> at
> com.gemstone.gemfire.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1747)
> at
> com.gemstone.gemfire.internal.cache.LocalRegion.put(LocalRegion.java:1729)
> at
> com.gemstone.gemfire.internal.cache.AbstractRegion.put(AbstractRegion.java:286)
> at
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.putAndGetKey(SerialGatewaySenderQueue.java:220)
> at
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:212)
> - locked <0x00000005b7624058> (a
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)
> From the AbstractLRURegionMap...
> 527 while (_getCCHelper().mustEvict(stats, _getOwner(),
> bytesToEvict)) {
> 528 LRUEntry removalEntry =
> (LRUEntry)_getLruList().getLRUEntry();
> 529 if (removalEntry != null) {
> 530 if (evictEntry(removalEntry, stats) != 0) {
> 531 if (debug) {
> 532 debugLogging("evicted entry
> key(2)=" + removalEntry.getKey()
> 533 + " total entry size is
> now: " + getTotalEntrySize()
> 534 + " bytesToEvict :" +
> bytesToEvict);
> 535 // if (NewLRUClockHand.debug &&
> logWriter != null) {
> 536 // _getLruList().dumpList(logWriter);
> 537 // }
> 538 }
> 539 stats.incEvictions();
> 540 if (_isOwnerALocalRegion()) {
> 541 if (_getOwner() instanceof
> BucketRegion) {
> 542
> ((BucketRegion)_getOwner()).incEvictions(1);
> 543 }
> 544 }
> 545 if (debug)
> 546 debugLogging("evictions=" +
> stats.getEvictions());
> 547 _getCCHelper().afterEviction();
> 548
> 549 }
> 550 }
> 551
> 552 else {
> 553 if (debug && getTotalEntrySize() != 0) {
> 554 debugLogging("leaving evict loop
> early");
> 555 }
> 556
> 557 break;
> 558 }
> 559 }
> Summary---
> 1. Serial gateway sender queues use LIFO_MEMORY as their eviction algorithm
> and others may choose this option as well
> 2. Due to the above, when we mustEvict, we call getLRUEntry implementation of
> NewLIFEClockHand
> 3. Unfortunately, this LIFO based implementation doesn't seem to work in all
> cases during gII. This implementation always returns simply the last entry in
> the queue without directly removing it. It is possible to have the last
> element on the list (returned by getLRUEntry for LIFO_MEMORY) be in a state
> that causes us to not evict any bytes, but then just continue looping forever
> expecting state to change.
> 4. This appears to be a corner case race condition and will be difficult to
> test or reproduce given the timings and data distribution required to
> encounter the issue. For example to encounter the distributed system created
> by Geode must be in GII between members, have entry in queue that can not be
> evicted ( part of transaction, certain internal tokens...), queue is in state
> requiring eviction.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)