Vincent Ford created GEODE-654:
----------------------------------

             Summary: Interaction of GII and LIFO eviction can cause high CPU 
and recovery issues
                 Key: GEODE-654
                 URL: https://issues.apache.org/jira/browse/GEODE-654
             Project: Geode
          Issue Type: Bug
          Components: core
            Reporter: Vincent Ford


This was seen in a customer engagement with GemFire but also effects Geode and 
the usage of  LIFO eviction algorithm.  The GemFire support team led by David 
Wisler identified the following pattern. In the specific case the GII was 
blocked from completing processing and also blocked a put into the processing 
queue of a Gateway ( not part of Geode but would effect any usage of LIFO 
eviction ).

(** 1 **)
"Pooled High Priority Message Processor 4" daemon prio=10 
tid=0x00002ba27c008800 nid=0x3fc5 waiting for monitor entry [0x00002ba1894f2000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at 
com.gemstone.gemfire.internal.cache.InitialImageOperation$RequestImageMessage.chunkEntries(InitialImageOperation.java:1555)
        - waiting to lock <0x00000005b81413d0> (a 
com.gemstone.gemfire.internal.cache.VersionedStatsRegionEntry)

1554    synchronized(mapEntry) { // bug #46042 must sync to make sure the tag 
goes with the value
1555            VersionSource<?> id = stamp.getMemberID();
1556            if (id == null) { id = myId; }
1557                    foundIds.add(id);
1558                    // if the recipient passed a version vector, use it to 
filter out
1559                    // entries the recipient already has

(** 2 **)
"pool-26-thread-4" prio=10 tid=0x00002ba390042800 nid=0x4073 waiting for 
monitor entry [0x00002ba18b4d0000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at 
com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:205)
        - waiting to lock <0x00000005b7624058> (a 
com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)
        at 
com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.queuePrimaryEvent(SerialGatewaySenderEventProcessor.java:440)
        at 
com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.enqueueEvent(SerialGatewaySenderEventProcessor.java:416)
        at 
com.gemstone.gemfire.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:964)
        at 
com.gemstone.gemfire.internal.cache.LocalRegion.notifyGatewayHubs(LocalRegion.java:6295)
        at 
com.gemstone.gemfire.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5858)
        at 
com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3203)
        - locked <0x00000005b81413d0> (a 
com.gemstone.gemfire.internal.cache.VersionedStatsRegionEntry)

Below it is clear that we are not getting out of the eviction loop, for some 
reason the method "evictEntry" is returning 0 as the amount of evicted bytes 
and that's why following  thread keeps swapping between BLOCKED and RUNNABLE 
and making no progress.

[2015-09-08 12:39:09]
"pool-26-thread-2" prio=10 tid=0x00002ba390005000 nid=0x406f waiting for 
monitor entry [0x00002ba18b34b000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at 
com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.evictEntry(AbstractLRURegionMap.java:302)
        - locked <0x00000005b9b7c8c8> (a 
com.gemstone.gemfire.internal.cache.VersionedThinDiskLRURegionEntry)
        at 
com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.lruUpdateCallback(AbstractLRURegionMap.java:522)
        at 
com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3272)
        at 
com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.basicPut(AbstractLRURegionMap.java:45)
        at 
com.gemstone.gemfire.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5669)
        at 
com.gemstone.gemfire.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:375)
        at 
com.gemstone.gemfire.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:118)
        at 
com.gemstone.gemfire.internal.cache.LocalRegion.basicPut(LocalRegion.java:5050)
        at 
com.gemstone.gemfire.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1747)
        at 
com.gemstone.gemfire.internal.cache.LocalRegion.put(LocalRegion.java:1729)
        at 
com.gemstone.gemfire.internal.cache.AbstractRegion.put(AbstractRegion.java:286)
        at 
com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.putAndGetKey(SerialGatewaySenderQueue.java:220)
        at 
com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:212)
        - locked <0x00000005b7624058> (a 
com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)

[2015-09-08 12:40:14]
"pool-26-thread-2" prio=10 tid=0x00002ba390005000 nid=0x406f runnable 
[0x00002ba18b34b000]
   java.lang.Thread.State: RUNNABLE
        at 
com.gemstone.gemfire.internal.cache.lru.NewLIFOClockHand.getLRUEntry(NewLIFOClockHand.java:40)
        - locked <0x00000005b8c55c90> (a 
com.gemstone.gemfire.internal.cache.lru.NewLRUClockHand$HeadLock)
        at 
com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.lruUpdateCallback(AbstractLRURegionMap.java:520)
        at 
com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3272)
        at 
com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.basicPut(AbstractLRURegionMap.java:45)
        at 
com.gemstone.gemfire.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5669)
        at 
com.gemstone.gemfire.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:375)
        at 
com.gemstone.gemfire.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:118)
        at 
com.gemstone.gemfire.internal.cache.LocalRegion.basicPut(LocalRegion.java:5050)
        at 
com.gemstone.gemfire.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1747)
        at 
com.gemstone.gemfire.internal.cache.LocalRegion.put(LocalRegion.java:1729)
        at 
com.gemstone.gemfire.internal.cache.AbstractRegion.put(AbstractRegion.java:286)
        at 
com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.putAndGetKey(SerialGatewaySenderQueue.java:220)
        at 
com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:212)
        - locked <0x00000005b7624058> (a 
com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)

>From the AbstractLRURegionMap...
527             while (_getCCHelper().mustEvict(stats, _getOwner(), 
bytesToEvict)) {
528                     LRUEntry removalEntry = 
(LRUEntry)_getLruList().getLRUEntry();
529                     if (removalEntry != null) {
530                             if (evictEntry(removalEntry, stats) != 0) {
531                                     if (debug) {
532                                             debugLogging("evicted entry 
key(2)=" + removalEntry.getKey()
533                                                     + " total entry size is 
now: " + getTotalEntrySize()
534                                                     + " bytesToEvict :" + 
bytesToEvict);
535                     //                if (NewLRUClockHand.debug && 
logWriter != null) {
536                     //                  _getLruList().dumpList(logWriter);
537                     //                }
538                                     }
539                                     stats.incEvictions();
540                                     if (_isOwnerALocalRegion()) {
541                                             if (_getOwner() instanceof 
BucketRegion) {
542                                                     
((BucketRegion)_getOwner()).incEvictions(1);
543                                             }
544                                     }
545                                     if (debug)
546                                             debugLogging("evictions=" + 
stats.getEvictions());
547                                             _getCCHelper().afterEviction();
548
549                             }
550                     }
551
552                     else {
553                             if (debug && getTotalEntrySize() != 0) {
554                                     debugLogging("leaving evict loop 
early");
555                             }
556
557                             break;
558                     }
559             }

Summary---
1. Serial gateway sender queues use LIFO_MEMORY as their eviction algorithm and 
others may choose this option as well
2. Due to the above, when we mustEvict, we call getLRUEntry implementation of 
NewLIFEClockHand 
3. Unfortunately, this LIFO based implementation doesn't seem to work in all 
cases during gII. This implementation always returns simply the last entry in 
the queue without directly removing it. It is possible to have the last element 
on the list (returned by getLRUEntry for LIFO_MEMORY) be in a state that causes 
us to not evict any bytes, but then just continue looping forever expecting 
state to change.
4. This appears to be a corner case race condition and will be difficult to 
test or reproduce given the timings and data distribution required to encounter 
the issue. For example to encounter the distributed system created by Geode 
must be in GII between members, have entry in queue that can not be evicted ( 
part of transaction, certain internal tokens...), queue is in state requiring 
eviction. 







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to