[ 
https://issues.apache.org/jira/browse/GEODE-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053859#comment-15053859
 ] 

ASF subversion and git services commented on GEODE-654:
-------------------------------------------------------

Commit f7670e1688a95a026a0376724bfe2277cb16eb94 in incubator-geode's branch 
refs/heads/develop from Vince Ford
[ https://git-wip-us.apache.org/repos/asf?p=incubator-geode.git;h=f7670e1 ]

GEODE-654: Changes LIFO queue to remove touched entry

Removes any entry that has been touched while processing
the LIFO queue. Prevents issue when eviction gets stuck
during startup and GII and an entry can not be removed.


> Interaction of GII and LIFO eviction can cause high CPU and recovery issues
> ---------------------------------------------------------------------------
>
>                 Key: GEODE-654
>                 URL: https://issues.apache.org/jira/browse/GEODE-654
>             Project: Geode
>          Issue Type: Bug
>          Components: core
>            Reporter: Vincent Ford
>            Assignee: Vincent Ford
>
> This was seen in a customer engagement with GemFire but also effects Geode 
> and the usage of  LIFO eviction algorithm.  The GemFire support team led by 
> David Wisler identified the following pattern. In the specific case the GII 
> was blocked from completing processing and also blocked a put into the 
> processing queue of a Gateway ( not part of Geode but would effect any usage 
> of LIFO eviction ).
> (** 1 **)
> "Pooled High Priority Message Processor 4" daemon prio=10 
> tid=0x00002ba27c008800 nid=0x3fc5 waiting for monitor entry 
> [0x00002ba1894f2000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>       at 
> com.gemstone.gemfire.internal.cache.InitialImageOperation$RequestImageMessage.chunkEntries(InitialImageOperation.java:1555)
>       - waiting to lock <0x00000005b81413d0> (a 
> com.gemstone.gemfire.internal.cache.VersionedStatsRegionEntry)
> 1554  synchronized(mapEntry) { // bug #46042 must sync to make sure the tag 
> goes with the value
> 1555          VersionSource<?> id = stamp.getMemberID();
> 1556          if (id == null) { id = myId; }
> 1557                  foundIds.add(id);
> 1558                  // if the recipient passed a version vector, use it to 
> filter out
> 1559                  // entries the recipient already has
> (** 2 **)
> "pool-26-thread-4" prio=10 tid=0x00002ba390042800 nid=0x4073 waiting for 
> monitor entry [0x00002ba18b4d0000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>       at 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:205)
>       - waiting to lock <0x00000005b7624058> (a 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)
>       at 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.queuePrimaryEvent(SerialGatewaySenderEventProcessor.java:440)
>       at 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.enqueueEvent(SerialGatewaySenderEventProcessor.java:416)
>       at 
> com.gemstone.gemfire.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:964)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.notifyGatewayHubs(LocalRegion.java:6295)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5858)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3203)
>       - locked <0x00000005b81413d0> (a 
> com.gemstone.gemfire.internal.cache.VersionedStatsRegionEntry)
> Below it is clear that we are not getting out of the eviction loop, for some 
> reason the method "evictEntry" is returning 0 as the amount of evicted bytes 
> and that's why following  thread keeps swapping between BLOCKED and RUNNABLE 
> and making no progress.
> [2015-09-08 12:39:09]
> "pool-26-thread-2" prio=10 tid=0x00002ba390005000 nid=0x406f waiting for 
> monitor entry [0x00002ba18b34b000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.evictEntry(AbstractLRURegionMap.java:302)
>       - locked <0x00000005b9b7c8c8> (a 
> com.gemstone.gemfire.internal.cache.VersionedThinDiskLRURegionEntry)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.lruUpdateCallback(AbstractLRURegionMap.java:522)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3272)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.basicPut(AbstractLRURegionMap.java:45)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5669)
>       at 
> com.gemstone.gemfire.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:375)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:118)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.basicPut(LocalRegion.java:5050)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1747)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.put(LocalRegion.java:1729)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractRegion.put(AbstractRegion.java:286)
>       at 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.putAndGetKey(SerialGatewaySenderQueue.java:220)
>       at 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:212)
>       - locked <0x00000005b7624058> (a 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)
> [2015-09-08 12:40:14]
> "pool-26-thread-2" prio=10 tid=0x00002ba390005000 nid=0x406f runnable 
> [0x00002ba18b34b000]
>    java.lang.Thread.State: RUNNABLE
>       at 
> com.gemstone.gemfire.internal.cache.lru.NewLIFOClockHand.getLRUEntry(NewLIFOClockHand.java:40)
>       - locked <0x00000005b8c55c90> (a 
> com.gemstone.gemfire.internal.cache.lru.NewLRUClockHand$HeadLock)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.lruUpdateCallback(AbstractLRURegionMap.java:520)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3272)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.basicPut(AbstractLRURegionMap.java:45)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5669)
>       at 
> com.gemstone.gemfire.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:375)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:118)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.basicPut(LocalRegion.java:5050)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1747)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.put(LocalRegion.java:1729)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractRegion.put(AbstractRegion.java:286)
>       at 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.putAndGetKey(SerialGatewaySenderQueue.java:220)
>       at 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:212)
>       - locked <0x00000005b7624058> (a 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)
> From the AbstractLRURegionMap...
> 527           while (_getCCHelper().mustEvict(stats, _getOwner(), 
> bytesToEvict)) {
> 528                   LRUEntry removalEntry = 
> (LRUEntry)_getLruList().getLRUEntry();
> 529                   if (removalEntry != null) {
> 530                           if (evictEntry(removalEntry, stats) != 0) {
> 531                                   if (debug) {
> 532                                           debugLogging("evicted entry 
> key(2)=" + removalEntry.getKey()
> 533                                                   + " total entry size is 
> now: " + getTotalEntrySize()
> 534                                                   + " bytesToEvict :" + 
> bytesToEvict);
> 535                   //                if (NewLRUClockHand.debug && 
> logWriter != null) {
> 536                   //                  _getLruList().dumpList(logWriter);
> 537                   //                }
> 538                                   }
> 539                                   stats.incEvictions();
> 540                                   if (_isOwnerALocalRegion()) {
> 541                                           if (_getOwner() instanceof 
> BucketRegion) {
> 542                                                   
> ((BucketRegion)_getOwner()).incEvictions(1);
> 543                                           }
> 544                                   }
> 545                                   if (debug)
> 546                                           debugLogging("evictions=" + 
> stats.getEvictions());
> 547                                           _getCCHelper().afterEviction();
> 548
> 549                           }
> 550                   }
> 551
> 552                   else {
> 553                           if (debug && getTotalEntrySize() != 0) {
> 554                                   debugLogging("leaving evict loop 
> early");
> 555                           }
> 556
> 557                           break;
> 558                   }
> 559           }
> Summary---
> 1. Serial gateway sender queues use LIFO_MEMORY as their eviction algorithm 
> and others may choose this option as well
> 2. Due to the above, when we mustEvict, we call getLRUEntry implementation of 
> NewLIFEClockHand 
> 3. Unfortunately, this LIFO based implementation doesn't seem to work in all 
> cases during gII. This implementation always returns simply the last entry in 
> the queue without directly removing it. It is possible to have the last 
> element on the list (returned by getLRUEntry for LIFO_MEMORY) be in a state 
> that causes us to not evict any bytes, but then just continue looping forever 
> expecting state to change.
> 4. This appears to be a corner case race condition and will be difficult to 
> test or reproduce given the timings and data distribution required to 
> encounter the issue. For example to encounter the distributed system created 
> by Geode must be in GII between members, have entry in queue that can not be 
> evicted ( part of transaction, certain internal tokens...), queue is in state 
> requiring eviction. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to