[jira] [Commented] (GEODE-4051) Two server jvms crashed at same time and caused some primary and redundant buckets to be cleared. Causing some buckets to get locked and not able to recover also after bouncing all servers

ASF GitHub Bot (JIRA) Sun, 17 Dec 2017 04:19:30 -0800

    [ 
https://issues.apache.org/jira/browse/GEODE-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294121#comment-16294121
 ]


ASF GitHub Bot commented on GEODE-4051:
---------------------------------------

igorbarc commented on issue #1160: GEODE-4051 Add try catch to catch timeout exception
URL: https://github.com/apache/geode/pull/1160#issuecomment-352251784
 
 
   @kirklund Originally, the issue reproduced under high client load on the 
JVMs while bouncing JVMs, on a system with network issues between hosts that 
caused timeouts.
   
   To reproduce easily:
   1) Add a hardcoded throw of the timeout exception in waitForCurrentOperations.
   2) Start the grid with replication of buckets, and assign buckets to one of 
the regions.
   3) Bounce one of the JVMs.
   
   Result without the fix: **Uncaught** exception processing StateMarkerMessage
   
   org.apache.geode.GemFireIOException: Current operations did not distribute within 5008 milliseconds
           at org.apache.geode.distributed.internal.DistributionAdvisor.waitForCurrentOperations(DistributionAdvisor.java:833)
           at org.apache.geode.distributed.internal.DistributionAdvisor.waitForCurrentOperations(DistributionAdvisor.java:785)
           at org.apache.geode.internal.cache.StateFlushOperation$StateMarkerMessage.waitForCurrentOperations(StateFlushOperation.java:457)
           at org.apache.geode.internal.cache.StateFlushOperation$StateMarkerMessage.process(StateFlushOperation.java:362)
           at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374)
           at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:440)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
           at org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:662)
           at org.apache.geode.distributed.internal.DistributionManager$5$1.run(DistributionManager.java:958)
           at java.lang.Thread.run(Thread.java:748)
   
   The flush marker message is not sent, and the count of buckets without 
redundancy does not return to zero, even after a rebalance.
   
   
   Result with the fix: the log shows
   **Exception caught** while determining channel state
   org.apache.geode.GemFireIOException: Current operations did not distribute within 5048 milliseconds
           at org.apache.geode.distributed.internal.DistributionAdvisor.waitForCurrentOperations(DistributionAdvisor.java:833)
           at org.apache.geode.distributed.internal.DistributionAdvisor.waitForCurrentOperations(DistributionAdvisor.java:785)
           at org.apache.geode.internal.cache.StateFlushOperation$StateMarkerMessage.waitForCurrentOperations(StateFlushOperation.java:487)
           at org.apache.geode.internal.cache.StateFlushOperation$StateMarkerMessage.process(StateFlushOperation.java:362)
           at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374)
           at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:440)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
           at org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:662)
           at org.apache.geode.distributed.internal.DistributionManager$5$1.run(DistributionManager.java:958)
           at java.lang.Thread.run(Thread.java:748)
   
   The count of buckets without redundancy returns to zero after the JVM starts.
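
   The difference between the two results can be illustrated with a minimal, self-contained sketch. It does not use Geode classes: `StateMarkerSketch`, `sendReply`, `sentReplies`, and the two `process...` methods are hypothetical names, and the simulated `waitForCurrentOperations` always throws, mirroring reproduction step 1. It also assumes that the fix lets the reply to the flush requester go out after the exception is caught; the comment above only states that without the fix the flush marker message is not sent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; this is NOT Geode's StateFlushOperation.
class StateMarkerSketch {

    /** Records the replies "sent" back to the member that requested the flush. */
    static final List<String> sentReplies = new ArrayList<>();

    /** Simulates reproduction step 1: the wait always fails with a timeout. */
    static void waitForCurrentOperations() {
        throw new IllegalStateException("simulated timeout: current operations did not distribute");
    }

    /** Hypothetical reply step; in this sketch it only records the requester. */
    static void sendReply(String requester) {
        sentReplies.add(requester);
    }

    /** Without the fix: the exception escapes and the reply is never sent. */
    static void processWithoutFix(String requester) {
        waitForCurrentOperations();
        sendReply(requester);
    }

    /** With the fix (as assumed here): log the timeout, then reply anyway. */
    static void processWithFix(String requester) {
        try {
            waitForCurrentOperations();
        } catch (RuntimeException e) {
            System.err.println("Exception caught while determining channel state: " + e.getMessage());
        } finally {
            sendReply(requester);
        }
    }

    public static void main(String[] args) {
        try {
            processWithoutFix("member-A");
        } catch (RuntimeException e) {
            // In the reported scenario nothing catches this, so the requester keeps waiting.
        }
        System.out.println("replies sent without fix: " + sentReplies.size());

        processWithFix("member-A");
        System.out.println("replies sent with fix: " + sentReplies.size());
    }
}
```

   In this sketch the requester corresponds to the thread blocked in ReplyProcessor21.waitForReplies in the issue description: it can only proceed once a reply arrives, which is why the reply is placed in the finally block here.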
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Two server jvms crashed at same time and caused some primary and redundant 
> buckets to be cleared. Causing some buckets to get locked and not able to 
> recover also after bouncing all servers
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-4051
>                 URL: https://issues.apache.org/jira/browse/GEODE-4051
>             Project: Geode
>          Issue Type: Bug
>          Components: regions
>    Affects Versions: 1.2.0
>            Reporter: Igor Barchak
>
> "Pooled Waiting Message Processor 5" tid=0x162
>     java.lang.Thread.State: TIMED_WAITING
>         at sun.misc.Unsafe.park(Native Method)
>         -  waiting on java.util.concurrent.CountDownLatch$Sync@1993a5
>         at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>         at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
>         at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:64)
>         at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:715)
>         at org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:644)
>         at org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:624)
>         at org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:519)
>         at org.apache.geode.internal.cache.StateFlushOperation.flush(StateFlushOperation.java:243)
>         at org.apache.geode.internal.cache.InitialImageOperation.getFromOne(InitialImageOperation.java:349)
>         at org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1168)
>         at org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1023)
>         at org.apache.geode.internal.cache.BucketRegion.initialize(BucketRegion.java:253)
>         at org.apache.geode.internal.cache.LocalRegion.createSubregion(LocalRegion.java:962)
>         at org.apache.geode.internal.cache.PartitionedRegionDataStore.createBucketRegion(PartitionedRegionDataStore.java:726)
>         at org.apache.geode.internal.cache.PartitionedRegionDataStore.grabFreeBucket(PartitionedRegionDataStore.java:414)
>         -  locked org.apache.geode.internal.cache.ProxyBucketRegion@6820a0b6
>         at org.apache.geode.internal.cache.PartitionedRegionDataStore.grabFreeBucketRecursively(PartitionedRegionDataStore.java:272)
>         at org.apache.geode.internal.cache.PartitionedRegionDataStore.grabBucket(PartitionedRegionDataStore.java:2815)
>         at org.apache.geode.internal.cache.partitioned.ManageBackupBucketMessage.operateOnPartitionedRegion(ManageBackupBucketMessage.java:148)
>         at org.apache.geode.internal.cache.partitioned.PartitionMessage.process(PartitionMessage.java:332)
>
> It seems this was introduced in this fix:
> https://github.com/apache/geode/commit/3a1062e245b3ded52ea3f6b6de0aff94ce846fa3?diff=split
> See StateMarkerMessage.process.
> The first if branch doesn't have a finally block.
> The else branch has a finally block.
> The first if branch didn't have a 'waitFor' operation earlier; it was 
> introduced in this commit.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
