[
https://issues.apache.org/jira/browse/GEODE-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294121#comment-16294121
]
ASF GitHub Bot commented on GEODE-4051:
---------------------------------------
igorbarc commented on issue #1160: GEODE-4051 Add try catch to catch timeout
exception
URL: https://github.com/apache/geode/pull/1160#issuecomment-352251784
@kirklund Originally, the issue reproduced under high client load on the JVMs
while bouncing JVMs, on a system with some network issues between hosts that
caused timeouts.
To reproduce easily:
1) Add a hardcoded throw of a timeout exception in waitForCurrentOperations
2) Start the grid with replication of buckets, and assign buckets to one of
the regions
3) Bounce one of the JVMs.
Result without fix : **Uncaught** exception processing StateMarkerMessage
org.apache.geode.GemFireIOException: Current operations did not distribute within 5008 milliseconds
    at org.apache.geode.distributed.internal.DistributionAdvisor.waitForCurrentOperations(DistributionAdvisor.java:833)
    at org.apache.geode.distributed.internal.DistributionAdvisor.waitForCurrentOperations(DistributionAdvisor.java:785)
    at org.apache.geode.internal.cache.StateFlushOperation$StateMarkerMessage.waitForCurrentOperations(StateFlushOperation.java:457)
    at org.apache.geode.internal.cache.StateFlushOperation$StateMarkerMessage.process(StateFlushOperation.java:362)
    at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374)
    at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:440)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:662)
    at org.apache.geode.distributed.internal.DistributionManager$5$1.run(DistributionManager.java:958)
    at java.lang.Thread.run(Thread.java:748)
The flush marker message will not be sent, and the number of buckets without
redundancy will not return to zero, even after a rebalance.
Result with fix (in log): **Exception caught** while determining channel state
org.apache.geode.GemFireIOException: Current operations did not distribute within 5048 milliseconds
    at org.apache.geode.distributed.internal.DistributionAdvisor.waitForCurrentOperations(DistributionAdvisor.java:833)
    at org.apache.geode.distributed.internal.DistributionAdvisor.waitForCurrentOperations(DistributionAdvisor.java:785)
    at org.apache.geode.internal.cache.StateFlushOperation$StateMarkerMessage.waitForCurrentOperations(StateFlushOperation.java:487)
    at org.apache.geode.internal.cache.StateFlushOperation$StateMarkerMessage.process(StateFlushOperation.java:362)
    at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374)
    at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:440)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:662)
    at org.apache.geode.distributed.internal.DistributionManager$5$1.run(DistributionManager.java:958)
    at java.lang.Thread.run(Thread.java:748)
The number of buckets without redundancy returns to zero after the JVM starts.
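
The difference between the two results can be illustrated with a minimal, self-contained sketch. This is not the Geode code: `StateMarkerSketch`, `OperationTimeoutException`, `Outcome`, `processWithoutFix`, `processWithFix` and `run` are hypothetical names, the wait step only simulates the hardcoded throw from reproduction step 1, and sending the reply is modeled as setting a flag. It assumes the reply to the flush requester is sent after the wait, so a timeout that escapes the handler means the reply never goes out.

```java
import java.util.ArrayList;
import java.util.List;

public class StateMarkerSketch {

  /** Hypothetical stand-in for the timeout exception seen in the traces. */
  static class OperationTimeoutException extends RuntimeException {
    OperationTimeoutException(String message) {
      super(message);
    }
  }

  /** Records what a handler did so the two variants can be compared. */
  static final class Outcome {
    final List<String> log = new ArrayList<>();
    boolean replySent;
  }

  /** Simulated wait step; throws when asked to, like the hardcoded throw in step 1. */
  static void waitForCurrentOperations(boolean simulateTimeout) {
    if (simulateTimeout) {
      throw new OperationTimeoutException("Current operations did not distribute in time");
    }
  }

  /** Without the fix: a timeout propagates and the reply line is never reached. */
  static void processWithoutFix(Outcome outcome, boolean simulateTimeout) {
    waitForCurrentOperations(simulateTimeout);
    outcome.replySent = true;
  }

  /** With the fix: the timeout is caught and logged, and the reply is still sent. */
  static void processWithFix(Outcome outcome, boolean simulateTimeout) {
    try {
      waitForCurrentOperations(simulateTimeout);
    } catch (OperationTimeoutException e) {
      outcome.log.add("Exception caught while determining channel state: " + e.getMessage());
    } finally {
      outcome.replySent = true;
    }
  }

  /** Plays the role of the executor thread that runs the message handler. */
  static Outcome run(boolean withFix, boolean simulateTimeout) {
    Outcome outcome = new Outcome();
    try {
      if (withFix) {
        processWithFix(outcome, simulateTimeout);
      } else {
        processWithoutFix(outcome, simulateTimeout);
      }
    } catch (RuntimeException e) {
      outcome.log.add("Uncaught exception processing message: " + e.getMessage());
    }
    return outcome;
  }

  public static void main(String[] args) {
    System.out.println("without fix, reply sent: " + run(false, true).replySent); // false
    System.out.println("with fix, reply sent: " + run(true, true).replySent); // true
  }
}
```

In the sketch, the requester waiting for the reply corresponds to the thread blocked in StateFlushOperation.flush; it can only proceed in the variant that sends the reply regardless of the timeout.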
----------------------------------------------------------------
> Two server jvms crashed at same time and caused some primary and redundant
> buckets to be cleared. Causing some buckets to get locked and not able to
> recover also after bouncing all servers
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: GEODE-4051
> URL: https://issues.apache.org/jira/browse/GEODE-4051
> Project: Geode
> Issue Type: Bug
> Components: regions
> Affects Versions: 1.2.0
> Reporter: Igor Barchak
>
> "Pooled Waiting Message Processor 5" tid=0x162
> java.lang.Thread.State: TIMED_WAITING
>     at sun.misc.Unsafe.park(Native Method)
>     - waiting on java.util.concurrent.CountDownLatch$Sync@1993a5
>     at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>     at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
>     at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:64)
>     at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:715)
>     at org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:644)
>     at org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:624)
>     at org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:519)
>     at org.apache.geode.internal.cache.StateFlushOperation.flush(StateFlushOperation.java:243)
>     at org.apache.geode.internal.cache.InitialImageOperation.getFromOne(InitialImageOperation.java:349)
>     at org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1168)
>     at org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1023)
>     at org.apache.geode.internal.cache.BucketRegion.initialize(BucketRegion.java:253)
>     at org.apache.geode.internal.cache.LocalRegion.createSubregion(LocalRegion.java:962)
>     at org.apache.geode.internal.cache.PartitionedRegionDataStore.createBucketRegion(PartitionedRegionDataStore.java:726)
>     at org.apache.geode.internal.cache.PartitionedRegionDataStore.grabFreeBucket(PartitionedRegionDataStore.java:414)
>     - locked org.apache.geode.internal.cache.ProxyBucketRegion@6820a0b6
>     at org.apache.geode.internal.cache.PartitionedRegionDataStore.grabFreeBucketRecursively(PartitionedRegionDataStore.java:272)
>     at org.apache.geode.internal.cache.PartitionedRegionDataStore.grabBucket(PartitionedRegionDataStore.java:2815)
>     at org.apache.geode.internal.cache.partitioned.ManageBackupBucketMessage.operateOnPartitionedRegion(ManageBackupBucketMessage.java:148)
>     at org.apache.geode.internal.cache.partitioned.PartitionMessage.process(PartitionMessage.java:332)
> This seems to have been introduced by this fix:
> https://github.com/apache/geode/commit/3a1062e245b3ded52ea3f6b6de0aff94ce846fa3?diff=split
> See StateMarkerMessage.process.
> The first if branch doesn't have a finally block.
> The else branch has a finally block.
> The first if branch didn't have a 'waitFor' operation earlier; it was
> introduced in this commit.