[ 
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17121055#comment-17121055
 ] 

Aleksey Plekhanov commented on IGNITE-12845:
--------------------------------------------

[~ptupitsyn], can you please check your case locally against the attached 
pull request?

> GridNioServer can infinitely lose some events 
> ----------------------------------------------
>
>                 Key: IGNITE-12845
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12845
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Aleksey Plekhanov
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> With the selector optimization enabled ({{IGNITE_NO_SELECTOR_OPTS = false}}, 
> the default), {{GridNioServer}} can lose some events for a channel (depending 
> on the JDK version and OS). This can cause connected applications to hang. 
> Reproducer: 
> {code:java}
>     public void testConcurrentLoad() throws Exception {
>         startGrid(0);
>         try (IgniteClient client = Ignition.startClient(
>             new ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
>             ClientCache<Integer, Integer> cache =
>                 client.getOrCreateCache(DEFAULT_CACHE_NAME);
>             GridTestUtils.runMultiThreaded(
>                 () -> {
>                     for (int i = 0; i < 1000; i++)
>                         cache.put(i, i);
>                 }, 5, "run-async");
>         }
>     }
> {code}
> This reproducer eventually hangs on macOS (tested with JDK 8, 11, 12, 13, and 
> 14) and on some Linux environments (for example, it passed more than 100 
> times on a desktop Linux system with JDK 8, but hangs on TeamCity agents with 
> JDK 8 and 11). It has never hung on Windows (passed more than 100 times). It 
> passes on all systems and JDK versions when the system property 
> {{IGNITE_NO_SELECTOR_OPTS = true}} is set.
>  
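
For anyone who needs the workaround mentioned above, a minimal sketch of setting the property programmatically follows. It assumes the property is read when the NIO server classes are initialized, so it has to be set before any node is started; the class and method names are hypothetical. Passing -DIGNITE_NO_SELECTOR_OPTS=true on the JVM command line should be equivalent.

```java
public class SelectorOptsWorkaround {
    /** Name of the system property quoted in the issue description. */
    static final String PROP = "IGNITE_NO_SELECTOR_OPTS";

    /**
     * Disables the selector optimization for this JVM.
     * Assumption: must run before the first Ignite node is started.
     */
    static void disableSelectorOpts() {
        System.setProperty(PROP, "true");
    }

    public static void main(String[] args) {
        disableSelectorOpts();
        // Start the Ignite node after this point (omitted in this sketch).
        System.out.println(PROP + "=" + System.getProperty(PROP));
    }
}
```
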
> The root cause: the optimized {{SelectedSelectionKeySet}} always returns 
> {{false}} from its {{contains()}} method. This method is used by 
> {{sun.nio.ch.SelectorImpl.processReadyEvents()}}:
> {code:java}
> if (selectedKeys.contains(ski)) {
>     if (ski.translateAndUpdateReadyOps(rOps)) {
>         return 1;
>     }
> } else {
>     ski.translateAndSetReadyOps(rOps);
>     if ((ski.nioReadyOps() & ski.nioInterestOps()) != 0) {
>         selectedKeys.add(ski);
>         return 1;
>     }
> }
> {code}
> So, with a fair implementation, if a selection key is already in the selected 
> keys set, its ready-operation flags are updated. With 
> {{SelectedSelectionKeySet}}, the ready-operation flags are always overwritten 
> and the selection key is added again even if the set already contains it. 
> Some {{SelectorImpl}} implementations can pass several events for one 
> selection key to the {{processReadyEvents}} method (for example, the macOS 
> implementation {{KQueueSelectorImpl}} works this way). In this case, 
> duplicate selection keys are added to {{selectedKeys}} and all events 
> except the last are lost.
> Two bad things happen in {{GridNioServer}} for the reasons described above:
>  # Some event flags are lost and the worker doesn't perform the corresponding 
> action (for the attached reproducer, the "channel is ready for reading" event 
> is lost and the workers never read the channel after some point in time).
>  # Duplicate selection keys appear with the same event flags (for the attached 
> reproducer, it's the "channel is ready for writing" event). This duplication 
> leads to wrong processing of the {{GridSelectorNioSessionImpl#procWrite}} 
> flag, which will be {{false}} in some cases while the selection key's 
> {{interestedOps}} still contains the {{OP_WRITE}} operation, and this 
> operation is never excluded.
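
The effect described above can be reproduced without Ignite or the JDK selector. The following is a simplified model, not the actual JDK or Ignite code: Key, KeySet, FairSet and AlwaysFalseSet are hypothetical stand-ins, the interest-ops check is omitted, and "update" is modeled as OR-ing the new flags into the existing ones.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LostEventDemo {
    static final int OP_READ = 1;   // same values as SelectionKey.OP_READ / OP_WRITE
    static final int OP_WRITE = 4;

    /** Hypothetical stand-in for a selection key: only tracks ready ops. */
    static final class Key {
        int readyOps;
    }

    /** Minimal view of the selected-keys collection used by the selector. */
    interface KeySet {
        boolean contains(Key k);
        void add(Key k);
        int size();
    }

    /** "Fair" set: contains() reports real membership. */
    static final class FairSet implements KeySet {
        private final Set<Key> keys = new HashSet<>();
        public boolean contains(Key k) { return keys.contains(k); }
        public void add(Key k) { keys.add(k); }
        public int size() { return keys.size(); }
    }

    /** Optimized set as described in the issue: contains() is always false. */
    static final class AlwaysFalseSet implements KeySet {
        private final List<Key> keys = new ArrayList<>();
        public boolean contains(Key k) { return false; }
        public void add(Key k) { keys.add(k); }
        public int size() { return keys.size(); }
    }

    /** Simplified model of the processReadyEvents() branch quoted above. */
    static void processReadyEvent(KeySet selected, Key key, int rOps) {
        if (selected.contains(key))
            key.readyOps |= rOps;   // update: merge with flags already reported
        else {
            key.readyOps = rOps;    // set: overwrite whatever was there
            selected.add(key);
        }
    }

    /** Delivers READ then WRITE for one key; returns {set size, final ready ops}. */
    static int[] simulate(KeySet selected) {
        Key key = new Key();
        processReadyEvent(selected, key, OP_READ);
        processReadyEvent(selected, key, OP_WRITE);
        return new int[] {selected.size(), key.readyOps};
    }

    public static void main(String[] args) {
        int[] fair = simulate(new FairSet());
        int[] opt = simulate(new AlwaysFalseSet());
        System.out.println("fair:      size=" + fair[0] + " readyOps=" + fair[1]); // size=1 readyOps=5
        System.out.println("optimized: size=" + opt[0] + " readyOps=" + opt[1]);   // size=2 readyOps=4
    }
}
```

With the fair set, the key is stored once and carries both flags. With the always-false set, the key is stored twice and only OP_WRITE survives, which matches both symptoms listed above: the read event is lost and the write event is seen twice.
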
> Possible solutions:
>  * A fair implementation of the {{SelectedSelectionKeySet.contains}} method 
> (this solves all the problems but can be resource-consuming).
>  * Always set {{GridSelectorNioSessionImpl#procWrite}} to {{true}} when 
> adding {{OP_WRITE}} to {{interestedOps}} (for example, in the 
> {{AbstractNioClientWorker.registerWrite()}} method). In this case, some 
> "channel is ready for reading" events (but not data) can still be lost, but 
> not infinitely, and the data will eventually be read. If events are reordered 
> (first "channel is ready for writing", then "channel is ready for 
> reading"), the write to the channel is processed only after all data 
> has been read.
>  * Exclude {{OP_WRITE}} from {{interestedOps}} even if 
> {{GridSelectorNioSessionImpl#procWrite}} is {{false}} when there are no write 
> requests in the queue (see the {{GridNioServer.stopPollingForWrite()}} 
> method). This solution has the same shortcomings as the previous one. 
>  * A hybrid approach: use a probabilistic implementation of the {{contains}} 
> method (a Bloom filter, or just a check of the last element) and use one of 
> the two previous solutions as a workaround for cases where {{contains}} 
> incorrectly returns {{false}}. 
>  
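
To illustrate the "just check the last element" variant of the hybrid approach, here is a minimal sketch. LastElementKeyList is a hypothetical class, not the real SelectedSelectionKeySet, and the sketch assumes that duplicate events for one key usually arrive back to back, which the description does not state. The check is O(1) and never reports a key that is absent, but it can miss a key that was added earlier.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical key list whose contains() only inspects the last added element. */
public class LastElementKeyList<K> {
    private final List<K> keys = new ArrayList<>();

    /**
     * Returns true only if k is the most recently added key.
     * No false positives; false negatives are possible for older keys.
     */
    public boolean contains(K k) {
        return !keys.isEmpty() && keys.get(keys.size() - 1) == k;
    }

    /** Appends a key without any duplicate check. */
    public void add(K k) {
        keys.add(k);
    }

    public int size() {
        return keys.size();
    }

    public static void main(String[] args) {
        LastElementKeyList<Object> list = new LastElementKeyList<>();
        Object a = new Object();
        Object b = new Object();

        list.add(a);
        System.out.println(list.contains(a)); // true: a was added last

        list.add(b);
        System.out.println(list.contains(b)); // true: b was added last
        System.out.println(list.contains(a)); // false: a is present but not last
    }
}
```

The last line of the example is the false negative that one of the two workarounds above would still have to cover.
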



