[
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074472#comment-17074472
]
Ivan Daschinskiy commented on IGNITE-12845:
-------------------------------------------
[~alex_pl] The main reason, AFAIK, is to solve memory leak problems, e.g. [see
this SO discussion
|https://stackoverflow.com/questions/34645752/when-nettys-io-netty-channel-nio-selectedselectionkeyset-hold-too-much-selectio].
If we clear the set properly before every select or selectNow, and nullify
entries while iterating, we solve the original problem too. My solution is just
to add an index (a simple int[] array) to the existing data structure and move
this outdated implementation toward the [new netty
one|https://github.com/netty/netty/blob/c74b3f3a3b73fee125048b0f486fc9c19fb3bc14/transport/src/main/java/io/netty/channel/nio/SelectedSelectionKeySet.java].
As you can see, the current netty implementation resets the set by nullifying
the array, and this has not been a problem for netty since 2016.
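A minimal sketch of the idea above, not the actual patch: an array-backed set whose reset() nulls the used slots, plus an int[] open-addressing index so that contains() can answer honestly. The class name IndexedKeySet, the identity-hash probing and the initial sizes are assumptions made for illustration; a real change would work on SelectionKey inside the existing SelectedSelectionKeySet.

```java
import java.util.AbstractSet;
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical sketch: insertion-ordered array set with an int[] hash index. */
final class IndexedKeySet<T> extends AbstractSet<T> {
    /** Elements in insertion order. */
    private Object[] keys = new Object[16];

    /** Open-addressing table: 0 means empty, otherwise (position in keys + 1). */
    private int[] idx = new int[32];

    private int size;

    @Override public boolean add(T k) {
        if (k == null || contains(k))
            return false;

        if (size == keys.length)
            grow();

        keys[size] = k;
        put(k, size);
        size++;

        return true;
    }

    /** Fair contains(): linear probing over the index, identity comparison. */
    @Override public boolean contains(Object o) {
        if (o == null)
            return false;

        int mask = idx.length - 1;

        for (int i = System.identityHashCode(o) & mask; idx[i] != 0; i = (i + 1) & mask) {
            if (keys[idx[i] - 1] == o)
                return true;
        }

        return false;
    }

    /** Intended to be called before every select(): drops references to old keys. */
    void reset() {
        Arrays.fill(keys, 0, size, null);
        Arrays.fill(idx, 0);
        size = 0;
    }

    @Override public int size() {
        return size;
    }

    /** Read-only iterator; remove() is left unsupported in this sketch. */
    @Override public Iterator<T> iterator() {
        return new Iterator<T>() {
            private int pos;

            @Override public boolean hasNext() {
                return pos < size;
            }

            @SuppressWarnings("unchecked")
            @Override public T next() {
                if (pos >= size)
                    throw new NoSuchElementException();

                return (T)keys[pos++];
            }
        };
    }

    /** Records the position of a key in the first free index slot. */
    private void put(Object k, int pos) {
        int mask = idx.length - 1;
        int i = System.identityHashCode(k) & mask;

        while (idx[i] != 0)
            i = (i + 1) & mask;

        idx[i] = pos + 1;
    }

    /** Doubles both arrays; the index stays at most half full, so probing terminates. */
    private void grow() {
        keys = Arrays.copyOf(keys, keys.length * 2);
        idx = new int[keys.length * 2];

        for (int p = 0; p < size; p++)
            put(keys[p], p);
    }
}
```

The index costs one extra int[] and an Arrays.fill per reset(), which is the price paid for a contains() that no longer always returns false.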
> GridNioServer can infinitely lose some events
> ----------------------------------------------
>
> Key: IGNITE-12845
> URL: https://issues.apache.org/jira/browse/IGNITE-12845
> Project: Ignite
> Issue Type: Bug
> Reporter: Aleksey Plekhanov
> Priority: Major
>
> With the optimization enabled ({{IGNITE_NO_SELECTOR_OPTS = false}}, the default),
> {{GridNioServer}} can lose some events for a channel (depending on JDK
> version and OS). This can cause connected applications to hang. Reproducer:
> {code:java}
> public void testConcurrentLoad() throws Exception {
>     startGrid(0);
>     try (IgniteClient client = Ignition.startClient(
>         new ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
>         ClientCache<Integer, Integer> cache =
>             client.getOrCreateCache(DEFAULT_CACHE_NAME);
>         GridTestUtils.runMultiThreaded(
>             () -> {
>                 for (int i = 0; i < 1000; i++)
>                     cache.put(i, i);
>             }, 5, "run-async");
>     }
> }
> {code}
> This reproducer eventually hangs on macOS (tested with JDK 8, 11, 12, 13,
> 14) and on some Linux environments (for example, it passed more than 100 times
> on a desktop Linux system with JDK 8, but hangs on TeamCity agents with JDK 8
> and 11). It never hung on a Windows system (passed more than 100 times). It
> passes on all systems and JDK versions when the system property
> {{IGNITE_NO_SELECTOR_OPTS = true}} is set.
>
> The root cause: the optimized {{SelectedSelectionKeySet}} always returns
> {{false}} from its {{contains()}} method. The {{contains()}} method is used by
> the {{sun.nio.ch.SelectorImpl.processReadyEvents()}} method:
> {code:java}
> if (selectedKeys.contains(ski)) {
>     if (ski.translateAndUpdateReadyOps(rOps)) {
>         return 1;
>     }
> } else {
>     ski.translateAndSetReadyOps(rOps);
>     if ((ski.nioReadyOps() & ski.nioInterestOps()) != 0) {
>         selectedKeys.add(ski);
>         return 1;
>     }
> }
> {code}
> So, with a fair implementation, if a selection key is already contained in the
> selected keys set, its ready operations flags are updated. With
> {{SelectedSelectionKeySet}}, however, the ready operations flags are always
> overwritten and the selection key is added again even if it's already contained
> in the set. Some {{SelectorImpl}} implementations can pass several events for one
> selection key to the {{processReadyEvents}} method (for example, the macOS
> implementation {{KQueueSelectorImpl}} works this way). In this case,
> duplicate selection keys are added to {{selectedKeys}} and all events
> except the last are lost.
> Two bad things happen in {{GridNioServer}} for the reasons described above:
> # Some event flags are lost and the worker doesn't perform the corresponding
> action (in the attached reproducer, the "channel is ready for reading" event is
> lost and the workers never read the channel after some point in time).
> # Selection keys are duplicated with the same event flags (in the attached
> reproducer it's the "channel is ready for writing" event). This duplication leads
> to wrong handling of the {{GridSelectorNioSessionImpl#procWrite}} flag, which
> will be {{false}} in some cases while the selection key's
> {{interestedOps}} still contains the {{OP_WRITE}} operation, and this operation
> is never excluded.
> Possible solutions:
> * A fair implementation of the {{SelectedSelectionKeySet.contains}} method (this
> will solve all problems but can be resource-consuming).
> * Always set {{GridSelectorNioSessionImpl#procWrite}} to {{true}} when
> adding {{OP_WRITE}} to {{interestedOps}} (for example, in the
> {{AbstractNioClientWorker.registerWrite()}} method). In this case, some
> "channel is ready for reading" events (but not data) can still be lost, but
> not infinitely, and eventually the data will be read. If the events are reordered
> (first "channel is ready for writing", then "channel is ready for
> reading"), the write to the channel will be processed only after all data
> has been read.
> * Exclude {{OP_WRITE}} from {{interestedOps}} even if
> {{GridSelectorNioSessionImpl#procWrite}} is {{false}} when there are no write
> requests in the queue (see the {{GridNioServer.stopPollingForWrite()}} method).
> This solution has the same shortcomings as the previous one.
> * A hybrid approach: use some probabilistic implementation of the {{contains}}
> method (a Bloom filter, or just checking the last element) and use one of the two
> previous solutions as a workaround for cases when we incorrectly return
> {{false}} from {{contains}}.
>
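The root cause in the quoted description can be modeled in a few lines. This is a toy sketch, not JDK or Ignite code: Key, UnfairSet and processReadyEvent are hypothetical stand-ins, and the update/set branches are simplified to a bitwise OR and a plain assignment. It delivers two events for one key, as a kqueue-style selector may, and compares a fair set with one whose contains() always returns false.

```java
import java.util.AbstractSet;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

final class LostEventDemo {
    static final int OP_READ = 1;
    static final int OP_WRITE = 4;

    /** Toy stand-in for a selection key (not the JDK class). */
    static final class Key {
        int interestOps = OP_READ | OP_WRITE;
        int readyOps;
    }

    /** Mimics the optimized set: appends blindly, contains() is always false. */
    static final class UnfairSet extends AbstractSet<Key> {
        private final List<Key> keys = new ArrayList<>();

        @Override public boolean add(Key k) { return keys.add(k); }
        @Override public boolean contains(Object o) { return false; }
        @Override public Iterator<Key> iterator() { return keys.iterator(); }
        @Override public int size() { return keys.size(); }
    }

    /** Simplified model of the branch quoted from processReadyEvents(). */
    static void processReadyEvent(Set<Key> selected, Key k, int rOps) {
        if (selected.contains(k))
            k.readyOps |= rOps;   // update: earlier flags survive
        else {
            k.readyOps = rOps;    // set: earlier flags are overwritten

            if ((k.readyOps & k.interestOps) != 0)
                selected.add(k);
        }
    }

    /** Delivers READ and then WRITE for the same key; returns {entries, readyOps}. */
    static int[] run(Set<Key> selected) {
        Key k = new Key();

        processReadyEvent(selected, k, OP_READ);
        processReadyEvent(selected, k, OP_WRITE);

        return new int[] {selected.size(), k.readyOps};
    }

    public static void main(String[] args) {
        int[] fair = run(new HashSet<>());
        int[] unfair = run(new UnfairSet());

        System.out.println("fair:   entries=" + fair[0] + " readyOps=" + fair[1]);
        System.out.println("unfair: entries=" + unfair[0] + " readyOps=" + unfair[1]);
    }
}
```

With the fair set the key appears once and keeps both flags (readyOps=5). With the unfair set the same key appears twice and only OP_WRITE remains (readyOps=4), which corresponds to the lost "ready for reading" event and the duplicated "ready for writing" entry described in the issue.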