[
https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074377#comment-17074377
]
Aleksey Plekhanov commented on IGNITE-12845:
--------------------------------------------
[~ivandasch], the reset operation will still be O(MAX_SIZE).
NIO is a critical part of Ignite, so any change to the underlying storage of
{{SelectedSelectionKeySet}} is risky and must be carefully tested for
performance. I think the bug should be fixed as simply as possible and the
ticket should be targeted to 2.8.1. I propose the 2nd or 3rd solution from the
ticket description. It's simple, it solves the problem (it's not the optimal
solution, but at least the problem will no longer be critical), and it doesn't
affect performance. A separate ticket for the improvement can be created and
targeted to 2.9 or a later release. Feel free to create such a ticket. Perhaps
the solution should be discussed on the dev list to involve more participants.
I.e., let's fix the bug under the "Bug" ticket targeted to 2.8.1 and make the
improvement under an "Improvement" ticket targeted to 2.9. WDYT?
> GridNioServer can infinitely lose some events
> ----------------------------------------------
>
> Key: IGNITE-12845
> URL: https://issues.apache.org/jira/browse/IGNITE-12845
> Project: Ignite
> Issue Type: Bug
> Reporter: Aleksey Plekhanov
> Priority: Major
>
> With the selector optimization enabled ({{IGNITE_NO_SELECTOR_OPTS = false}},
> the default), {{GridNioServer}} can lose some events for a channel (depending
> on the JDK version and OS). This can cause connected applications to hang.
> Reproducer:
> {code:java}
> public void testConcurrentLoad() throws Exception {
>     startGrid(0);
>     try (IgniteClient client = Ignition.startClient(
>         new ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
>         ClientCache<Integer, Integer> cache =
>             client.getOrCreateCache(DEFAULT_CACHE_NAME);
>         GridTestUtils.runMultiThreaded(
>             () -> {
>                 for (int i = 0; i < 1000; i++)
>                     cache.put(i, i);
>             }, 5, "run-async");
>     }
> }
> {code}
> This reproducer eventually hangs on macOS (tested with JDK 8, 11, 12, 13, 14)
> and on some Linux environments (for example, it passed more than 100 times on
> a desktop Linux system with JDK 8, but hangs on TeamCity agents with JDK 8 and
> 11). It has never hung on Windows (passed more than 100 times). It passes on
> all systems and JDK versions when the system property
> {{IGNITE_NO_SELECTOR_OPTS = true}} is set.
>
> The root cause: the optimized {{SelectedSelectionKeySet}} always returns
> {{false}} from its {{contains()}} method. That method is used by the
> {{sun.nio.ch.SelectorImpl.processReadyEvents()}} method:
> {code:java}
> if (selectedKeys.contains(ski)) {
>     if (ski.translateAndUpdateReadyOps(rOps)) {
>         return 1;
>     }
> } else {
>     ski.translateAndSetReadyOps(rOps);
>     if ((ski.nioReadyOps() & ski.nioInterestOps()) != 0) {
>         selectedKeys.add(ski);
>         return 1;
>     }
> }
> {code}
> So, with a fair implementation, if a selection key is already contained in
> the selected keys set, its ready-operation flags are updated. With
> {{SelectedSelectionKeySet}}, however, the ready-operation flags are always
> overwritten and the selection key is added again even if it is already in the
> set. Some {{SelectorImpl}} implementations can pass several events for one
> selection key to the {{processReadyEvents}} method (for example, the macOS
> implementation {{KQueueSelectorImpl}} works this way). In this case,
> duplicate selection keys are added to {{selectedKeys}} and all events except
> the last one are lost.
> Two bad things happen in {{GridNioServer}} for the reasons described above:
> # Some event flags are lost and the worker doesn't perform the corresponding
> action (in the attached reproducer the "channel is ready for reading" event
> is lost, and after some point the workers never read the channel again).
> # Selection keys are duplicated with the same event flags (in the attached
> reproducer it's the "channel is ready for writing" event). This duplication
> leads to incorrect handling of the {{GridSelectorNioSessionImpl#procWrite}}
> flag: it will be {{false}} in some cases while the selection key's
> {{interestedOps}} still contains {{OP_WRITE}}, and this operation is never
> removed.
> Possible solutions:
> * Fair implementation of {{SelectedSelectionKeySet.contains}} method (this
> will solve all problems but can be resource consuming)
> * Always set {{GridSelectorNioSessionImpl#procWrite}} to {{true}} when
> adding {{OP_WRITE}} to {{interestedOps}} (for example in the
> {{AbstractNioClientWorker.registerWrite()}} method). In this case, some
> "channel is ready for reading" events (but not data) can still be lost, but
> not infinitely, and the data will eventually be read. If events are reordered
> (first "channel is ready for writing", then "channel is ready for reading"),
> the write to the channel will be processed only after all data has been
> read.
> * Exclude {{OP_WRITE}} from {{interestedOps}} even if
> {{GridSelectorNioSessionImpl#procWrite}} is {{false}} when there are no write
> requests in the queue (see {{GridNioServer.stopPollingForWrite()}} method).
> This solution has the same shortcomings as the previous one.
> * Hybrid approach. Use a probabilistic implementation of the {{contains}}
> method (a Bloom filter, or just a check of the last element) and use one of
> the two previous solutions as a workaround for cases when {{contains}}
> incorrectly returns {{false}}.
>
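The root cause described in the quoted ticket can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: {{FakeKey}}, {{KeyList}}, {{Mode}} and {{simulate}} are hypothetical names invented for the illustration and are not Ignite or JDK classes, and the interest-ops check from {{processReadyEvents}} is omitted. It assumes that the "update" branch merges new ready flags into the existing ones while the "set" branch overwrites them, and it delivers two events for the same key, as the ticket says a kqueue-style selector can do.

```java
import java.util.ArrayList;
import java.util.List;

class SelectedKeysSketch {
    // Same numeric values as SelectionKey.OP_READ and SelectionKey.OP_WRITE.
    static final int OP_READ = 1;
    static final int OP_WRITE = 4;

    /** Which contains() behaviour the simulated key collection uses. */
    enum Mode { FAIR, ALWAYS_FALSE, LAST_ELEMENT }

    /** Hypothetical stand-in for a selection key; it only tracks ready flags. */
    static final class FakeKey {
        int readyOps;
    }

    /** Hypothetical array-backed collection of selected keys. */
    static final class KeyList {
        final List<FakeKey> keys = new ArrayList<>();
        final Mode mode;

        KeyList(Mode mode) {
            this.mode = mode;
        }

        boolean contains(FakeKey k) {
            switch (mode) {
                case FAIR: // Linear scan: correct, but O(n) per call.
                    for (FakeKey x : keys)
                        if (x == k)
                            return true;
                    return false;
                case LAST_ELEMENT: // Cheap check that only catches consecutive duplicates.
                    return !keys.isEmpty() && keys.get(keys.size() - 1) == k;
                default: // ALWAYS_FALSE: the behaviour the ticket reports.
                    return false;
            }
        }
    }

    /** Follows the branch structure of the quoted processReadyEvents snippet. */
    static void processReadyEvent(KeyList selected, FakeKey key, int rOps) {
        if (selected.contains(key))
            key.readyOps |= rOps; // "update": merge with flags already set
        else {
            key.readyOps = rOps;  // "set": overwrite previous flags
            selected.keys.add(key);
        }
    }

    /** Delivers OP_READ and then OP_WRITE for the same key. */
    static KeyList simulate(Mode mode) {
        KeyList selected = new KeyList(mode);
        FakeKey key = new FakeKey();
        processReadyEvent(selected, key, OP_READ);
        processReadyEvent(selected, key, OP_WRITE);
        return selected;
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            KeyList res = simulate(mode);
            System.out.println(mode + ": entries=" + res.keys.size()
                + ", readyOps=" + res.keys.get(0).readyOps);
        }
    }
}
```

In this simulation the {{FAIR}} and {{LAST_ELEMENT}} modes leave one entry carrying both flags, while {{ALWAYS_FALSE}} leaves two entries for the same key carrying only {{OP_WRITE}}, so the read event is lost. The last-element check helps here only because the two events arrive back to back; if events for different keys were interleaved, it would behave like {{ALWAYS_FALSE}}, which is why the ticket pairs it with one of the workarounds.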
--
This message was sent by Atlassian Jira
(v8.3.4#803005)