[
https://issues.apache.org/jira/browse/IGNITE-13643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vladimir Steshin updated IGNITE-13643:
--------------------------------------
Description:
Currently, IgniteUtils.closeQuiet(@Nullable Socket sock) takes about 5 sec to
close a socket. We should account for socket linger in failureDetectionTimeout.
This delay violates node failure detection: although we set
failureDetectionTimeout == 1000, node failure is detected within 6.5 sec on
average. Logging shows the delay on socket closing in
IgniteUtils.closeQuiet(@Nullable Socket sock).
This time gap was unearthed by a discovery integration test on ducktape [1].
Failure detection timeout is set to 1000 ms.
Typical results before the fix for 1 node:
"Detection of node(s) failure (ms)": 6140, "All detection delays (ms):":
"[6140]", "Nodes failed": 1}
Typical results after the fix for 1 node:
"Detection of node(s) failure (ms)": 1004, "All detection delays (ms):":
"[1004]", "Nodes failed": 1}
Suggestion: use forced closing, set soLinger=0, do not wait for the rest of the
socket IO. We close the socket in TcpDiscoverySpi after we have already waited
for the target timeouts and consider the connection lost or invalid. We do not
need to wait for any traffic on the socket any more.
There is a note that 'graceful' socket closing was introduced to work around a
bug in OpenJDK 12 [2], but as far as I can see, it has been fixed.
However, we should take into account known issues with SSL connections, where
linger might be necessary.
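A minimal sketch of the forced close suggested above, assuming a plain
(non-SSL) java.net.Socket. The class and method names (ForcedClose,
closeForcibly) are hypothetical and are not part of Ignite; only
Socket.setSoLinger and Socket.close are real JDK calls.

```java
import java.io.IOException;
import java.net.Socket;
import java.net.SocketException;

/** Hypothetical helper illustrating a forced (abortive) socket close. */
public class ForcedClose {
    /**
     * Closes the socket without lingering on unsent data.
     * Assumption: the caller already considers the connection lost or invalid.
     */
    public static void closeForcibly(Socket sock) {
        if (sock == null)
            return;

        try {
            // SO_LINGER enabled with a zero timeout: close() does not wait for
            // pending output; on common TCP stacks the connection is reset.
            sock.setSoLinger(true, 0);
        }
        catch (SocketException ignored) {
            // The socket may already be closed; nothing more to configure.
        }

        try {
            sock.close();
        }
        catch (IOException ignored) {
            // Quiet close: errors on closing are deliberately swallowed.
        }
    }
}
```

With SSL sockets this abortive close may not be appropriate, which is the
caveat noted above.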
[1]
https://github.com/apache/ignite/blob/ignite-ducktape/modules/ducktests/tests/ignitetest/tests/discovery_test.py
[2] https://bugs.openjdk.java.net/browse/JDK-8219658
was:
Currently, IgniteUtils.closeQuiet(@Nullable Socket sock) takes about 5 sec to
close a socket. Probably this is the default soTimeout. This delay violates
node failure detection: although we set failureDetectionTimeout == 1000, node
failure is detected within 6.5 sec on average. Logging shows the delay on
socket closing in IgniteUtils.closeQuiet(@Nullable Socket sock).
This time gap was unearthed by a discovery integration test on ducktape [1].
Failure detection timeout is set to 1000 ms.
Typical results before the fix for 1 node:
"Detection of node(s) failure (ms)": 6140, "All detection delays (ms):":
"[6140]", "Nodes failed": 1}
Typical results after the fix for 1 node:
"Detection of node(s) failure (ms)": 1004, "All detection delays (ms):":
"[1004]", "Nodes failed": 1}
Suggestion: use forced closing, set soLinger=0, do not wait for the rest of the
socket IO. We close the socket in TcpDiscoverySpi after we have already waited
for the target timeouts and consider the connection lost or invalid. We do not
need to wait for any traffic on the socket any more.
There is a note that 'graceful' socket closing was introduced to work around a
bug in OpenJDK 12 [2], but as far as I can see, it has been fixed.
However, we should take into account known issues with SSL connections, where
linger might be necessary.
[1]
https://github.com/apache/ignite/blob/ignite-ducktape/modules/ducktests/tests/ignitetest/tests/discovery_test.py
[2] https://bugs.openjdk.java.net/browse/JDK-8219658
> Fix long closing of the socket in ServerImpl (TcpDiscoverySpi)
> --------------------------------------------------------------
>
> Key: IGNITE-13643
> URL: https://issues.apache.org/jira/browse/IGNITE-13643
> Project: Ignite
> Issue Type: Bug
> Reporter: Vladimir Steshin
> Assignee: Vladimir Steshin
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h