[ 
https://issues.apache.org/jira/browse/IGNITE-13643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13643:
--------------------------------------
    Description: 
Currently, IgniteUtils.closeQuiet(@Nullable Socket sock) can take about 5 sec
to close a socket. This violates the node failure detection timeout: although
we set failureDetectionTimeout == 1000, node failure is detected within
6.5 sec on average.

This time gap was unearthed by a discovery integration test on ducktape [1].
The failure detection timeout is set to 1000 ms.
Typical results before the fix for 1 node:
"Detection of node(s) failure (ms)": 6140, "All detection delays (ms):": 
"[6140]", "Nodes failed": 1}

Typical results after the fix for 1 node:
"Detection of node(s) failure (ms)": 1034, "All detection delays (ms):": 
"[1034]", "Nodes failed": 1}

There is a note that 'graceful' socket closing was introduced to work around
a bug in OpenJDK12 [2], but as far as I can see it has been fixed. Also, there
were SSL issues like [3] and [4].
Modern versions of various JDKs include several fixes and support TLS 1.3
([5] and [6]). OpenJDK11 does well as far as I know.

I believe SSL in discovery is rarely used. Lingering on socket close slows
down performance. If the SSL issues do appear, one could just enable soLinger
or update the JDK. There is no reason to prolong failure detection by default.

[1] 
https://github.com/apache/ignite/blob/ignite-ducktape/modules/ducktests/tests/ignitetest/tests/discovery_test.py
[2] https://bugs.openjdk.java.net/browse/JDK-8219658
[3] https://issues.apache.org/jira/browse/IGNITE-12818
[4] https://issues.apache.org/jira/browse/IGNITE-11288
[5] https://bugs.openjdk.java.net/browse/JDK-8245468
[6] https://www.oracle.com/java/technologies/javase/8u261-relnotes.html


  was:
Currently, IgniteUtils.closeQuiet(@Nullable Socket sock) takes about 5 sec to
close a socket. We should include socket linger in failureDetectionTimeout.
This delay violates the node failure detection timeout: although we set
failureDetectionTimeout == 1000, node failure is detected within 6.5 sec on
average. Logging shows a delay on socket closing in
IgniteUtils.closeQuiet(@Nullable Socket sock).
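
A delay like this can be made visible by timing the close() call itself. The
following is only a minimal sketch in plain java.net, not the Ignite code: the
class CloseTimingSketch and the method timedCloseMs are hypothetical names
used for illustration.

```java
import java.io.IOException;
import java.net.Socket;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of Ignite): measures how long Socket.close() takes. */
final class CloseTimingSketch {
    /** Closes the socket quietly and returns the time spent in close(), in milliseconds. */
    static long timedCloseMs(Socket sock) {
        long start = System.nanoTime();

        try {
            // With SO_LINGER enabled, this call may block up to the linger timeout.
            sock.close();
        }
        catch (IOException ignored) {
            // Quiet close: the error is deliberately swallowed.
        }

        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }
}
```

Logging the returned value next to the configured failure detection timeout
would show whether closing the socket is what extends the detection time.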


This time gap was unearthed by a discovery integration test on ducktape [1].
The failure detection timeout is set to 1000 ms.
Typical results before the fix for 1 node:
"Detection of node(s) failure (ms)": 6140, "All detection delays (ms):": 
"[6140]", "Nodes failed": 1}

Typical results after the fix for 1 node:
"Detection of node(s) failure (ms)": 1004, "All detection delays (ms):": 
"[1004]", "Nodes failed": 1}


Suggestion: use forced closing, set soLinger=0, and do not wait for the rest
of the socket IO. We close the socket in TcpDiscoverySpi only after we have
already waited for the target timeouts and consider the connection lost or
invalid. We do not need to wait for any more traffic on the socket.
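
The suggestion above could look roughly like the sketch below. It uses only the
standard java.net.Socket API. The class ForcedCloseSketch and the method
closeForced are hypothetical names, and this is an illustration of the idea
under those assumptions, not the actual Ignite change.

```java
import java.io.IOException;
import java.net.Socket;

/** Hypothetical helper (not part of Ignite): closes a socket without lingering. */
final class ForcedCloseSketch {
    /** Sets SO_LINGER to 0 so that close() does not wait for unsent data, then closes quietly. */
    static void closeForced(Socket sock) {
        if (sock == null)
            return;

        try {
            // Linger enabled with a zero timeout: close() returns immediately and
            // pending output is discarded (a TCP connection is typically reset).
            sock.setSoLinger(true, 0);
        }
        catch (IOException ignored) {
            // The socket may already be closed or broken; still try to close it below.
        }

        try {
            sock.close();
        }
        catch (IOException ignored) {
            // Quiet close: nothing useful can be done here.
        }
    }
}
```

The trade-off is that unsent data is dropped, which is acceptable only because
the connection is already considered lost at this point.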

There is a note that 'graceful' socket closing was introduced to work around
a bug in OpenJDK12 [2], but as far as I can see it has been fixed.
However, we should take into account known issues with SSL connections where
linger might be necessary.


[1] 
https://github.com/apache/ignite/blob/ignite-ducktape/modules/ducktests/tests/ignitetest/tests/discovery_test.py
[2] https://bugs.openjdk.java.net/browse/JDK-8219658

        Summary: Disable socket linger by default in TcpDiscoverySpi  (was: Fix 
long closing of the socket in ServerImpl (TcpDiscoverySpi))

> Disable socket linger by default in TcpDiscoverySpi
> ---------------------------------------------------
>
>                 Key: IGNITE-13643
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13643
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, IgniteUtils.closeQuiet(@Nullable Socket sock) can take about
> 5 sec to close a socket. This violates the node failure detection timeout:
> although we set failureDetectionTimeout == 1000, node failure is detected
> within 6.5 sec on average.
> This time gap was unearthed by a discovery integration test on ducktape [1].
> The failure detection timeout is set to 1000 ms.
> Typical results before the fix for 1 node:
> "Detection of node(s) failure (ms)": 6140, "All detection delays (ms):": 
> "[6140]", "Nodes failed": 1}
> Typical results after the fix for 1 node:
> "Detection of node(s) failure (ms)": 1034, "All detection delays (ms):": 
> "[1034]", "Nodes failed": 1}
> There is a note that 'graceful' socket closing was introduced to work around
> a bug in OpenJDK12 [2], but as far as I can see it has been fixed. Also,
> there were SSL issues like [3] and [4].
> Modern versions of various JDKs include several fixes and support TLS 1.3
> ([5] and [6]). OpenJDK11 does well as far as I know.
> I believe SSL in discovery is rarely used. Lingering on socket close slows
> down performance. If the SSL issues do appear, one could just enable soLinger
> or update the JDK. There is no reason to prolong failure detection by default.
> [1] 
> https://github.com/apache/ignite/blob/ignite-ducktape/modules/ducktests/tests/ignitetest/tests/discovery_test.py
> [2] https://bugs.openjdk.java.net/browse/JDK-8219658
> [3] https://issues.apache.org/jira/browse/IGNITE-12818
> [4] https://issues.apache.org/jira/browse/IGNITE-11288
> [5] https://bugs.openjdk.java.net/browse/JDK-8245468
> [6] https://www.oracle.com/java/technologies/javase/8u261-relnotes.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
