[ 
https://issues.apache.org/jira/browse/IGNITE-13643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Steshin updated IGNITE-13643:
--------------------------------------
    Description: 
Currently, IgniteUtils.closeQuiet(@Nullable Socket sock) takes about 5 seconds 
to close a socket, probably because of the default soTimeout. This breaks node 
failure detection: although we set failureDetectionTimeout == 1000, a node 
failure is detected in 6.5 seconds on average. Logging shows that the delay 
occurs on socket closing in IgniteUtils.closeQuiet(@Nullable Socket sock).


This time gap was unearthed by a discovery integration test on ducktape [1]. 
Failure detection timeout is set to 1000ms.
Typical results before the fix for 1 node:
"Detection of node(s) failure (ms)": 6140, "All detection delays (ms):": 
"[6140]", "Nodes failed": 1}

Typical results after the fix for 1 node:
"Detection of node(s) failure (ms)": 1004, "All detection delays (ms):": 
"[1004]", "Nodes failed": 1}


Suggestion: use forced closing. Set soLinger=0 and do not wait for the rest of 
the socket IO. We close the socket in TcpDiscoverySpi only after we have already 
waited for the target timeouts and consider the connection lost or invalid. We 
do not need to wait for any more traffic on the socket.
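
The forced close suggested above could look roughly like the sketch below. It 
is only an illustration, not the actual patch: the class ForcedSocketClose and 
the method closeForcibly are hypothetical names and are not part of 
IgniteUtils. It relies on the standard java.net.Socket.setSoLinger(true, 0) 
call, which makes close() abort the connection instead of waiting for pending 
data.

```java
import java.io.IOException;
import java.net.Socket;

/** Hypothetical helper (not part of Ignite): closes a socket without lingering. */
final class ForcedSocketClose {
    private ForcedSocketClose() {
        // No instances: static helper only.
    }

    /**
     * Closes the socket abortively. With SO_LINGER enabled and a zero timeout,
     * close() discards unsent data and does not wait for the remaining socket IO.
     *
     * @param sock Socket to close, may be null.
     * @return true if this call closed the socket, false if it was null or already closed.
     */
    static boolean closeForcibly(Socket sock) {
        if (sock == null || sock.isClosed())
            return false;

        try {
            // Linger on, timeout 0: forced close.
            sock.setSoLinger(true, 0);
        }
        catch (IOException ignored) {
            // Best effort: fall back to a regular close below.
        }

        try {
            sock.close();
        }
        catch (IOException ignored) {
            // Quiet close: the connection is already considered lost.
        }

        return true;
    }
}
```

A caller would use ForcedSocketClose.closeForcibly(sock) only where the 
connection is already considered lost or invalid, as described above. Whether 
the same approach is safe for SSL sockets is an open question in this ticket.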

There is a note that 'graceful' socket closing was introduced to work around a 
bug in OpenJDK 12 [2]. As far as I can see, that bug has been fixed.
However, we should take into account known issues with SSL connections, where 
linger might be necessary.


[1] 
https://github.com/apache/ignite/blob/ignite-ducktape/modules/ducktests/tests/ignitetest/tests/discovery_test.py
[2] https://bugs.openjdk.java.net/browse/JDK-8219658

  was:
Currently, IgniteUtils.closeQuiet(@Nullable Socket sock) takes about 5 seconds 
to close a socket, probably because of the default soTimeout. This breaks node 
failure detection: although we set failureDetectionTimeout == 1000, a node 
failure is detected in 6.5 seconds on average. Logging shows that the delay 
occurs on socket closing in IgniteUtils.closeQuiet(@Nullable Socket sock).


This time gap was unearthed by a discovery integration test on ducktape [1]. 
Failure detection timeout is set to 1000ms.
Typical results before the fix for 1 node:
"Detection of node(s) failure (ms)": 6140, "All detection delays (ms):": 
"[6140]", "Nodes failed": 1}

Typical results after the fix for 1 node:
"Detection of node(s) failure (ms)": 1004, "All detection delays (ms):": 
"[1004]", "Nodes failed": 1}


Suggestion: use forced closing. Set soLinger=0 and do not wait for the rest of 
the socket IO. We close the socket in TcpDiscoverySpi only after we have already 
waited for the target timeouts and consider the connection lost or invalid. We 
do not need to wait for any more traffic on the socket.

There is a note that 'graceful' socket closing was introduced to work around a 
bug in OpenJDK 12 [2]. As far as I can see, that bug has been fixed.


[1] 
https://github.com/apache/ignite/blob/ignite-ducktape/modules/ducktests/tests/ignitetest/tests/discovery_test.py
[2] https://bugs.openjdk.java.net/browse/JDK-8219658


> Fix long closing of the socket in ServerImpl (TcpDiscoverySpi)
> --------------------------------------------------------------
>
>                 Key: IGNITE-13643
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13643
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
