I have encountered the following problem and wonder if anyone has any ideas.
Problem
lsof for the ODL process shows a large number of CLOSE_WAIT connections when the OVS node
repeatedly fails to reconnect to the ODL controller port. Eventually ODL throws
"Too many open files" as the CLOSE_WAIT connections pile up and exceed the
maximum allowed number of file descriptors.
This problem only happens when we enable TRACE-level ODL logging during
scalability testing.
Reproduction steps:
1) One control node, one ODL node, running Boron.
2) On the ODL node, enable TRACE-level logging for netvirt, openflowplugin, and
openflowjava.
3) From the control node, use a script to define 100 networks/subnets in a
loop (a rough sketch of such a loop is included after these steps).
4) At around the 20th-50th network creation, OVS starts to disconnect from the ODL
openflowplugin port because its inactivity probe fires. The inactivity on the ODL side
is likely because ODL spends most of its time on logging activity (see step 2).
This problem does not happen if we do not enable TRACE-level logging.
5) Subsequently, OVS repeatedly tries to reconnect to the ODL openflowplugin
port without success.
6) Steps 4) and 5) repeat for about half an hour, with no CLOSE_WAIT
connections appearing in the lsof output. ODL's karaf log shows normal entries for
connections being established and closed.
7) After a number of failed reconnection attempts as mentioned in 6), the
subsequent connection attempts result in CLOSE_WAIT connections, as shown in
lsof:
java 10407 odluser 383u IPv6 77653 0t0 TCP
odl2.c.my-odl.internal:6653->control2.c.my-odl.internal:40232 (CLOSE_WAIT)
java 10407 odluser 401u IPv6 79949 0t0 TCP
odl2.c.my-odl.internal:6653->control2.c.my-odl.internal:40236 (CLOSE_WAIT)
.....
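For completeness, step 3) in our setup was just a shell script driving the OpenStack CLI in a
loop on the control node. The Java sketch below only illustrates the same idea of creating
networks in a tight loop, here directly against ODL's Neutron northbound REST API; the URL,
port, credentials and payload are assumptions for illustration, not what we actually ran:

import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class CreateNetworksLoop {
    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 100; i++) {
            // Assumed Neutron northbound endpoint on the ODL node; adjust host/port/path to your install.
            URL url = new URL("http://odl2.c.my-odl.internal:8181/controller/nb/v2/neutron/networks");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setRequestProperty("Authorization", "Basic YWRtaW46YWRtaW4="); // admin:admin, assumed credentials
            conn.setDoOutput(true);
            String body = "{\"network\": {\"name\": \"net-" + i + "\", \"admin_state_up\": true}}";
            conn.getOutputStream().write(body.getBytes(StandardCharsets.UTF_8));
            System.out.println("create net-" + i + " -> HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }
}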
When 7) of the above steps happens, the following is observed:
1) The cluster service did not send ownership change notifications for the last
disconnection.
2) When a new connection arrives, LifecycleServiceImpl calls
ClusterSingletonServiceRegistration#registerClusterSingletonService. Here it
does not call serviceGroup.initializationClusterSingletonGroup(), since the
serviceGroup was not cleaned up properly because of 1).
3) LifecycleServiceImpl then calls
ClusterSingletonServiceGroupImpl.registerService.
4) The call in 3) hangs at:
LifecycleServiceImpl::instantiateServiceInstance
  DeviceContextImpl::onContextInstantiateService
    DeviceInitializationUtils.initializeNodeInformation
      DeviceInitializationUtils.createDeviceFeaturesForOF13(deviceContext, switchFeaturesMandatory, convertorExecutor).get(); <---
5) The call in 3) is invoked on a netty worker thread that handles I/O for
ODL.
6) The call in 3) holds a Semaphore,
ClusterSingletonServiceGroupImpl::clusterLock.
7) Subsequent incoming requests from the same OVS now also result in
ClusterSingletonServiceGroupImpl.registerService being invoked.
8) The requests in 7) hang forever waiting for the Semaphore
ClusterSingletonServiceGroupImpl::clusterLock (taken in 6)).
9) The requests in 7) are also handled on netty worker threads.
10) As reconnection requests keep coming, netty eventually runs out of worker
threads to handle new connections.
11) Subsequent incoming connections and connection closes from OVS end up as
CLOSE_WAIT connections, since netty has no threads left to handle them (a
simplified stand-alone sketch of this pattern follows below).
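To make the thread-starvation pattern in 4)-11) concrete, here is a minimal stand-alone Java
sketch (not the actual ODL/openflowplugin code; the names and pool size are made up for
illustration). The first task blocks in get() while holding a semaphore on a worker thread,
every later task blocks on acquire(), and the fixed-size pool is left with no thread for new
work:

import java.util.concurrent.*;

public class EventLoopStarvationSketch {
    // Stand-ins for ClusterSingletonServiceGroupImpl::clusterLock and the netty worker group.
    static final Semaphore clusterLock = new Semaphore(1);
    static final ExecutorService eventLoops = Executors.newFixedThreadPool(4);

    static void registerService(CompletableFuture<Void> deviceInit) {
        try {
            clusterLock.acquire();   // 6) the semaphore is taken on a worker thread
            deviceInit.get();        // 4) analogous to createDeviceFeaturesForOF13(...).get() never completing
            clusterLock.release();
        } catch (InterruptedException | ExecutionException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CompletableFuture<Void> neverCompletes = new CompletableFuture<>();
        // Each reconnection attempt lands on a worker thread and invokes registerService(), as in 7).
        for (int i = 0; i < 8; i++) {
            eventLoops.submit(() -> registerService(neverCompletes));
        }
        Thread.sleep(2000);
        // By now one thread is parked in get() and the other three in acquire(); the remaining
        // tasks just queue up. In the real system new OVS connections get no thread and are
        // left in CLOSE_WAIT, matching 10) and 11).
        System.out.println("all worker threads blocked; further tasks are queued");
        System.exit(0); // the blocked threads would otherwise keep the JVM alive indefinitely
    }
}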
Sorry for the long email, but does anyone have any idea about this issue, or has anyone
run into something similar? Some questions I am trying to understand:
1) Why did the cluster service not send ownership change notifications, as in observation 1)?
2) The Semaphore (observation 6) and the blocked call (observation 4) on netty I/O worker
threads can lead to bad situations like this; can they be avoided? (A rough sketch of one
possible approach is below.)
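On question 2), one idea (only a sketch, not a claim about how openflowplugin should actually
be changed): instead of calling get() on the netty thread while holding clusterLock, the
initialization future could be completed asynchronously on a dedicated executor and the
semaphore released from the callback, so the I/O thread is never parked. The names below are
hypothetical:

import java.util.concurrent.*;

public class NonBlockingRegistrationSketch {
    static final Semaphore clusterLock = new Semaphore(1);
    static final ExecutorService initExecutor = Executors.newSingleThreadExecutor();

    static void registerService(CompletableFuture<Void> deviceInit) throws InterruptedException {
        clusterLock.acquire();
        // Continue on initExecutor when device initialization finishes (or fails),
        // instead of blocking the calling netty worker thread with get().
        deviceInit.whenCompleteAsync((result, failure) -> {
            try {
                if (failure != null) {
                    // hypothetical cleanup of the device context would go here
                }
            } finally {
                clusterLock.release();
            }
        }, initExecutor);
        // registerService() returns immediately; the netty thread can service other connections.
    }

    public static void main(String[] args) throws InterruptedException {
        CompletableFuture<Void> deviceInit = new CompletableFuture<>();
        registerService(deviceInit);  // returns right away
        deviceInit.complete(null);    // when initialization completes, the lock is released on initExecutor
        initExecutor.shutdown();
    }
}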
Thanks, Vinh