Jark Wu created FLINK-24607:
-------------------------------
Summary: SourceCoordinator may miss to close SplitEnumerator when
failover frequently
Key: FLINK-24607
URL: https://issues.apache.org/jira/browse/FLINK-24607
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.13.3
Reporter: Jark Wu
Attachments: jobmanager.log
We are having a connection leak problem when using mysql-cdc [1] source. We
observed that many enumerators are not closed from the JM log.
{code}
➜ test123 cat jobmanager.log | grep "SourceCoordinator \[\] - Restoring
SplitEnumerator" | wc -l
264
➜ test123 cat jobmanager.log | grep "SourceCoordinator \[\] - Starting split
enumerator" | wc -l
264
➜ test123 cat jobmanager.log | grep "MySqlSourceEnumerator \[\] - Starting
enumerator" | wc -l
263
➜ test123 cat jobmanager.log | grep "SourceCoordinator \[\] - Closing
SourceCoordinator" | wc -l
264
➜ test123 cat jobmanager.log | grep "MySqlSourceEnumerator \[\] - Closing
enumerator" | wc -l
195
{code}
We added "Closing enumerator" log in {{MySqlSourceEnumerator#close()}}, and
"Starting enumerator" in {{MySqlSourceEnumerator#start()}}. From the above
result you can see that SourceCoordinator is restored and closed 264 times,
split enumerator is started 264 but only closed 195 times. It seems that
{{SourceCoordinator}} misses to close enumerator when job failover frequently.
I also went throught the code of {{SourceCoordinator}} and found some
suspicious point:
The {{started}} flag and {{enumerator}} is assigned in the main thread,
however {{SourceCoordinator#close()}} is executed async by
{{DeferrableCoordinator#closeAsync}}. That means the close method will check
the {{started}} and {{enumerator}} variable async. Is there any concurrency
problem here which mean lead to dirty read and miss to close the
{{enumerator}}?
I'm still not sure, because it's hard to reproduce locally, and we can't deploy
a custom flink version to production env.
[1]: https://github.com/ververica/flink-cdc-connectors
--
This message was sent by Atlassian Jira
(v8.3.4#803005)