Saketh Kurnool created FLINK-35830:
--------------------------------------
Summary: Large failed bulk request can result in TM OOM
Key: FLINK-35830
URL: https://issues.apache.org/jira/browse/FLINK-35830
Project: Flink
Issue Type: Bug
Components: Connectors / Opensearch
Reporter: Saketh Kurnool
Attachments: Screenshot 2024-07-12 at 2.50.44 PM.png
The {{BulkResponse}} object contains one {{Throwable}} per failing
document in the bulk request. The connector currently loops through the
failures, chains them into a single exception via suppression, and throws it.
When the bulk request is very large (>10,000 responses), the resulting stack
trace grows so large that serializing it causes the TaskManager (TM) to OOM.
Attached is the heap dump visualization of a TM that OOM'ed with a failing bulk
size of >10,000.
I have mitigated this issue in my local fork of the OpenSearch connector by
suppressing only those exceptions whose root cause has not already been seen in
the bulk response. This avoids massively nested stack traces in which every
failure has exactly the same root cause. NOTE that this proposed fix does *not*
mitigate the unlikely case in which every failing document in a very large
{{BulkResponse}} has a different root cause. I believe this is acceptable given
how infrequently that should occur, but it is worth revisiting in the future if
it becomes a problem.
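
A minimal sketch of the deduplication idea, for illustration only. It is not the code from the fork and not the connector's actual implementation. It uses plain JDK types, and the class name {{UniqueRootCauseChaining}}, the method names, and the choice to identify a root cause by its class name plus message are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: chains bulk failures, keeping one per distinct root cause.
class UniqueRootCauseChaining {

    /** Follows getCause() to the end of the chain; the depth bound guards against cycles. */
    static Throwable rootCause(Throwable t) {
        Throwable root = t;
        int depth = 0;
        while (root.getCause() != null && root.getCause() != root && depth++ < 100) {
            root = root.getCause();
        }
        return root;
    }

    /** Assumption: two failures share a root cause if class name and message match. */
    static String rootCauseKey(Throwable t) {
        Throwable root = rootCause(t);
        return root.getClass().getName() + ": " + root.getMessage();
    }

    /**
     * Returns the first failure with every other failure that has a not-yet-seen
     * root cause attached as a suppressed exception, or null if there are none.
     */
    static Throwable chainUniqueFailures(List<? extends Throwable> failures) {
        // LinkedHashMap keeps the first failure seen for each key, in arrival order.
        Map<String, Throwable> unique = new LinkedHashMap<>();
        for (Throwable failure : failures) {
            if (failure != null) {
                unique.putIfAbsent(rootCauseKey(failure), failure);
            }
        }
        Throwable chained = null;
        for (Throwable failure : unique.values()) {
            if (chained == null) {
                chained = failure;
            } else {
                chained.addSuppressed(failure);
            }
        }
        return chained;
    }

    public static void main(String[] args) {
        List<Throwable> failures = new ArrayList<>();
        // 10,000 failing documents that all share one root cause.
        for (int i = 0; i < 10_000; i++) {
            failures.add(new RuntimeException(
                    "doc " + i + " failed", new IllegalStateException("same root cause")));
        }
        // One failing document with a different root cause.
        failures.add(new RuntimeException(
                "doc failed", new IllegalArgumentException("different root cause")));

        Throwable chained = chainUniqueFailures(failures);
        // Number of suppressed exceptions equals distinct root causes minus one.
        System.out.println(chained.getSuppressed().length);
    }
}
```

With this approach the size of the chained exception depends on the number of distinct root causes rather than on the number of failing documents. As noted above, it gives no bound when every failure has a different root cause.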
@opensearch connector community: let me know if you think this fix would be
valuable - I'm happy to open a PR for this upstream!