[ https://issues.apache.org/jira/browse/FLINK-35830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Saketh Kurnool updated FLINK-35830:
-----------------------------------
Description:
The {{BulkResponse}} object contains one {{Throwable}} per failing document in
the bulk request. The connector currently loops through the failures, combines
them into a single exception via suppression, and throws it. When a very large
bulk request fails, the resulting stack trace is so large that serializing it
causes the TaskManager (TM) to run out of memory (OOM). Attached is a heap dump
visualization of a TM that OOM'ed with a failing bulk size of 1,000.
I have mitigated this issue in my local fork of the OpenSearch connector by
suppressing only those exceptions in a bulk response that have unique root
causes. This avoids massive stack traces in which every failure has the same
root cause. NOTE that this proposed fix does *not* mitigate the unlikely case
in which every failing document in a very large {{BulkResponse}} has a
different root cause. I believe this is acceptable given how infrequently this
would occur, but it is worth revisiting if it becomes a problem.
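A minimal sketch of the deduplication idea described above, under stated
assumptions. It is not the code from the fork or from the connector. The class
{{BulkFailureDeduplicator}} and its methods are hypothetical names, and
treating "same root cause" as "same root exception class and message" is an
assumption. If failure messages embed per-document details such as document
IDs, a coarser key would be needed.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: combines bulk failures, keeping one exception per distinct root cause. */
final class BulkFailureDeduplicator {

    private BulkFailureDeduplicator() {}

    /** Walks the cause chain to its end; the identity set guards against cyclic chains. */
    static Throwable rootCause(Throwable t) {
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Throwable current = t;
        while (current.getCause() != null && seen.add(current)) {
            current = current.getCause();
        }
        return current;
    }

    /** Assumption: two failures share a root cause if root class name and message match. */
    static String rootCauseKey(Throwable t) {
        Throwable root = rootCause(t);
        return root.getClass().getName() + ": " + root.getMessage();
    }

    /**
     * Builds a single exception for the whole bulk response. Only the first
     * failure seen for each root-cause key is attached as a suppressed
     * exception, so the size of the result grows with the number of distinct
     * root causes rather than with the number of failed documents.
     */
    static RuntimeException combine(List<? extends Throwable> failures) {
        RuntimeException combined =
                new RuntimeException(failures.size() + " bulk item(s) failed");
        Set<String> seenKeys = new HashSet<>();
        for (Throwable failure : failures) {
            if (failure != null && seenKeys.add(rootCauseKey(failure))) {
                combined.addSuppressed(failure);
            }
        }
        return combined;
    }
}
```

With this sketch, 1,000 failures that share one root cause yield a combined
exception carrying a single suppressed exception. As noted above, it does not
bound the size when every failure has a different root cause.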
@opensearch connector community: let me know if you think this fix would be
valuable - I'm happy to open a PR for this upstream!
was:
The {{BulkResponse}} object contains a single {{Throwable}} per failing
document in the bulk request. The connector currently loops through the
failures, combines them into a single exception via suppression, and throws it.
In cases where the bulk size is very large (>1,000 responses), the size of the
resulting stack trace is so large that serializing it causes the TM to OOM.
Attached is the heap dump visualization of a TM that OOM'ed with a failing bulk
size of 1,000.
I have mitigated this issue in my local fork of the OS connector by only
suppressing exceptions from a bulk response with unique root causes - this way,
we can avoid massively nested stack traces where the root cause of every
failure is the exact same. NOTE that this proposed fix does *not* mitigate the
unlikely case in which every failing document in a very large {{BulkResponse}}
has a different root cause. I believe this is acceptable judging by how
infrequently this would occur, but it is worth revisiting in the future if it
becomes a problem.
@opensearch connector community: let me know if you think this fix would be
valuable - I'm happy to open a PR for this upstream!
> Large failed bulk request can result in TM OOM
> ----------------------------------------------
>
> Key: FLINK-35830
> URL: https://issues.apache.org/jira/browse/FLINK-35830
> Project: Flink
> Issue Type: Bug
> Components: Connectors / Opensearch
> Reporter: Saketh Kurnool
> Priority: Major
> Attachments: Screenshot 2024-07-12 at 2.50.44 PM.png
>
>
> The {{BulkResponse}} object contains one {{Throwable}} per failing document
> in the bulk request. The connector currently loops through the failures,
> combines them into a single exception via suppression, and throws it. When a
> very large bulk request fails, the resulting stack trace is so large that
> serializing it causes the TaskManager (TM) to run out of memory (OOM).
> Attached is a heap dump visualization of a TM that OOM'ed with a failing bulk
> size of 1,000.
> I have mitigated this issue in my local fork of the OpenSearch connector by
> suppressing only those exceptions in a bulk response that have unique root
> causes. This avoids massive stack traces in which every failure has the same
> root cause. NOTE that this proposed fix does *not* mitigate the unlikely case
> in which every failing document in a very large {{BulkResponse}} has a
> different root cause. I believe this is acceptable given how infrequently
> this would occur, but it is worth revisiting if it becomes a problem.
> @opensearch connector community: let me know if you think this fix would be
> valuable - I'm happy to open a PR for this upstream!