[
https://issues.apache.org/jira/browse/FLINK-35830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Saketh Kurnool updated FLINK-35830:
-----------------------------------
Description:
The {{BulkResponse}} object contains one {{Throwable}} per failing document
in the bulk request. The connector currently loops through the failures,
combines them into a single exception by adding each one as a suppressed
exception, and throws it. When the bulk size is very large (>10,000
responses), the resulting stack trace is so large that serializing it causes
the TaskManager (TM) to run out of memory (OOM). Attached is a heap dump
visualization of a TM that OOM'ed with a failing bulk size of >10,000.
I have mitigated this issue in my local fork of the OpenSearch connector by
adding a failure from each bulk response as a suppressed exception only if its
root cause has not been seen before. This avoids massive stack traces that
repeat the exact same root cause for every failure. NOTE that this proposed
fix does *not* mitigate the unlikely case in which every failing document in a
very large {{BulkResponse}} has a different root cause. I believe this is
acceptable given how infrequently this would occur, but it is worth revisiting
in the future if it becomes a problem.
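To illustrate the idea, here is a minimal standalone sketch of suppressing only
failures with unique root causes. It is not the code from my fork and it does
not touch the connector or the OpenSearch client: the class name
{{BulkFailureDedup}}, the helpers {{rootCause}}, {{rootCauseKey}} and
{{combine}}, and the choice of "root cause class name plus message" as the
uniqueness key are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of the connector.
public class BulkFailureDedup {

    /** Follows the cause chain to its end, stopping if the chain is cyclic. */
    static Throwable rootCause(Throwable t) {
        Set<Throwable> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        Throwable current = t;
        while (current.getCause() != null && visited.add(current)) {
            current = current.getCause();
        }
        return current;
    }

    /** Assumed uniqueness key: root cause class name plus its message. */
    static String rootCauseKey(Throwable t) {
        Throwable root = rootCause(t);
        return root.getClass().getName() + ": " + root.getMessage();
    }

    /**
     * Builds one exception for the whole bulk, attaching a failure as
     * suppressed only the first time its root cause key is seen.
     */
    static RuntimeException combine(List<? extends Throwable> failures) {
        RuntimeException combined =
                new RuntimeException(failures.size() + " bulk item(s) failed");
        Set<String> seenKeys = new HashSet<>();
        for (Throwable failure : failures) {
            if (seenKeys.add(rootCauseKey(failure))) {
                combined.addSuppressed(failure);
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Throwable> failures = new ArrayList<>();
        // 10,000 failures that all share the same root cause...
        for (int i = 0; i < 10_000; i++) {
            failures.add(new RuntimeException(
                    "item " + i, new IllegalStateException("mapping error")));
        }
        // ...plus one failure with a different root cause.
        failures.add(new RuntimeException(
                "item 10000", new IllegalArgumentException("version conflict")));

        RuntimeException combined = combine(failures);
        System.out.println(combined.getSuppressed().length); // prints 2
    }
}
```

With this approach the number of suppressed exceptions is bounded by the number
of distinct root causes rather than by the number of failed documents. As noted
above, it gives no reduction when every failure has a different root cause.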
@opensearch connector community: let me know if you think this fix would be
valuable - I'm happy to open a PR for this upstream!
was:
The {{BulkResponse}} object contains a single {{Throwable}} per failing
document in the bulk request. The connector currently loops through the
failures, combines them into a single exception via suppression, and throws it.
In cases where the bulk size is very large (>10,000 responses), the size of the
resulting stack trace is so large that serializing it causes the TM to OOM.
Attached is the heap dump visualization of a TM that OOM'ed with a failing bulk
size of >10,000.
I have mitigated this issue in my local fork of the OS connector by only
suppressing exceptions from a bulk response with unique root causes - this way,
we can avoid massively nested stack traces where the root cause of every
failure is the exact same. NOTE that this proposed fix does *not* mitigate the
unlikely case in which every failing document in a very large {{BulkResponse}}
has a different root cause. I believe this is acceptable judging by how
infrequently this would occur, but it is worth revisiting in the future if it
becomes a problem.
@opensearch connector community: let me know if you think this fix would be
valuable - I'm happy to open a PR for this upstream!
> Large failed bulk request can result in TM OOM
> ----------------------------------------------
>
> Key: FLINK-35830
> URL: https://issues.apache.org/jira/browse/FLINK-35830
> Project: Flink
> Issue Type: Bug
> Components: Connectors / Opensearch
> Reporter: Saketh Kurnool
> Priority: Major
> Attachments: Screenshot 2024-07-12 at 2.50.44 PM.png
>
>