[jira] [Updated] (FLINK-35830) Large failed bulk request can result in TM OOM

Saketh Kurnool (Jira) Fri, 12 Jul 2024 15:28:06 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-35830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Saketh Kurnool updated FLINK-35830:
-----------------------------------
    Description: 
The {{BulkResponse}} object contains one {{Throwable}} per failing 
document in the bulk request. The connector currently loops through the 
failures, combines them into a single exception via suppression, and throws it. 
When the bulk request is very large (>10,000 responses), the resulting stack 
trace is so large that serializing it causes the TaskManager (TM) to OOM. 
Attached is the heap dump visualization of a TM that OOM'ed with a failing bulk 
size of >10,000.

I have mitigated this issue in my local fork of the OpenSearch connector by 
suppressing only those exceptions from each bulk response that have unique root 
causes. This way, we avoid massively nested stack traces in which the root cause 
of every failure is exactly the same. NOTE that this proposed fix does *not* 
mitigate the unlikely case in which every failing document in a very large 
{{BulkResponse}} has a different root cause. I believe this is acceptable given 
how infrequently this would occur, but it is worth revisiting in the future if 
it becomes a problem.
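
To make the idea concrete, here is a minimal sketch of the deduplication step. It is not the code from my fork or from the connector: the class name {{BulkFailureDeduplicator}}, its methods, and the choice to identify a root cause by class name plus message are all assumptions made for illustration. It uses only standard JDK calls ({{Throwable#getCause}}, {{Throwable#addSuppressed}}).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the OpenSearch connector.
class BulkFailureDeduplicator {

    /** Walks the cause chain to the innermost Throwable, with a depth cap against cycles. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        int depth = 0;
        while (current.getCause() != null && current.getCause() != current && depth++ < 100) {
            current = current.getCause();
        }
        return current;
    }

    /** Assumption: two failures share a root cause if root class and message match. */
    static String rootCauseKey(Throwable t) {
        Throwable root = rootCause(t);
        return root.getClass().getName() + ": " + root.getMessage();
    }

    /** Builds one exception that suppresses a single representative per distinct root cause. */
    static RuntimeException combine(List<Throwable> failures) {
        // LinkedHashMap keeps the first failure seen for each key, in encounter order.
        Map<String, Throwable> unique = new LinkedHashMap<>();
        for (Throwable failure : failures) {
            unique.putIfAbsent(rootCauseKey(failure), failure);
        }
        RuntimeException combined = new RuntimeException(
                failures.size() + " bulk item(s) failed with "
                        + unique.size() + " distinct root cause(s)");
        for (Throwable representative : unique.values()) {
            combined.addSuppressed(representative);
        }
        return combined;
    }
}
```

With this keying, 10,000 failures that share one root cause produce a single suppressed exception instead of 10,000. If the root-cause message embeds a per-document value such as a document ID, every key would be distinct and the sketch would not reduce anything, which is the unmitigated case described above.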

@opensearch connector community: let me know if you think this fix would be 
valuable - I'm happy to open a PR for this upstream!

  was:
The {{BulkResponse}} object contains a single {{Throwable}} per failing 
document in the bulk request. The connector currently loops through the 
failures, combines them into a single exception via suppression, and throws it. 
In cases where the bulk size is very large (>10,000 responses), the size of the 
resulting stack trace is so large that serializing it causes the TM to OOM. 
Attached is the heap dump visualization of a TM that OOM'ed with a failing bulk 
size of >10,000.

I have mitigated this issue in my local fork of the OS connector by only 
suppressing exceptions from a bulk response with unique root causes - this way, 
we can avoid massively nested stack traces where the root cause of every 
failure is the exact same. NOTE that this proposed fix does *not* mitigate the 
unlikely case in which every failing document in a very large {{BulkResponse}} 
has a different root cause. I believe this is acceptable judging by how 
infrequently this would occur, but it is worth revisiting in the future if it 
becomes a problem.

@opensearch connector community: let me know if you think this fix would be 
valuable - I'm happy to open a PR for this upstream!


> Large failed bulk request can result in TM OOM
> ----------------------------------------------
>
>                 Key: FLINK-35830
>                 URL: https://issues.apache.org/jira/browse/FLINK-35830
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Opensearch
>            Reporter: Saketh Kurnool
>            Priority: Major
>         Attachments: Screenshot 2024-07-12 at 2.50.44 PM.png
>
>
> The {{BulkResponse}} object contains one {{Throwable}} per failing 
> document in the bulk request. The connector currently loops through the 
> failures, combines them into a single exception via suppression, and throws 
> it. When the bulk request is very large (>10,000 responses), the resulting 
> stack trace is so large that serializing it causes the TaskManager (TM) to 
> OOM. Attached is the heap dump visualization of a TM that OOM'ed with a 
> failing bulk size of >10,000.
> I have mitigated this issue in my local fork of the OpenSearch connector by 
> suppressing only those exceptions from each bulk response that have unique 
> root causes. This way, we avoid massively nested stack traces in which the 
> root cause of every failure is exactly the same. NOTE that this proposed fix 
> does *not* mitigate the unlikely case in which every failing document in a 
> very large {{BulkResponse}} has a different root cause. I believe this is 
> acceptable given how infrequently this would occur, but it is worth 
> revisiting in the future if it becomes a problem.
> @opensearch connector community: let me know if you think this fix would be 
> valuable - I'm happy to open a PR for this upstream!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
