[ https://issues.apache.org/jira/browse/FLINK-35830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saketh Kurnool updated FLINK-35830:
-----------------------------------
    Description: 
The {{BulkResponse}} object contains a single {{Throwable}} per failing 
document in the bulk request. The connector currently loops through the 
failures, combines them into a single exception via suppression, and throws it. 
When the bulk size is very large, the resulting stack trace grows so large 
that serializing it causes the TaskManager (TM) to OOM. Attached is a heap 
dump visualization of a TM that OOM'ed with a failing bulk size of 1,000.
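
For context, the aggregation pattern in question looks roughly like the sketch 
below (illustrative, not the connector's literal source; the method name 
{{throwCombinedFailures}} is mine):

{code:java}
import org.opensearch.action.bulk.BulkItemResponse;
import org.opensearch.action.bulk.BulkResponse;

// Rough sketch of the current behavior: every per-document failure is
// attached to one combined exception via addSuppressed(), then thrown.
static void throwCombinedFailures(BulkResponse response) {
    Throwable combined = null;
    for (BulkItemResponse item : response.getItems()) {
        if (!item.isFailed()) {
            continue;
        }
        Throwable failure = item.getFailure().getCause();
        if (combined == null) {
            combined = failure;
        } else {
            combined.addSuppressed(failure);
        }
    }
    if (combined != null) {
        // With thousands of failing items, this single exception drags
        // along thousands of suppressed stack traces; serializing it is
        // what blows up the TM heap.
        throw new RuntimeException("Bulk request failed", combined);
    }
}
{code}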

I have mitigated this issue in my local fork of the OS connector by 
suppressing only those exceptions from a bulk response that have unique root 
causes - this way, we avoid massively nested stack traces in which the root 
cause of every failure is identical. NOTE that this proposed fix does *not* 
mitigate the unlikely case in which every failing document in a very large 
{{BulkResponse}} has a different root cause. I believe this is acceptable 
given how infrequently that should occur, but it is worth revisiting in the 
future if it becomes a problem.
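
A minimal sketch of the mitigation, assuming two root causes are considered 
equal when their class and message match (my fork could reasonably use other 
criteria; {{throwDeduplicatedFailures}} and the keying scheme are illustrative):

{code:java}
import java.util.HashSet;
import java.util.Set;
import org.opensearch.action.bulk.BulkItemResponse;
import org.opensearch.action.bulk.BulkResponse;

// Sketch of the mitigation: a failure is suppressed onto the combined
// exception only if its root cause has not been seen before.
static void throwDeduplicatedFailures(BulkResponse response) {
    Throwable combined = null;
    Set<String> seenRootCauses = new HashSet<>();
    for (BulkItemResponse item : response.getItems()) {
        if (!item.isFailed()) {
            continue;
        }
        Throwable failure = item.getFailure().getCause();
        // Walk to the root of the cause chain.
        Throwable root = failure;
        while (root.getCause() != null && root.getCause() != root) {
            root = root.getCause();
        }
        // Illustrative uniqueness key: root cause class plus message.
        String key = root.getClass().getName() + ": " + root.getMessage();
        if (!seenRootCauses.add(key)) {
            continue; // this root cause is already represented
        }
        if (combined == null) {
            combined = failure;
        } else {
            combined.addSuppressed(failure);
        }
    }
    if (combined != null) {
        throw new RuntimeException("Bulk request failed", combined);
    }
}
{code}

With this in place, 1,000 failed documents sharing one root cause produce a 
single stack trace instead of 1,000 suppressed copies.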

@opensearch connector community: let me know if you think this fix would be 
valuable - I'm happy to open a PR for this upstream!

  was:
The {{BulkResponse}} object contains a single {{Throwable}} per failing 
document in the bulk request. The connector currently loops through the 
failures, combines them into a single exception via suppression, and throws it. 
When the bulk size is very large (>1,000 responses), the resulting stack trace 
grows so large that serializing it causes the TaskManager (TM) to OOM. 
Attached is a heap dump visualization of a TM that OOM'ed with a failing bulk 
size of 1,000.

I have mitigated this issue in my local fork of the OS connector by 
suppressing only those exceptions from a bulk response that have unique root 
causes - this way, we avoid massively nested stack traces in which the root 
cause of every failure is identical. NOTE that this proposed fix does *not* 
mitigate the unlikely case in which every failing document in a very large 
{{BulkResponse}} has a different root cause. I believe this is acceptable 
given how infrequently that should occur, but it is worth revisiting in the 
future if it becomes a problem.

@opensearch connector community: let me know if you think this fix would be 
valuable - I'm happy to open a PR for this upstream!


> Large failed bulk request can result in TM OOM
> ----------------------------------------------
>
>                 Key: FLINK-35830
>                 URL: https://issues.apache.org/jira/browse/FLINK-35830
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Opensearch
>            Reporter: Saketh Kurnool
>            Priority: Major
>         Attachments: Screenshot 2024-07-12 at 2.50.44 PM.png
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
