Saketh Kurnool created FLINK-35830:
--------------------------------------

             Summary: Large failed bulk request can result in TM OOM
                 Key: FLINK-35830
                 URL: https://issues.apache.org/jira/browse/FLINK-35830
             Project: Flink
          Issue Type: Bug
          Components: Connectors / Opensearch
            Reporter: Saketh Kurnool
         Attachments: Screenshot 2024-07-12 at 2.50.44 PM.png

The {{BulkResponse}} object contains a single {{Throwable}} per failing 
document in the bulk request. The connector currently loops through the 
failures, combines them into a single exception via suppression, and throws it. 
In cases where the bulk size is very large (>10,000 responses), the resulting 
stack trace is so large that serializing it causes the TM to OOM. Attached is 
the heap dump visualization of a TM that OOM'ed with a failing bulk size of 
>10,000.
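
To make the failure mode concrete, here is a minimal sketch of the "combine via suppression" pattern described above. It is not the connector's actual code: the class and method names are hypothetical, and plain {{RuntimeException}} instances with a made-up message stand in for the per-document failures of a real {{BulkResponse}}.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the connector's failure handling (assumption, not the real code).
final class SuppressAllFailures {

    /** Chains every per-document failure into one exception via suppression. */
    static Throwable combine(List<Throwable> failures) {
        Throwable chained = null;
        for (Throwable failure : failures) {
            if (chained == null) {
                chained = failure;              // first failure becomes the primary exception
            } else {
                chained.addSuppressed(failure); // every later failure is attached as suppressed
            }
        }
        return chained;
    }

    /** Simulates a bulk response in which every document failed for the same (made-up) reason. */
    static List<Throwable> simulateFailures(int count) {
        List<Throwable> failures = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            failures.add(new RuntimeException("simulated bulk item failure"));
        }
        return failures;
    }

    public static void main(String[] args) {
        Throwable combined = combine(simulateFailures(10_000));
        // Each suppressed exception carries its own stack trace, so the combined
        // exception grows linearly with the number of failing documents.
        System.out.println("suppressed exceptions: " + combined.getSuppressed().length);
    }
}
```

In this sketch, 10,000 failures produce one exception carrying 9,999 suppressed exceptions, each with a full stack trace that has to be serialized along with it.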

I have mitigated this issue in my local fork of the Opensearch connector by 
only suppressing exceptions from a bulk response with unique root causes. This 
way, we avoid massive stack traces in which the root cause of every failure is 
exactly the same. Note that this proposed fix does *not* mitigate the unlikely 
case in which every failing document in a very large {{BulkResponse}} has a 
different root cause. I believe this is acceptable given how infrequently this 
would occur, but it is worth revisiting in the future if it becomes a problem.
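
A rough sketch of the mitigation, under stated assumptions: the names below are hypothetical, and treating two failures as having "the same root cause" when the innermost cause has the same class and message is my simplification for illustration. The actual change in the fork may define uniqueness differently.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of suppressing only failures with unique root causes.
final class UniqueRootCauseFailures {

    /** Walks the cause chain down to the innermost cause. */
    static Throwable rootCause(Throwable t) {
        Throwable root = t;
        while (root.getCause() != null && root.getCause() != root) {
            root = root.getCause();
        }
        return root;
    }

    /** Assumption: root causes are "the same" if class name and message match. */
    static String rootCauseKey(Throwable t) {
        Throwable root = rootCause(t);
        return root.getClass().getName() + ": " + root.getMessage();
    }

    /** Chains failures via suppression, skipping any whose root cause was already seen. */
    static Throwable combineUnique(List<Throwable> failures) {
        Set<String> seenRootCauses = new HashSet<>();
        Throwable chained = null;
        for (Throwable failure : failures) {
            if (!seenRootCauses.add(rootCauseKey(failure))) {
                continue; // duplicate root cause: do not attach another stack trace
            }
            if (chained == null) {
                chained = failure;
            } else {
                chained.addSuppressed(failure);
            }
        }
        return chained;
    }
}
```

With this approach the size of the combined exception is bounded by the number of distinct root causes rather than the number of failing documents. If failure messages embed per-document details such as IDs, a key based on the message would not deduplicate anything, which is the same limitation noted above.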

@opensearch connector community: let me know if you think this fix would be 
valuable - I'm happy to open a PR for this upstream!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
