[
https://issues.apache.org/jira/browse/FLINK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17812760#comment-17812760
]
Alexis Sarda-Espinosa edited comment on FLINK-34325 at 1/31/24 3:46 PM:
------------------------------------------------------------------------
I just did a quick test where I increased the memory for the TM without
changing the JM, and I could definitely load more metadata into the cache
before the job crashed. I wonder if the reported OOM error actually comes from
the TM and is merely surfaced by the JM?
EDIT: I think this is the case. I was logging a lot in the TM, so I didn't see
it at first, but I did find the exception and its stack trace in the TM logs
after all.
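(For reference, the TM and JM memory pools are tuned through separate options;
below is a minimal sketch of the two knobs involved, with purely illustrative
values rather than the ones used in this test. These are the same options
exposed as {{taskmanager.memory.process.size}} and
{{jobmanager.memory.process.size}} in flink-conf.yaml.)
{code:java}
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.JobManagerOptions;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.configuration.TaskManagerOptions;

public class MemorySettingsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Raise the TaskManager's total process memory (illustrative value)...
        conf.set(TaskManagerOptions.TOTAL_PROCESS_MEMORY, MemorySize.parse("4g"));
        // ...while leaving the JobManager's total process memory unchanged.
        conf.set(JobManagerOptions.TOTAL_PROCESS_MEMORY, MemorySize.parse("1600m"));
        System.out.println(conf);
    }
}
{code}
On Kubernetes these would normally be set through flink-conf.yaml or the
deployment spec rather than programmatically.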
> Inconsistent state with data loss after OutOfMemoryError in Job Manager
> -----------------------------------------------------------------------
>
> Key: FLINK-34325
> URL: https://issues.apache.org/jira/browse/FLINK-34325
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.17.1
> Environment: Flink on Kubernetes with HA, RocksDB with incremental
> checkpoints on Azure
> Reporter: Alexis Sarda-Espinosa
> Priority: Major
> Attachments: jobmanager_log.txt
>
>
> I have a job that uses broadcast state to maintain a cache of required
> metadata. I am currently evaluating the memory requirements of my specific use
> case, and I ran into a weird situation that seems worrisome.
> All sources in my job are Kafka sources. I wrote a large number of messages
> to Kafka to force the broadcast state's cache to grow. At some point, this
> caused a "{{java.lang.OutOfMemoryError: Java heap space}}" error in the Job
> Manager. I would have expected the whole Java process of the JM to crash, but
> the job was simply restarted. What's worrisome is that, after 2 restarts, the
> job resumed from the latest successful checkpoint and completely ignored all
> the events I had written to Kafka. I can verify this because I have a custom
> metric that exposes the approximate size of this cache, and because the job
> didn't crashloop at this point, which it presumably would have done if it had
> read all the messages from Kafka over and over again.
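> For illustration, a minimal sketch of this kind of broadcast-state cache with
> an approximate-size gauge (class, field, and metric names here are
> hypothetical, not taken from the actual job):
> {code:java}
> // Hypothetical sketch of a broadcast-state metadata cache with a size gauge.
> import org.apache.flink.api.common.state.MapStateDescriptor;
> import org.apache.flink.api.common.typeinfo.Types;
> import org.apache.flink.configuration.Configuration;
> import org.apache.flink.metrics.Gauge;
> import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
> import org.apache.flink.util.Collector;
>
> public class MetadataCacheFunction extends BroadcastProcessFunction<String, String, String> {
>
>     // Descriptor shared between the broadcast stream and this function.
>     public static final MapStateDescriptor<String, String> CACHE_DESCRIPTOR =
>             new MapStateDescriptor<>("metadata-cache", Types.STRING, Types.STRING);
>
>     // Updated in processBroadcastElement and exposed through a gauge, since the
>     // broadcast state itself is only reachable via the function's Context.
>     private transient long approximateCacheSize;
>
>     @Override
>     public void open(Configuration parameters) {
>         getRuntimeContext()
>                 .getMetricGroup()
>                 .gauge("approximateCacheSize", (Gauge<Long>) () -> approximateCacheSize);
>     }
>
>     @Override
>     public void processBroadcastElement(String metadata, Context ctx, Collector<String> out)
>             throws Exception {
>         // Every parallel instance keeps its own copy of the broadcast state.
>         ctx.getBroadcastState(CACHE_DESCRIPTOR).put(keyOf(metadata), metadata);
>         approximateCacheSize++;
>     }
>
>     @Override
>     public void processElement(String event, ReadOnlyContext ctx, Collector<String> out)
>             throws Exception {
>         // Enrich regular events with whatever metadata is currently cached.
>         String cached = ctx.getBroadcastState(CACHE_DESCRIPTOR).get(keyOf(event));
>         out.collect(cached != null ? event + "|" + cached : event);
>     }
>
>     private static String keyOf(String value) {
>         return value.split(",", 2)[0];
>     }
> }
> {code}
> The streams would typically be wired as
> {{events.connect(metadata.broadcast(MetadataCacheFunction.CACHE_DESCRIPTOR)).process(new MetadataCacheFunction())}}.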
> I'm attaching an excerpt of the Job Manager's logs. My main concerns are:
> # It seems the memory error from the JM didn't prevent the Kafka offsets from
> being "rolled back", so the Kafka events that should have ended up in the
> broadcast state's cache were eventually ignored.
> # Is it normal that the state is somehow "materialized" in the JM and is thus
> affected by the size of the JM's heap? Is this something particular to the use
> of broadcast state? I found this very surprising.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)