[
https://issues.apache.org/jira/browse/SAMZA-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887838#comment-16887838
]
Marouane RAJI commented on SAMZA-2265:
--------------------------------------
Hi Prateek,
Thanks for the answer.
I haven't done a thorough analysis of this, but it seems that quite a few of
them are referenced by checkpointManager.
!image-2019-07-18-11-09-46-217.png!
We suspect that the change in the commit might solve our issue. We were not
sure initially that it was introduced in 1.2. We will upgrade and report back
here.
Let me know if you need any more details or the heap dumps.
Thanks,
> Memory leak potentially due to Kafka Checkpoint Management
> ----------------------------------------------------------
>
> Key: SAMZA-2265
> URL: https://issues.apache.org/jira/browse/SAMZA-2265
> Project: Samza
> Issue Type: Bug
> Affects Versions: 1.0, 1.1
> Environment:
>
> Reporter: Marouane RAJI
> Priority: Major
> Attachments: image-2019-07-01-09-47-11-241.png,
> image-2019-07-01-09-48-45-876.png, image-2019-07-01-09-50-04-693.png,
> image-2019-07-18-11-09-46-217.png
>
>
> Hi,
> We recently upgraded one of our high-throughput Samza jobs from 0.13.1 to 1.0,
> then to 1.1. In both later versions we appear to have a memory leak. The
> ever-increasing memory usage leads to containers failing and YARN
> restarting them.
> It is worth noting that we upgraded other, smaller (in container specs and
> throughput) Samza jobs without any issues.
> Job specs:
> * reading ~70k msg/sec
> * 211 input topics, including one broadcast topic (2 msg/day, used for
> config updates)
> * 1 output topic
> ```
> job.container.count=110
> yarn.container.memory.mb=4000
> yarn.container.cpu.cores=8
> yarn.am.container.cpu.cores=8
> yarn.am.container.memory.mb=1024
> task.opts=-Xmx2800M
> task.checkpoint.replication.factor=2
> ```
> Below is the memory consumption of one container in both versions:
> !image-2019-07-01-09-47-11-241.png!
>
> Heap-dumps comparison:
> !image-2019-07-01-09-48-45-876.png!
>
> The difference between the two versions keeps increasing slowly; the main
> cause is the increase in byte[].
> In versions 1.0 and 1.1, the main reference holding these bytes seems to be
> KafkaCheckpointManager:
> !image-2019-07-01-09-50-04-693.png!
>
> We found this PR, which should be included in 1.1:
> [https://github.com/apache/samza/pull/993]. We are not sure whether it is
> related to this issue.
> Thanks.
>
>