[ 
https://issues.apache.org/jira/browse/SAMZA-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887278#comment-16887278
 ] 

Prateek Maheshwari commented on SAMZA-2265:
-------------------------------------------

Marouane, can you confirm the identity (clientId) of the KafkaSystemConsumer 
that's holding on to these messages? If it is indeed the one used by 
CheckpointManager, the commit you mentioned is available in Samza 1.2. Can you 
try upgrading to 1.2 and see if it fixes the issue?

> Memory leak potentially due to Kafka Checkpoint Management
> ----------------------------------------------------------
>
>                 Key: SAMZA-2265
>                 URL: https://issues.apache.org/jira/browse/SAMZA-2265
>             Project: Samza
>          Issue Type: Bug
>    Affects Versions: 1.0, 1.1
>         Environment:  
>  
>            Reporter: Marouane RAJI
>            Priority: Major
>         Attachments: image-2019-07-01-09-47-11-241.png, 
> image-2019-07-01-09-48-45-876.png, image-2019-07-01-09-50-04-693.png
>
>
> Hi, 
> We recently upgraded one of our high-throughput Samza jobs from 0.13.1 to 1.0 
> and then to 1.1. Both later versions appear to have a memory leak. The 
> ever-increasing memory usage leads to containers failing and YARN 
> restarting them.
> It is worth noting that we upgraded other, smaller (in container specs and 
> throughput) Samza jobs without any issues.
> Specs of the job:
>  * reading ~70k msg/sec
>  * 211 input topics, including one broadcast topic (2 msg/day, used for 
> config updates)
>  * 1 output topic
> ```
> job.container.count=110
> yarn.container.memory.mb=4000
> yarn.container.cpu.cores=8
> yarn.am.container.cpu.cores=8
> yarn.am.container.memory.mb=1024
> task.opts=-Xmx2800M
> task.checkpoint.replication.factor=2
> ```
> Below is the memory consumption of one container in both versions:
> !image-2019-07-01-09-47-11-241.png!
>  
> Heap-dumps comparison: 
> !image-2019-07-01-09-48-45-876.png!
>  
> The difference between the two versions keeps increasing slowly; the main 
> cause is the increase in byte[].
> In versions 1.0 and 1.1, the main reference holding these bytes seems to be 
> KafkaCheckpointManager:
>  !image-2019-07-01-09-50-04-693.png!
>  
> We found this PR, which should be included in 1.1: 
> [https://github.com/apache/samza/pull/993]. We are not sure whether it is 
> related to this issue.
> Thanks. 
>  
>  
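
As a rough aid for reading the configuration quoted above, here is a minimal Python sketch that works out how much of the YARN container limit is left over once the JVM heap cap is subtracted. The helper names (parse_properties, max_heap_mb) are hypothetical and not part of Samza; the values are copied from the quoted settings, and the -Xmx parsing assumes only M or G suffixes.

```python
import re

# Settings copied from the quoted job configuration.
CONFIG_TEXT = """
job.container.count=110
yarn.container.memory.mb=4000
task.opts=-Xmx2800M
"""


def parse_properties(text):
    """Parse simple key=value lines into a dict (hypothetical helper)."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def max_heap_mb(task_opts):
    """Extract the -Xmx value in MB; assumes an M or G suffix."""
    match = re.search(r"-Xmx(\d+)([mMgG])", task_opts)
    if match is None:
        raise ValueError("no -Xmx setting found")
    size = int(match.group(1))
    return size * 1024 if match.group(2).lower() == "g" else size


props = parse_properties(CONFIG_TEXT)
container_mb = int(props["yarn.container.memory.mb"])
heap_mb = max_heap_mb(props["task.opts"])
headroom_mb = container_mb - heap_mb

print(f"heap cap: {heap_mb} MB, container limit: {container_mb} MB, "
      f"headroom: {headroom_mb} MB")
```

With the quoted values this gives a 2800 MB heap cap inside a 4000 MB container, leaving 1200 MB for everything outside the Java heap. Since the reported growth is in byte[] on the heap, the heap cap is the number to compare against the heap-dump sizes.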



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
