[ 
https://issues.apache.org/jira/browse/SAMZA-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marouane RAJI updated SAMZA-2265:
---------------------------------
    Environment: 
 

 

  was:
 

```

job.container.count=110

yarn.container.memory.mb=4000
yarn.container.cpu.cores=8
yarn.am.container.cpu.cores=8
yarn.am.container.memory.mb=1024
task.opts=-Xmx2800M
task.checkpoint.replication.factor=2

 ```
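
To put the previous environment values in perspective, the sketch below works out the memory each container has left outside the JVM heap, and the total memory the job requests from YARN. It is only arithmetic on the settings listed above. The class and method names are hypothetical, and it assumes `-Xmx` and `yarn.container.memory.mb` are the only two quantities that matter, which ignores metaspace, thread stacks and direct buffers.

```java
public class ContainerHeadroom {
    // Memory left in a container once the maximum JVM heap is subtracted.
    static int headroomMb(int containerMb, int heapMb) {
        return containerMb - heapMb;
    }

    // Total memory requested from YARN for all task containers (AM excluded).
    static long totalRequestedMb(int containerCount, int containerMb) {
        return (long) containerCount * containerMb;
    }

    public static void main(String[] args) {
        int containerMb = 4000;   // yarn.container.memory.mb
        int heapMb = 2800;        // task.opts=-Xmx2800M
        int containerCount = 110; // job.container.count

        System.out.println("Non-heap headroom per container (MB): "
                + headroomMb(containerMb, heapMb));
        System.out.println("Total task container memory (MB): "
                + totalRequestedMb(containerCount, containerMb));
    }
}
```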


> Memory leak potentially due to Kafka Checkpoint Management
> ----------------------------------------------------------
>
>                 Key: SAMZA-2265
>                 URL: https://issues.apache.org/jira/browse/SAMZA-2265
>             Project: Samza
>          Issue Type: Bug
>    Affects Versions: 1.0, 1.1
>         Environment:  
>  
>            Reporter: Marouane RAJI
>            Priority: Major
>         Attachments: image-2019-07-01-09-47-11-241.png, 
> image-2019-07-01-09-48-45-876.png, image-2019-07-01-09-50-04-693.png
>
>
> Hi,
> We recently upgraded one of our high-throughput Samza jobs from 0.13.1 to 1.0 
> and then to 1.1. Both newer versions appear to have a memory leak: memory use 
> keeps growing until the containers fail and YARN restarts them.
> Note that we upgraded other, smaller (in container specs and throughput) 
> Samza jobs without any issues.
> Job specs:
>  * reading ~70k msg/sec
>  * 211 input topics, including one broadcast topic (2 msg/day, used for 
> config updates)
>  * 1 output topic
> Below is the memory consumption of one container in both versions:
> !image-2019-07-01-09-47-11-241.png!
>  
> Heap dump comparison:
> !image-2019-07-01-09-48-45-876.png!
>  
> The difference between the two versions keeps increasing slowly; the main 
> cause is the growth in byte[] instances.
> In versions 1.0 and 1.1, the main reference holding these bytes seems to be 
> KafkaCheckpointManager:
> !image-2019-07-01-09-50-04-693.png!
>  
> Could this PR [https://github.com/apache/samza/pull/993] solve this issue, 
> since it would release the KafkaConsumer used for checkpointing?
> Thanks. 
>  
>  
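
The idea raised in the question above, releasing a consumer once checkpoint reading is done so that its buffered bytes can be reclaimed, can be sketched generically as follows. This is a minimal illustration and not Samza or Kafka code: `BufferingConsumer` and `CheckpointReader` are hypothetical stand-ins, and nothing here is taken from the linked pull request.

```java
import java.util.ArrayList;
import java.util.List;

public class CheckpointConsumerSketch {
    /** Hypothetical stand-in for a consumer that buffers fetched data as byte[]. */
    static final class BufferingConsumer implements AutoCloseable {
        private final List<byte[]> buffers = new ArrayList<>();
        private boolean closed = false;

        void poll(int bytes) {
            if (closed) {
                throw new IllegalStateException("consumer is closed");
            }
            buffers.add(new byte[bytes]); // retained until close() is called
        }

        int bufferedChunks() {
            return buffers.size();
        }

        @Override
        public void close() {
            buffers.clear();
            closed = true;
        }
    }

    /** Hypothetical manager that needs the consumer only while reading a checkpoint. */
    static final class CheckpointReader {
        private BufferingConsumer consumer = new BufferingConsumer();

        int readLastCheckpoint() {
            consumer.poll(1024);
            return consumer.bufferedChunks();
        }

        void stopReading() {
            if (consumer != null) {
                consumer.close();
                consumer = null; // drop the reference so the buffers become collectable
            }
        }

        boolean holdsConsumer() {
            return consumer != null;
        }
    }

    public static void main(String[] args) {
        CheckpointReader reader = new CheckpointReader();
        reader.readLastCheckpoint();
        reader.stopReading();
        System.out.println("holds consumer after stop: " + reader.holdsConsumer());
    }
}
```

If the manager kept the consumer for the lifetime of the container instead, every buffer it accumulated would stay reachable, which is the kind of retention a heap dump would attribute to the manager.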



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
