[
https://issues.apache.org/jira/browse/TEZ-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Balamohan updated TEZ-776:
---------------------------------
Attachment: with_patch_jmc_output_of_AM.png
Without_patch_AM_CPU_Usage.png
without_patch_jmc_output_of_AM.png
Tried out the patch with a large synthetic job on the cluster.
Source (200 tasks) --> IntermediateVertex1 (20,000 tasks) -->
IntermediateVertex2 (20,000 tasks) --> Output (10,000 tasks)
All edges are of type ScatterGather, and the job was run with 4 GB containers.
- Without the patch, the AM died within 7 minutes due to OOM (the expected
scenario). I have attached the JMC output here. Note that for the first 5
minutes or so, the AM's JVM CPU usage (see attachment) was only around 5%.
Once GC pressure set in (because IntermediateVertex1 started sending events to
IntermediateVertex2), "JVM CPU Usage" spiked heavily and stayed there until
the AM died. The "top" output for the AM showed "1600%" on a 24-CPU node,
which is close to 66% of the machine and matches the JMC JVM CPU usage.
- With the patch, the AM completed IntermediateVertex1 --> IntermediateVertex2
(i.e. 20K x 20K tasks) successfully. However, when I tried to profile via JMC
(the last 1 minute of the diagram), the AM crashed due to a JMC issue. "JVM
CPU Usage" stayed around 5% until Source --> IntermediateVertex1 finished, and
spiked to about 10% during the transfer from IntermediateVertex1 -->
IntermediateVertex2. There is no significant CPU overhead, and memory usage
was around 1-1.2 GB, which is good. I need one more run with the profiler to
understand which methods are expensive during the transfer from
IntermediateVertex1 --> IntermediateVertex2.
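For context on why the without-patch run hits OOM: assuming one retained
DataMovementEvent per source/destination task pair on a ScatterGather edge
(a simplification; the real AM may batch or compact events), at the 64
bytes/event figure from the issue description the 20K x 20K edge alone would
need tens of GB of heap. A rough back-of-the-envelope sketch:

```python
# Rough estimate only; assumes one 64-byte DataMovementEvent per
# (source task, destination task) pair, per the issue description.
EVENT_SIZE_BYTES = 64

def event_memory_gb(src_tasks, dst_tasks, event_size=EVENT_SIZE_BYTES):
    """Heap needed to retain all DataMovementEvents for one ScatterGather edge."""
    return src_tasks * dst_tasks * event_size / 2**30

# IntermediateVertex1 (20,000 tasks) --> IntermediateVertex2 (20,000 tasks)
print(round(event_memory_gb(20_000, 20_000), 1))  # ~23.8 GB, vs. a 4 GB container

# Sanity check on the "top" reading: 1600% of one core on a 24-core node
print(round(1600 / (24 * 100) * 100))  # ~67% of total machine CPU
```

This is consistent with the observed behavior: the heap demand is roughly
six times the container size, so the AM cannot survive that edge without
reducing per-event retention.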
> Reduce AM mem usage caused by storing TezEvents
> -----------------------------------------------
>
> Key: TEZ-776
> URL: https://issues.apache.org/jira/browse/TEZ-776
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Siddharth Seth
> Assignee: Bikas Saha
> Attachments: TEZ-776.ondemand.1.patch, TEZ-776.ondemand.2.patch,
> TEZ-776.ondemand.3.patch, TEZ-776.ondemand.4.patch, TEZ-776.ondemand.5.patch,
> TEZ-776.ondemand.patch, Without_patch_AM_CPU_Usage.png,
> events-problem-solutions.txt, with_patch_jmc_output_of_AM.png,
> without_patch_jmc_output_of_AM.png
>
>
> This is open ended at the moment.
> A fair chunk of the AM heap is taken up by TezEvents (specifically
> DataMovementEvents - 64 bytes per event).
> Depending on the connection pattern, this puts limits on the number of tasks
> that can be processed.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)