Hey Lukas,

As a preamble, I have to say: if you consider 200MB an incredibly large amount of memory, you're probably either working with the wrong system or worrying about optimizing the wrong thing. Your SamzaContainers are likely not going to be able to run without a few hundred megabytes of space. All of our containers run with at least 1G, and the AM becomes completely negligible compared to the total amount of resources a job uses.

The defaults for the AM and the SamzaContainer are both:

  -Xmx768M
  1000MB containers

This means that YARN will kill your process (AM or SamzaContainer) if it goes over the 1G limit, and a container will OOME if it goes over 768MB of heap usage.

First, I'll address the AM's heap. There are two main reasons why we want a 768MB heap.

* The AM runs a Scalatra webapp, which requires significant heap when it runs. We tried other -Xmx settings, but 768 seemed to be the lowest stable setting for all jobs.
* Samza's core code is implemented in Scala, which can bloat the JVM. A quick glance shows about 12% of heap used for random scala.reflect classes.

The 1G container limit (vs. 768MB heap) is to give the AM extra space for things like:

* perm gen
* off-heap space
* page cache
* thread stacks

> Is this behavior common or could this be some misconfiguration?

It is common. I took a look at some of our jobs. They're running between 150MB and 250MB in steady state. When I load the AM webpage, the heap spikes up to ~300MB.

> As I understand, one of the problems is that each container has it's own
> VM instance and has to load all the libraries. Could there be some other
> issues?

There is a little bit of inefficiency from this, but it should be negligible. The 200MB of heap usage that you're seeing is made up of actual objects being used by the AM. Don't forget that the AM is running a YARN client, a web service, a MetricsReporter, etc.

If you're unhappy with the amount of memory that the AM is taking up, the first thing that you can do is to tune these two settings:

  yarn.am.opts (to set -Xmx)
  yarn.am.container.memory.mb (to lower YARN container memory mb)

You can experiment to see how low you can get the heap and container settings.

Cheers,
Chris

On 9/18/14 10:13 AM, "Lukas Steiblys" <[email protected]> wrote:

>Hello,
>
>I'm trying to use Samza for our new data processing pipeline using YARN
>for job scheduling and I've noticed that it consumes an incredibly large
>amount of memory. Running the Application Master, that should be a very
>lightweight application in my opinion, consumes around ~1.4GB of virtual
>memory and ~200MB of physical memory. Same goes for the actual tasks.
>
>Is this behavior common or could this be some misconfiguration? As I
>understand, one of the problems is that each container has it's own VM
>instance and has to load all the libraries. Could there be some other
>issues? Maybe it's possible to actually split the application master
>package from the task package so it's more lightweight?
>
>Lukas
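
For reference, here is a minimal sketch of how the two AM settings named in the reply might look in a job's properties file. The values are illustrative assumptions, not recommendations from this thread. The reply notes that 768MB was the lowest heap setting found stable across all jobs, so anything lower would need to be tested per job.

```properties
# Hypothetical values for illustration only; tune and test per job.

# JVM options for the AM (used here only to lower the max heap).
yarn.am.opts=-Xmx512M

# YARN container size for the AM, in MB. Keep it above -Xmx to leave
# room for perm gen, off-heap space, page cache, and thread stacks.
yarn.am.container.memory.mb=768
```

If the AM starts hitting OOME or gets killed by YARN after a change like this, raising the two values back toward the defaults (768MB heap, 1000MB container) would be the first thing to try.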
