[ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292913#comment-14292913 ]

Sven Krasser commented on SPARK-5395:
-------------------------------------

Some additional findings from my side: I've managed to trigger the problem 
using a simpler job on production data that basically does a reduceByKey 
followed by a count action. I get more than 20 workers (with 2 cores per 
executor) before any tasks in the first stage (the reduceByKey) complete, 
i.e. different from the stage-transition behavior you noticed. However, this 
does not occur if I run over a smaller data set, i.e. fewer production data 
files.
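
For reference, the shape of the repro job is roughly the following; the input 
path and the keying function are placeholders rather than the actual 
production code:
{noformat}
# Minimal sketch of the repro job shape (placeholder path and parser, not
# the actual production script). Master and resource settings are expected
# to come from spark-submit.
from pyspark import SparkContext

sc = SparkContext(appName="spark-5395-repro")

# Read the production input files (placeholder path).
lines = sc.textFile("s3://bucket/path/to/input/*")

# Key each record; the split here is only a stand-in parser.
pairs = lines.map(lambda line: (line.split("\t")[0], 1))

# A reduceByKey followed by a count action is enough to trigger the
# pyspark.daemon buildup on the full data set.
print(pairs.reduceByKey(lambda a, b: a + b).count())
{noformat}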

Before calling reduceByKey I have a coalesce call. Without it the error does 
not occur (at least in this smaller script). At first glance this looked 
potentially spilling-related (more data per task), but attempting to force 
spills by setting the Python worker memory very low did not get me a repro on 
test data.
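
The failing variant then looks roughly like the sketch below; the 
spark.python.worker.memory line shows the knob I used when trying to force 
spills on test data (all values are illustrative, not the exact production 
settings):
{noformat}
# Sketch of the failing variant: a coalesce before the reduceByKey.
# Partition count and memory value are illustrative only.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("spark-5395-coalesce-repro")
        # Knob used to try to force spills in the Python workers on test
        # data by keeping their memory budget very low (no repro there).
        .set("spark.python.worker.memory", "64m"))
sc = SparkContext(conf=conf)

lines = sc.textFile("s3://bucket/path/to/input/*")
pairs = lines.map(lambda line: (line.split("\t")[0], 1))

# With the coalesce the daemons pile up; dropping it makes the error go
# away in this smaller script.
print(pairs.coalesce(100).reduceByKey(lambda a, b: a + b).count())
{noformat}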

> Large number of Python workers causing resource depletion
> ---------------------------------------------------------
>
>                 Key: SPARK-5395
>                 URL: https://issues.apache.org/jira/browse/SPARK-5395
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>         Environment: AWS ElasticMapReduce
>            Reporter: Sven Krasser
>
> During job execution a large number of Python workers accumulates, 
> eventually causing YARN to kill containers for exceeding their memory 
> allocation (in the case below that is about 8G for executors plus 6G for 
> overhead per container).
> In this instance, 97 pyspark.daemon processes had accumulated by the time 
> the container was killed.
> {noformat}
> 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler 
> (Logging.scala:logInfo(59)) - Container marked as failed: 
> container_1421692415636_0052_01_000030. Exit status: 143. Diagnostics: 
> Container [pid=35211,containerID=container_1421692415636_0052_01_000030] is 
> running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB 
> physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing 
> container.
> Dump of the process-tree for container_1421692415636_0052_01_000030 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) 
> VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m 
> pyspark.daemon
> |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m 
> pyspark.daemon
> |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m 
> pyspark.daemon
>       [...]
> {noformat}
> The configuration uses 64 containers with 2 cores each.
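> A configuration sketch along these lines matches that resource layout; the 
> memory figures are approximate (the exact values are in the full output 
> linked below):
> {noformat}
> # Rough sketch of the resource layout only: 64 executors, 2 cores each,
> # roughly 8G heap plus 6G YARN overhead per container. Not the exact
> # production invocation; master is assumed to be YARN per the environment.
> from pyspark import SparkConf, SparkContext
>
> conf = (SparkConf()
>         .setAppName("resource-layout-sketch")
>         .setMaster("yarn-client")
>         .set("spark.executor.instances", "64")
>         .set("spark.executor.cores", "2")
>         .set("spark.executor.memory", "8g")
>         .set("spark.yarn.executor.memoryOverhead", "6144"))
> sc = SparkContext(conf=conf)
> {noformat}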
> Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c
> Mailinglist discussion: 
> https://www.mail-archive.com/user@spark.apache.org/msg20102.html


