[ 
https://issues.apache.org/jira/browse/SPARK-52776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-52776.
---------------------------------

> ProcfsMetricsGetter splits the comm field if it contains space characters
> -------------------------------------------------------------------------
>
>                 Key: SPARK-52776
>                 URL: https://issues.apache.org/jira/browse/SPARK-52776
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.0, 4.1.0, 4.0.0
>            Reporter: Maxime Xu
>            Assignee: Maxime Xu
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 4.1.0, 4.0.1, 3.5.7
>
>
> When reading the /proc/<pid>/stat file, ProcfsMetricsGetter currently splits 
> the line on the space character (see 
> [here|https://github.com/apache/spark/blob/842633011325bee1a5a0b4243a1081c086394def/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L99C37-L99C48]).
>  However, we observed that the comm field in the stat file can contain 
> spaces. Each such space shifts every field after comm by one position, 
> causing issues such as incorrect reporting of vmem/rssmem.
> Sample stat file captured for reference:
> {code:java}
> 487713 (Executor task l) D 474416 474398 474398 0 -1 4194368 5 0 0 0 0 0 0 0 
> 25 5 1 0 1542745216 7469137920 120815 18446744073709551615 104424108929024 
> 104424108932808 140734257079632 0 0 0 4 3 553671884 1 0 0 17 58 0 0 0 0 0 
> 104424108940536 104424108941336 104424532111360 140734257083781 
> 140734257085131 140734257085131 140734257102797 0 {code}
> In the above example, the two spaces in the comm field shift the later 
> fields by two positions, so the calculated rssmem is 1542745216 * 4096 
> (page size) ~ 5.7TiB when the right value should be 120815 * 4096 ~ 470MiB.
> The {{Executor task l}} name is derived from 
> [here|https://github.com/apache/spark/blob/842633011325bee1a5a0b4243a1081c086394def/core/src/main/scala/org/apache/spark/executor/Executor.scala#L135]
>  but the process is renamed shortly after the Python process starts. If 
> {{ProcfsMetricsGetter}} reads the stat file after the process starts but 
> before the rename, it hits this issue. This race makes the bug tricky to 
> reproduce. We reproduced it by running hundreds or thousands of small 
> PySpark jobs with {{spark.executor.heartbeatInterval}} set to a low value 
> such as 0.1s, which increases the likelihood of hitting the bug.
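
For illustration only (this is not the actual patch): a minimal Python sketch of one way to split a stat line so that spaces inside comm do not shift the later fields. It assumes comm is the text between the first "(" and the last ")" on the line. The helper names split_stat_naive and split_stat_safe and the 4096-byte page size are assumptions made for this example. The field indices 22 (vmem) and 23 (rssmem) follow the zero-based positions discussed in the description above.

```python
PAGE_SIZE = 4096  # assumed page size, matching the arithmetic in the report

# The sample stat line from the report, joined into a single line.
SAMPLE_STAT = (
    "487713 (Executor task l) D 474416 474398 474398 0 -1 4194368 5 0 0 0 0 0 0 0 "
    "25 5 1 0 1542745216 7469137920 120815 18446744073709551615 104424108929024 "
    "104424108932808 140734257079632 0 0 0 4 3 553671884 1 0 0 17 58 0 0 0 0 0 "
    "104424108940536 104424108941336 104424532111360 140734257083781 "
    "140734257085131 140734257085131 140734257102797 0"
)


def split_stat_naive(line):
    """Split on every space; a comm containing spaces shifts later fields."""
    return line.split(" ")


def split_stat_safe(line):
    """Treat everything between the first '(' and the last ')' as comm."""
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = line[:open_idx].strip()
    comm = line[open_idx + 1:close_idx]
    rest = line[close_idx + 1:].split()
    return [pid, comm] + rest


naive = split_stat_naive(SAMPLE_STAT)
safe = split_stat_safe(SAMPLE_STAT)

# Index 23 is the rss page count when comm has no spaces.
print("naive rssmem bytes:", int(naive[23]) * PAGE_SIZE)
print("safe  rssmem bytes:", int(safe[23]) * PAGE_SIZE)
```

With the naive split, index 23 lands two fields early (on 1542745216) because comm contributes three tokens instead of one. The safe split keeps comm as a single field, so index 23 is the real rss page count (120815).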



