[
https://issues.apache.org/jira/browse/SPARK-52776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun closed SPARK-52776.
---------------------------------
> ProcfsMetricsGetter splits the comm field if it contains space characters
> -------------------------------------------------------------------------
>
> Key: SPARK-52776
> URL: https://issues.apache.org/jira/browse/SPARK-52776
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.0, 4.1.0, 4.0.0
> Reporter: Maxime Xu
> Assignee: Maxime Xu
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.1.0, 4.0.1, 3.5.7
>
>
> When reading the /proc/<pid>/stat file, ProcfsMetricsGetter currently splits
> the line on the space character (see
> [here|https://github.com/apache/spark/blob/842633011325bee1a5a0b4243a1081c086394def/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L99C37-L99C48]).
> However, we observed that the comm field in the stat file can contain
> spaces. Each space shifts the fields that follow comm by one position,
> causing issues such as incorrect reporting of vmem/rssmem.
> Sample stat file captured for reference:
> {code:java}
> 487713 (Executor task l) D 474416 474398 474398 0 -1 4194368 5 0 0 0 0 0 0 0
> 25 5 1 0 1542745216 7469137920 120815 18446744073709551615 104424108929024
> 104424108932808 140734257079632 0 0 0 4 3 553671884 1 0 0 17 58 0 0 0 0 0
> 104424108940536 104424108941336 104424532111360 140734257083781
> 140734257085131 140734257085131 140734257102797 0 {code}
> In the above example, the two spaces in the comm field cause the calculated
> rssmem to be 1542745216 * 4096 (page size) ~ 5.7TiB when the right value
> should be 120815 * 4096 ~ 470MiB.
> The {{Executor task l}} name is derived from
> [here|https://github.com/apache/spark/blob/842633011325bee1a5a0b4243a1081c086394def/core/src/main/scala/org/apache/spark/executor/Executor.scala#L135]
> but the process gets renamed shortly after the Python process starts. If
> {{ProcfsMetricsGetter}} reads the stat file after the process has started but
> before the rename, it parses the name containing spaces and hits this issue.
> This timing window makes the bug tricky to reproduce. We reproduced it by
> running hundreds or thousands of small PySpark jobs with
> spark.executor.heartbeatInterval set to a low value such as 0.1s, which
> increases the likelihood of hitting the window.
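
For illustration, the field shift described in the quoted report can be reproduced offline with the sample stat line. The sketch below is not the actual Spark change: the helper names (split_naive, split_comm_aware) are hypothetical, the 4096-byte page size is taken from the report, and the 0-based indices 22 (vsize) and 23 (rss) follow the field order documented in proc(5). It treats everything between the first "(" and the last ")" as comm, a common way to parse this file because comm may itself contain spaces or parentheses.

```python
# Sample /proc/<pid>/stat content from the report, joined onto one line.
SAMPLE_STAT = (
    "487713 (Executor task l) D 474416 474398 474398 0 -1 4194368 5 0 0 0 0 0 0 0 "
    "25 5 1 0 1542745216 7469137920 120815 18446744073709551615 104424108929024 "
    "104424108932808 140734257079632 0 0 0 4 3 553671884 1 0 0 17 58 0 0 0 0 0 "
    "104424108940536 104424108941336 104424532111360 140734257083781 "
    "140734257085131 140734257085131 140734257102797 0"
)

PAGE_SIZE = 4096   # assumption: page size used in the report
VSIZE_INDEX = 22   # 0-based position of vsize (bytes)
RSS_INDEX = 23     # 0-based position of rss (pages)


def split_naive(stat_line):
    """Split on every space; breaks when comm contains spaces."""
    return stat_line.split(" ")


def split_comm_aware(stat_line):
    """Keep comm (first '(' through last ')') as a single field."""
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = stat_line[:open_paren].strip()
    comm = stat_line[open_paren:close_paren + 1]
    rest = stat_line[close_paren + 1:].split()
    return [pid, comm] + rest


naive = split_naive(SAMPLE_STAT)
fixed = split_comm_aware(SAMPLE_STAT)

# With the naive split, index 23 lands on starttime instead of rss.
naive_rss_bytes = int(naive[RSS_INDEX]) * PAGE_SIZE
fixed_rss_bytes = int(fixed[RSS_INDEX]) * PAGE_SIZE
fixed_vsize_bytes = int(fixed[VSIZE_INDEX])

print("naive rss bytes:", naive_rss_bytes)
print("fixed rss bytes:", fixed_rss_bytes)
print("fixed vsize bytes:", fixed_vsize_bytes)
```

The naive split yields two extra tokens for this sample, so every index after comm is off by two; the comm-aware split keeps the field positions stable regardless of the process name.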