[ 
https://issues.apache.org/jira/browse/SPARK-34845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-34845:
--------------------------------
    Description: 
When the procfs metrics of some child pids are unavailable, 
ProcfsMetricsGetter.computeAllMetrics() returns partial metrics (the sum over 
a subset of child pids) instead of an all-zero result. This can be misleading 
and is not the intended behavior according to the current code comments in 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214].

 

A side effect is that it can produce verbose warning logs when many pids' 
stat files are missing, for example:
{code:java}
2021-03-21 16:58:25,422 [pool-26-thread-8] WARN  org.apache.spark.executor.ProcfsMetricsGetter  - There was a problem with reading the stat file of the process.
java.io.FileNotFoundException: /proc/742/stat (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at org.apache.spark.executor.ProcfsMetricsGetter.openReader$1(ProcfsMetricsGetter.scala:203)
    at org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$addProcfsMetricsFromOneProcess$1(ProcfsMetricsGetter.scala:205)
    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2540)
    at org.apache.spark.executor.ProcfsMetricsGetter.addProcfsMetricsFromOneProcess(ProcfsMetricsGetter.scala:205)
    at org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$computeAllMetrics$1(ProcfsMetricsGetter.scala:297)
{code}
The issue can be fixed by setting the isAvailable flag to false when the 
procfs metrics of any child pid are unavailable. The methods computePid, 
computePageSize, and getChildPids already behave this way.

  was:
When the procfs metrics of some child pids are unavailable, 
ProcfsMetricsGetter.computeAllMetrics() returns partial metrics (the sum of a 
subset of child pids), instead of an all 0 result. This can be misleading and 
is undesired per the current code comments in 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214].

 

A side effect is that it can produce verbose warning logs when many pids' 
stat files are missing, for example:
{noformat}
2021-03-21 16:58:25,422 [pool-26-thread-8] WARN  org.apache.spark.executor.ProcfsMetricsGetter  - There was a problem with reading the stat file of the process.
java.io.FileNotFoundException: /proc/742/stat (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at org.apache.spark.executor.ProcfsMetricsGetter.openReader$1(ProcfsMetricsGetter.scala:203)
    at org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$addProcfsMetricsFromOneProcess$1(ProcfsMetricsGetter.scala:205)
    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2540)
    at org.apache.spark.executor.ProcfsMetricsGetter.addProcfsMetricsFromOneProcess(ProcfsMetricsGetter.scala:205)
    at org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$computeAllMetrics$1(ProcfsMetricsGetter.scala:297)
{noformat}
The issue can be fixed by updating the flag isAvailable to false when one of 
the child pid's procfs metric is unavailable. Other methods computePid, 
computePageSize, and getChildPids already have this behavior.


> ProcfsMetricsGetter.computeAllMetrics shouldn't return partial metrics when 
> some of child pids metrics are missing
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34845
>                 URL: https://issues.apache.org/jira/browse/SPARK-34845
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>            Reporter: Baohe Zhang
>            Priority: Major
>
> When the procfs metrics of some child pids are unavailable, 
> ProcfsMetricsGetter.computeAllMetrics() returns partial metrics (the sum over 
> a subset of child pids) instead of an all-zero result. This can be misleading 
> and is not the intended behavior according to the current code comments in 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214].
>  
> A side effect is that it can produce verbose warning logs when many pids' 
> stat files are missing, for example:
> {code:java}
> 2021-03-21 16:58:25,422 [pool-26-thread-8] WARN  org.apache.spark.executor.ProcfsMetricsGetter  - There was a problem with reading the stat file of the process.
> java.io.FileNotFoundException: /proc/742/stat (No such file or directory)
>     at java.io.FileInputStream.open0(Native Method)
>     at java.io.FileInputStream.open(FileInputStream.java:195)
>     at java.io.FileInputStream.<init>(FileInputStream.java:138)
>     at org.apache.spark.executor.ProcfsMetricsGetter.openReader$1(ProcfsMetricsGetter.scala:203)
>     at org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$addProcfsMetricsFromOneProcess$1(ProcfsMetricsGetter.scala:205)
>     at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2540)
>     at org.apache.spark.executor.ProcfsMetricsGetter.addProcfsMetricsFromOneProcess(ProcfsMetricsGetter.scala:205)
>     at org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$computeAllMetrics$1(ProcfsMetricsGetter.scala:297)
> {code}
> The issue can be fixed by setting the isAvailable flag to false when the 
> procfs metrics of any child pid are unavailable. The methods computePid, 
> computePageSize, and getChildPids already behave this way.
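
As a rough illustration of the fix proposed in the description (a sketch, not
Spark's actual implementation), the plain-Scala model below shows the intended
behavior. The names ProcfsMetrics, ProcfsSketch, and readStat are hypothetical
stand-ins introduced here; only isAvailable and computeAllMetrics come from
the issue text. The sketch also assumes that the flag stays false once it has
been set, so later calls skip the per-pid reads entirely.

```scala
// Hypothetical, simplified model of the behavior discussed in SPARK-34845.
// It does not use Spark classes; all names except isAvailable and
// computeAllMetrics are invented for illustration.
final case class ProcfsMetrics(vmemTotal: Long, rssTotal: Long) {
  def +(other: ProcfsMetrics): ProcfsMetrics =
    ProcfsMetrics(vmemTotal + other.vmemTotal, rssTotal + other.rssTotal)
}

// readStat stands in for reading /proc/<pid>/stat; None means the file
// could not be read (for example, the process has already exited).
class ProcfsSketch(readStat: Int => Option[ProcfsMetrics]) {
  private val zero = ProcfsMetrics(0L, 0L)

  // Plays the role of the isAvailable flag named in the issue.
  var isAvailable: Boolean = true

  def computeAllMetrics(pids: Seq[Int]): ProcfsMetrics = {
    var total = zero
    val it = pids.iterator
    // Stop at the first unreadable pid instead of summing the rest.
    while (isAvailable && it.hasNext) {
      readStat(it.next()) match {
        case Some(m) => total = total + m
        case None    => isAvailable = false
      }
    }
    // Report either the complete sum or all zeros, never a partial sum.
    if (isAvailable) total else zero
  }
}
```

Under these assumptions, a call such as
new ProcfsSketch(stats.get).computeAllMetrics(Seq(1, 742, 2)), where stats is
a Map[Int, ProcfsMetrics] with no entry for pid 742, yields the all-zero
result and stops reading at pid 742, which would also limit the number of
warnings logged.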



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
