Tom Howland created SPARK-34015:
-----------------------------------
Summary: SparkR partition timing summary reports input time
correctly
Key: SPARK-34015
URL: https://issues.apache.org/jira/browse/SPARK-34015
Project: Spark
Issue Type: Bug
Components: SparkR
Affects Versions: 3.0.1, 2.3.2
Environment: Observed on CentOS-7 running spark 2.3.1 and on my mac
running master
Reporter: Tom Howland
When SparkR is run at log level INFO, it prints a summary of how the worker
spent its time processing the partition. A logic error causes this summary to
over-report the time spent reading input rows.
In detail: in the wider context, the variable inputElap marks the beginning of
reading rows, but in the part changed here it was also used as a local variable
for measuring compute time. The error is therefore not observable when there is
only one group per partition, which is what you get in unit tests.
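
To illustrate the class of bug described above, here is a minimal sketch in
Python using a fake clock. It is a toy model, not the SparkR worker code: the
function summarize, its parameters, and the dictionary keys are all
hypothetical, and only the idea of an outer time marker being overwritten
inside a per-group loop is taken from the report.

```python
def summarize(group_costs, read_cost, reuse_marker):
    """Toy timing summary for one partition, driven by a fake clock.

    group_costs  -- hypothetical compute cost of each group, in seconds
    read_cost    -- hypothetical cost of reading the input rows, in seconds
    reuse_marker -- if True, overwrite the outer input marker inside the loop
                    (the kind of mistake the report describes)
    """
    clock = 0.0
    read_start = clock          # reading of input rows starts here
    clock += read_cost
    input_elap = clock          # outer marker: reading is finished

    compute_total = 0.0
    for cost in group_costs:
        group_start = clock
        if reuse_marker:
            # Bug: the outer marker is reused as a loop-local start time,
            # so it drifts forward with every group that is processed.
            input_elap = group_start
        clock += cost
        compute_total += clock - group_start

    return {
        "read_input": input_elap - read_start,
        "compute": compute_total,
    }
```

With three groups of 10 s each and 5 s of reading, the reused marker yields
read_input = 25.0 while the separate marker yields read_input = 5.0. With a
single group both variants report 5.0, which matches the observation that the
error does not show up when there is only one group per partition.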
For our application, here's what a log entry looks like before these changes
were applied:
{{20/10/09 04:08:58 WARN RRunner: Times: boot = 0.013 s, init = 0.005 s,
broadcast = 0.000 s, read-input = 529.471 s, compute = 492.037 s, write-output
= 0.020 s, total = 1021.546 s}}
This suggests that we're spending more time reading rows than operating on
them.
After these changes, it looks like this:
{{20/12/15 06:43:29 WARN RRunner: Times: boot = 0.013 s, init = 0.010 s,
broadcast = 0.000 s, read-input = 120.275 s, compute = 1680.161 s, write-output
= 0.045 s, total = 1812.553 s}}