Tom Howland created SPARK-34015:
-----------------------------------

             Summary: SparkR partition timing summary reports input time 
correctly
                 Key: SPARK-34015
                 URL: https://issues.apache.org/jira/browse/SPARK-34015
             Project: Spark
          Issue Type: Bug
          Components: SparkR
    Affects Versions: 3.0.1, 2.3.2
         Environment: Observed on CentOS 7 running Spark 2.3.1 and on my Mac 
running master
            Reporter: Tom Howland


When SparkR is run at log level INFO, it prints a summary of how the worker 
spent its time processing each partition. A logic error causes this summary to 
over-report the time spent reading input rows.

In detail: in the wider context, the variable inputElap marks the beginning of 
reading rows, but in the part changed here it was reused as a local variable 
for measuring compute time. Because it is overwritten for every group, the 
reported input time is inflated whenever a partition holds several groups. The 
error is not observable if there is only one group per partition, which is what 
you get in unit tests.
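
The effect can be reproduced in a small simulation. This is only a sketch, not 
the SparkR worker code: it uses a fake clock, every name except inputElap 
(written input_elap here) is hypothetical, and it assumes the marker is taken 
once the input has been read and that read-input is reported as the marker 
minus the start time.

```python
def summarize(group_costs, read_cost, reuse_marker):
    """Simulate a worker timing summary on a fake clock (seconds).

    Hypothetical model: `input_elap` is the outer marker used to report
    read-input time; the loop measures compute time once per group.
    """
    clock = 0.0
    start = clock
    clock += read_cost              # reading the input rows
    input_elap = clock              # outer marker used for the summary
    compute_total = 0.0
    for cost in group_costs:
        if reuse_marker:
            input_elap = clock      # bug: the outer marker is overwritten
            begin = input_elap
        else:
            begin = clock           # fix: a separate local variable
        clock += cost               # computing on this group
        compute_total += clock - begin
    return {"read_input": input_elap - start, "compute": compute_total}


buggy = summarize([10.0, 10.0, 10.0], read_cost=5.0, reuse_marker=True)
fixed = summarize([10.0, 10.0, 10.0], read_cost=5.0, reuse_marker=False)
print(buggy)
print(fixed)
```

In this model the compute time of every group but the last leaks into the 
reported read-input time, and with a single group the two variants agree, which 
is consistent with unit tests not catching the problem.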

For our application, here's what a log entry looks like before these changes 
were applied:

{{20/10/09 04:08:58 WARN RRunner: Times: boot = 0.013 s, init = 0.005 s, 
broadcast = 0.000 s, read-input = 529.471 s, compute = 492.037 s, write-output 
= 0.020 s, total = 1021.546 s}}

This suggests, misleadingly, that we are spending more time reading rows than 
operating on them.

After these changes, it looks like this:

{{20/12/15 06:43:29 WARN RRunner: Times: boot = 0.013 s, init = 0.010 s, 
broadcast = 0.000 s, read-input = 120.275 s, compute = 1680.161 s, write-output 
= 0.045 s, total = 1812.553 s}}
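
To compare the two entries, the share of total time attributed to each phase 
can be computed directly from the log text. The helper below is a hypothetical 
sketch (phase_shares is not part of Spark); it assumes each timing appears as 
"name = value s", as in the entries above.

```python
import re


def phase_shares(log_line):
    """Return each phase's fraction of the reported total time."""
    pairs = re.findall(r"([\w-]+) = ([\d.]+) s", log_line)
    times = {name: float(value) for name, value in pairs}
    total = times.pop("total")
    return {name: value / total for name, value in times.items()}


before = ("Times: boot = 0.013 s, init = 0.005 s, broadcast = 0.000 s, "
          "read-input = 529.471 s, compute = 492.037 s, "
          "write-output = 0.020 s, total = 1021.546 s")
after = ("Times: boot = 0.013 s, init = 0.010 s, broadcast = 0.000 s, "
         "read-input = 120.275 s, compute = 1680.161 s, "
         "write-output = 0.045 s, total = 1812.553 s")

for label, line in (("before", before), ("after", after)):
    share = phase_shares(line)["read-input"]
    print(f"{label}: read-input is {share:.1%} of total")
```

By this measure, read-input drops from roughly half of the reported total to 
well under a tenth. The two entries come from different runs, so only the 
proportions are comparable.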



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
