GitHub user ala opened a pull request:

    https://github.com/apache/spark/pull/16713

    [SC-5550] Automatic killing of tasks that are producing too many output rows

    ## What changes were proposed in this pull request?
    
    This change implements TaskOutputListener, which continuously monitors the
    metric updates sent incrementally by running tasks. In particular, it inspects
    the number of records read, generated, and produced. Tasks for which the ratio
    of (number of produced records) to (number of records read + generated)
    exceeds the threshold are canceled. This mechanism is off by default and can
    be turned on by setting the parameter spark.outputRatioKillThreshold.
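
The ratio check described above can be sketched roughly as follows. This is an illustration only, not the code in the patch: the `TaskMetricsSnapshot` class, the `should_kill` function, and the handling of a zero input count or a non-positive threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TaskMetricsSnapshot:
    """Hypothetical snapshot of the counters a running task reports."""
    records_read: int
    records_generated: int
    records_produced: int


def should_kill(m: TaskMetricsSnapshot, threshold: float) -> bool:
    """Return True when produced / (read + generated) exceeds the threshold."""
    # Assumption: a non-positive threshold stands in for "mechanism off".
    if threshold <= 0:
        return False
    inputs = m.records_read + m.records_generated
    # Assumption: with no input counted yet there is no ratio to judge.
    if inputs == 0:
        return False
    return m.records_produced / inputs > threshold
```

Under these assumptions, with a threshold of 1000, a task that has read 10 records and produced 10,001 would be flagged, while one that has produced exactly 10,000 would not.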
    
    Additionally, tests were added to check the correctness of the metrics
    mentioned above.
    
    For the Range operator, a new metric "number of generated rows" was added.
    
    ## How was this patch tested?
    
    Unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ala/spark metrics

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16713.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16713
    
----
commit 29d36b879550d03551169cd10ae2d22c5e54c1c1
Author: Ala Luszczak <[email protected]>
Date:   2017-01-26T17:36:01Z

    This change implements TaskOutputListener, which continuously monitors the
    metric updates sent incrementally by running tasks. In particular, it inspects
    the number of records read, generated, and produced. Tasks for which the ratio
    of (number of produced records) to (number of records read + generated)
    exceeds the threshold are canceled. This mechanism is off by default and can
    be turned on by setting the parameter spark.outputRatioKillThreshold.
    
    Additionally, tests were added to check the correctness of the metrics
    mentioned above.
    
    For the Range operator, a new metric "number of generated rows" was added.

----


