[GitHub] spark pull request: [SPARK-8735] [WIP] [SQL] Expose memory usage f...

andrewor14 Wed, 29 Jul 2015 19:47:32 -0700

GitHub user andrewor14 opened a pull request:

    https://github.com/apache/spark/pull/7770


    [SPARK-8735] [WIP] [SQL] Expose memory usage for shuffles, joins and aggregations

    This patch exposes the memory used by internal data structures on the
    SparkUI. It tracks memory used by all spilling operations and by SQL
    operators backed by Tungsten, e.g. `BroadcastHashJoin`, `ExternalSort`,
    `GeneratedAggregate`, etc. The metric exposed is "peak execution memory",
    which broadly refers to the peak in-memory size of each of these data
    structures.
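
To illustrate what a "peak" metric means here, the following is a minimal Python sketch, not code from this patch. The class `PeakMemoryTracker` and its methods are hypothetical names; the sketch only assumes that a data structure reports how many bytes it acquires and releases (for example, when it spills).

```python
class PeakMemoryTracker:
    """Hypothetical helper: remembers the largest in-memory size seen so far."""

    def __init__(self):
        self.current = 0  # bytes currently held in memory
        self.peak = 0     # largest value `current` has ever reached

    def acquire(self, num_bytes):
        # Growing the structure may set a new peak.
        self.current += num_bytes
        self.peak = max(self.peak, self.current)

    def release(self, num_bytes):
        # Releasing memory (e.g. after a spill) lowers `current` but never `peak`.
        self.current = max(0, self.current - num_bytes)


tracker = PeakMemoryTracker()
tracker.acquire(64)
tracker.acquire(32)
tracker.release(96)  # simulate spilling everything to disk
tracker.acquire(16)
print(tracker.peak)  # 96
```

The peak stays at 96 bytes even though only 16 bytes are held at the end, which is why a peak is more informative than a final size for structures that spill.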
    
    WIP because tests are coming soon.
    
    <img width="950" alt="screen shot 2015-07-29 at 7 43 17 pm" src="https://cloud.githubusercontent.com/assets/2133137/8974760/87d65a1e-362a-11e5-998c-f24c6cc73b82.png">
    
    <img width="802" alt="screen shot 2015-07-29 at 7 43 05 pm" src="https://cloud.githubusercontent.com/assets/2133137/8974757/85786744-362a-11e5-9345-fc6e6aa0dfa3.png">
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark expose-memory-metrics

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7770.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #7770
    
----
commit bd7ab3f0b8552becd749c56ac7863081d05558df
Author: Andrew Or <[email protected]>
Date:   2015-07-28T23:36:02Z

    Add internal accumulators to TaskContext
    
    Currently there is only one accumulator: peak execution memory,
    which refers to the sizes of all data structures created in
    shuffles, aggregations and joins.
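
As a rough illustration of the idea in this commit, the Python sketch below models a task context that carries one internal accumulator. It is not Spark's implementation: the classes `Accumulator` and `TaskContext` are simplified stand-ins, the accumulator name `peakExecutionMemory` is an assumption, and so is the choice to sum the peak sizes reported by each data structure in the task.

```python
PEAK_EXECUTION_MEMORY = "peakExecutionMemory"  # assumed name, for illustration only


class Accumulator:
    """Simplified add-only counter; `internal` marks it as system-managed."""

    def __init__(self, name, internal=False):
        self.name = name
        self.internal = internal
        self.value = 0

    def add(self, amount):
        self.value += amount


class TaskContext:
    """Simplified stand-in for a per-task context holding internal accumulators."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.internal_accumulators = {
            PEAK_EXECUTION_MEMORY: Accumulator(PEAK_EXECUTION_MEMORY, internal=True)
        }

    def report_peak_memory(self, num_bytes):
        # Each shuffle, aggregation or join structure reports its own peak size.
        self.internal_accumulators[PEAK_EXECUTION_MEMORY].add(num_bytes)


context = TaskContext(task_id=0)
context.report_peak_memory(4096)  # e.g. a sorter's peak size
context.report_peak_memory(1024)  # e.g. a hash map's peak size
task_peak = context.internal_accumulators[PEAK_EXECUTION_MEMORY].value
print(task_peak)  # 5120
```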

commit 3c4f042f2f85d31ea6c8f9e90ebf06e94a6b377f
Author: Andrew Or <[email protected]>
Date:   2015-07-29T01:05:20Z

    Track memory usage in ExternalAppendOnlyMap / ExternalSorter
    
    These are now tracked through the execution memory accumulator
    for each task.

commit a417592cc5cff12e5e4d16b262ce6849b38c405e
Author: Andrew Or <[email protected]>
Date:   2015-07-29T02:11:21Z

    Expose memory metrics in UnsafeExternalSorter

commit e6c3e2f53b27842f2a29d690df6429af4907d98f
Author: Andrew Or <[email protected]>
Date:   2015-07-29T21:22:25Z

    Move internal accumulators creation to Stage
    
    This is for two reasons:
    
    (1) Accumulators must be created on the driver such that all
    executors can use the same accumulator IDs to access the correct
    accumulators.
    
    (2) Accumulators should be created on the stage level to allow
    us to compare the accumulator values across all tasks within the
    stage. This representation is more useful when we expose it to
    the UI properly later.
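
The two reasons above can be sketched in Python as follows. This is an illustrative model only, not the patch's code: `Stage`, `run_task` and the ID counter are hypothetical, and the sketch assumes that executors send back updates keyed by the accumulator ID the driver assigned.

```python
import itertools
import statistics

_next_accumulator_id = itertools.count()  # driver-side ID generator (hypothetical)


class Stage:
    """Driver-side stage that owns one accumulator ID shared by all its tasks."""

    def __init__(self, stage_id):
        self.stage_id = stage_id
        # (1) Created once on the driver, so every task refers to the same ID.
        self.peak_memory_accumulator_id = next(_next_accumulator_id)
        self.peak_memory_by_task = {}

    def on_task_end(self, task_id, updates):
        # (2) Values are kept per task so they can be compared within the stage.
        value = updates.get(self.peak_memory_accumulator_id, 0)
        self.peak_memory_by_task[task_id] = value


def run_task(accumulator_id, peak_bytes):
    """Executor side: report an update keyed by the driver-assigned ID."""
    return {accumulator_id: peak_bytes}


stage = Stage(stage_id=0)
for task_id, peak_bytes in enumerate([100, 250, 175]):
    updates = run_task(stage.peak_memory_accumulator_id, peak_bytes)
    stage.on_task_end(task_id, updates)

values = list(stage.peak_memory_by_task.values())
summary = (min(values), statistics.median(values), max(values))
print(summary)  # (100, 175, 250)
```

Keeping one value per task is what makes a min / median / max summary across the stage possible.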

commit 4ef4cb11ab87327e5fe5d9146ed9d8a3f88ac66c
Author: Andrew Or <[email protected]>
Date:   2015-07-29T21:30:11Z

    Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
    
    Conflicts:
        core/src/main/scala/org/apache/spark/scheduler/Task.scala
        sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala

commit 9e824f2d5755404f791930a6fb49db82ff57d140
Author: Andrew Or <[email protected]>
Date:   2015-07-29T21:43:16Z

    Add back execution memory tracking for *ExternalSort
    
    This was removed in a merge conflict.

commit 9c605a4935f679555b68ccd51e3f74f3368d04d1
Author: Andrew Or <[email protected]>
Date:   2015-07-29T23:15:32Z

    Track execution memory in GeneratedAggregate

commit 770ee54d2311f849317b85f719885dabaf8c37d4
Author: Andrew Or <[email protected]>
Date:   2015-07-30T00:18:54Z

    Track execution memory in broadcast joins

commit d9b90155617f1f092dff3d4f72610ddcc2561562
Author: Andrew Or <[email protected]>
Date:   2015-07-30T00:50:36Z

    Track execution memory in unsafe shuffles

commit eee54371beff1900e29e875ebb34c333f571dc1e
Author: Andrew Or <[email protected]>
Date:   2015-07-30T00:54:20Z

    Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
    
    Conflicts:
        core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java
        sql/core/src/main/scala/org/apache/spark/sql/execution/GeneratedAggregate.scala

commit 92b4b6b18760d0d1524570c12a84407a7ad43762
Author: Andrew Or <[email protected]>
Date:   2015-07-30T02:07:06Z

    Display peak execution memory on the UI
    
    This commit makes the UI display internal accumulators differently.
    A future commit will add this to the summary metrics table and
    add an informative tooltip to explain what the execution memory
    means.

commit 5b5e6f36b8a0e37f1953e12c438e01c58872e5fa
Author: Andrew Or <[email protected]>
Date:   2015-07-30T02:40:10Z

    Add peak execution memory to summary table + tooltip

----


