GitHub user andrewor14 opened a pull request:
https://github.com/apache/spark/pull/7770
[SPARK-8735] [WIP] [SQL] Expose memory usage for shuffles, joins and
aggregations
This patch exposes the memory used by internal data structures on the
SparkUI. This tracks memory used by all spilling operations and SQL operators
backed by Tungsten, e.g. `BroadcastHashJoin`, `ExternalSort`,
`GeneratedAggregate` etc. The metric exposed is "peak execution memory", which
broadly refers to the peak in-memory sizes of each of these data structures.
WIP because tests are coming soon.
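To make the metric concrete, here is a minimal sketch in Python of how a per-task "peak execution memory" value might be accumulated. It is not Spark's code: the class `PeakMemoryTracker` and its methods are hypothetical names, and it assumes the task-level figure is the sum of each data structure's own peak size, which is one reading of the description above.

```python
class PeakMemoryTracker:
    """Hypothetical per-task tracker; not part of Spark's API."""

    def __init__(self):
        # structure name -> largest in-memory size seen so far, in bytes
        self._peaks = {}

    def record(self, name, size_bytes):
        """Record the current in-memory size of one data structure."""
        if size_bytes < 0:
            raise ValueError("size_bytes must be non-negative")
        self._peaks[name] = max(self._peaks.get(name, 0), size_bytes)

    def peak_execution_memory(self):
        """Assumed task-level metric: sum of each structure's peak size."""
        return sum(self._peaks.values())


tracker = PeakMemoryTracker()
tracker.record("ExternalSorter", 64 * 1024)
# After a spill the in-memory size drops, but the recorded peak stays.
tracker.record("ExternalSorter", 16 * 1024)
tracker.record("BroadcastHashJoin", 32 * 1024)
total_peak = tracker.peak_execution_memory()
```

The point of the sketch is that spilling lowers the current size of a structure without lowering the reported metric, since only the maximum per structure is kept.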
<img width="950" alt="screen shot 2015-07-29 at 7 43 17 pm"
src="https://cloud.githubusercontent.com/assets/2133137/8974760/87d65a1e-362a-11e5-998c-f24c6cc73b82.png">
<img width="802" alt="screen shot 2015-07-29 at 7 43 05 pm"
src="https://cloud.githubusercontent.com/assets/2133137/8974757/85786744-362a-11e5-9345-fc6e6aa0dfa3.png">
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/andrewor14/spark expose-memory-metrics
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7770.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7770
----
commit bd7ab3f0b8552becd749c56ac7863081d05558df
Author: Andrew Or <[email protected]>
Date: 2015-07-28T23:36:02Z
Add internal accumulators to TaskContext
Currently there is only one accumulator: peak execution memory,
which refers to the sizes of all data structures created in
shuffles, aggregations and joins.
commit 3c4f042f2f85d31ea6c8f9e90ebf06e94a6b377f
Author: Andrew Or <[email protected]>
Date: 2015-07-29T01:05:20Z
Track memory usage in ExternalAppendOnlyMap / ExternalSorter
These are now tracked through the execution memory accumulator
for each task.
commit a417592cc5cff12e5e4d16b262ce6849b38c405e
Author: Andrew Or <[email protected]>
Date: 2015-07-29T02:11:21Z
Expose memory metrics in UnsafeExternalSorter
commit e6c3e2f53b27842f2a29d690df6429af4907d98f
Author: Andrew Or <[email protected]>
Date: 2015-07-29T21:22:25Z
Move internal accumulators creation to Stage
This is for two reasons:
(1) Accumulators must be created on the driver such that all
executors can use the same accumulator IDs to access the correct
accumulators.
(2) Accumulators should be created on the stage level to allow
us to compare the accumulator values across all tasks within the
stage. This representation is more useful when we expose it to
the UI properly later.
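The two reasons above can be illustrated with a small Python sketch. It is an assumption-laden model, not Spark's scheduler: `StageAccumulator`, `run_task`, and `registry` are hypothetical names. The accumulator is created once on the "driver", its ID is handed to every task, and each task's update is merged back under that ID so that values can be compared across the stage.

```python
import itertools

_ids = itertools.count()  # driver-wide ID source (hypothetical)


class StageAccumulator:
    """Hypothetical driver-side accumulator, created once per stage."""

    def __init__(self, name):
        self.id = next(_ids)
        self.name = name
        self.per_task = {}  # task ID -> value reported by that task

    def merge(self, task_id, value):
        self.per_task[task_id] = value


def run_task(task_id, acc_id, peak_bytes):
    """Executor side: report an update keyed by the driver-assigned ID."""
    return task_id, {acc_id: peak_bytes}


# Driver: create the accumulator for the stage, then ship its ID to every task.
peak = StageAccumulator("peak execution memory")
registry = {peak.id: peak}

for task_id, observed in [(0, 100), (1, 250), (2, 175)]:
    tid, updates = run_task(task_id, peak.id, observed)
    for acc_id, value in updates.items():
        registry[acc_id].merge(tid, value)

# All tasks wrote to the same stage-level accumulator,
# so their values can be compared side by side.
largest_task = max(peak.per_task, key=peak.per_task.get)
```

Had each task created its own accumulator, the IDs would not line up on the driver and there would be no single place to compare task values within the stage.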
commit 4ef4cb11ab87327e5fe5d9146ed9d8a3f88ac66c
Author: Andrew Or <[email protected]>
Date: 2015-07-29T21:30:11Z
Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
Conflicts:
core/src/main/scala/org/apache/spark/scheduler/Task.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala
commit 9e824f2d5755404f791930a6fb49db82ff57d140
Author: Andrew Or <[email protected]>
Date: 2015-07-29T21:43:16Z
Add back execution memory tracking for *ExternalSort
This was removed in a merge conflict.
commit 9c605a4935f679555b68ccd51e3f74f3368d04d1
Author: Andrew Or <[email protected]>
Date: 2015-07-29T23:15:32Z
Track execution memory in GeneratedAggregate
commit 770ee54d2311f849317b85f719885dabaf8c37d4
Author: Andrew Or <[email protected]>
Date: 2015-07-30T00:18:54Z
Track execution memory in broadcast joins
commit d9b90155617f1f092dff3d4f72610ddcc2561562
Author: Andrew Or <[email protected]>
Date: 2015-07-30T00:50:36Z
Track execution memory in unsafe shuffles
commit eee54371beff1900e29e875ebb34c333f571dc1e
Author: Andrew Or <[email protected]>
Date: 2015-07-30T00:54:20Z
Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
Conflicts:
core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java
sql/core/src/main/scala/org/apache/spark/sql/execution/GeneratedAggregate.scala
commit 92b4b6b18760d0d1524570c12a84407a7ad43762
Author: Andrew Or <[email protected]>
Date: 2015-07-30T02:07:06Z
Display peak execution memory on the UI
This commit makes the UI display internal accumulators differently.
A future commit will add this to the summary metrics table and
add an informative tooltip to explain what the execution memory
means.
commit 5b5e6f36b8a0e37f1953e12c438e01c58872e5fa
Author: Andrew Or <[email protected]>
Date: 2015-07-30T02:40:10Z
Add peak execution memory to summary table + tooltip
----