Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/7770#issuecomment-127097979
Reviewed 1 of 24 files at r4, 2 of 10 files at r6, 5 of 13 files at r7, 1 of 1 file at r8, 1 of 4 files at r9, 1 of 3 files at r10, 16 of 20 files at r12, 2 of 2 files at r15, 3 of 3 files at r16.
Review status: all files reviewed at latest revision, 12 unresolved
discussions, all commit checks successful.
---
<sup>**[core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala,
line 791
\[r1\]](https://reviewable.io:443/reviews/apache/spark/7770#-JvVHinFI4nT0TEsx8sk-r1-791)**
([raw
file](https://github.com/apache/spark/blob/5b5e6f36b8a0e37f1953e12c438e01c58872e5fa/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L791)):</sup>
Ah, right. SGTM.
---
<sup>**[core/src/test/scala/org/apache/spark/AccumulatorSuite.scala, line
275
\[r16\]](https://reviewable.io:443/reviews/apache/spark/7770#-JvlEDWoM1qKnCpUfRLP)**
([raw
file](https://github.com/apache/spark/blob/c00a197a2f78844f4f18e52364d55116cb0a374a/core/src/test/scala/org/apache/spark/AccumulatorSuite.scala#L275)):</sup>
Could also just throw an exception here.
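For example, something like this (just a sketch; the exception type and message are up to you, not taken from the suite):

```scala
// Fail fast if this code path is ever reached, instead of returning a
// sentinel value that a later assertion has to catch.
throw new UnsupportedOperationException("this branch should never execute")
```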
---
<sup>**[core/src/test/scala/org/apache/spark/ui/StagePageSuite.scala, line
53
\[r13\]](https://reviewable.io:443/reviews/apache/spark/7770#-JvkmpXznXanlsxkO4pa)**
([raw
file](https://github.com/apache/spark/blob/a87b4d0e06e407444b746619e1dde96ae0bc4daa/core/src/test/scala/org/apache/spark/ui/StagePageSuite.scala#L53)):</sup>
Minor Mockito trick that I like using: the second argument to `mock()` specifies the default answer returned for non-stubbed methods. If you pass `Mockito.RETURNS_SMART_NULLS` as that second argument, then you'll get much more detailed error messages when an NPE occurs due to a non-mocked/stubbed method.
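A self-contained toy showing the idea (the `JobStore`/`JobInfo` types here are invented for illustration; the real suite would pass the same second argument when mocking its Spark classes):

```scala
import org.mockito.Mockito
import org.mockito.Mockito.mock

// Invented interface, purely for illustration.
trait JobStore { def lookup(id: Int): JobInfo }
class JobInfo(val name: String)

object SmartNullsDemo {
  def main(args: Array[String]): Unit = {
    // The second argument to mock() is the default Answer for un-stubbed calls.
    val store = mock(classOf[JobStore], Mockito.RETURNS_SMART_NULLS)
    val info = store.lookup(42)  // un-stubbed: returns a "smart null" proxy
    // Dereferencing the smart null throws SmartNullPointerException, whose
    // message names the un-stubbed lookup(42) call instead of a bare NPE.
    println(info.name)
  }
}
```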
---
<sup>**[core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala,
line 408
\[r16\]](https://reviewable.io:443/reviews/apache/spark/7770#-Jvl5D7f5rgRjCB303Iq)**
([raw
file](https://github.com/apache/spark/blob/c00a197a2f78844f4f18e52364d55116cb0a374a/core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala#L408)):</sup>
This test seems like it could become brittle in the future. To guard against that, it might be a good idea to integrate with the listener / metrics systems to verify whether spilling occurred. It's fine to leave this to a followup patch, though, given that this is not a huge concern.
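Rough sketch of what I mean (`SpillCountingListener` is an invented name; the spill counters on `TaskMetrics` are the existing ones):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Accumulates the spill counters that TaskMetrics already tracks, so a test
// can assert directly on whether spilling happened.
class SpillCountingListener extends SparkListener {
  @volatile var bytesSpilled = 0L
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      bytesSpilled += metrics.memoryBytesSpilled + metrics.diskBytesSpilled
    }
  }
}
```

Register it with `sc.addSparkListener(...)` before running the job, then assert `bytesSpilled > 0` (or `== 0`) afterwards.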
---
<sup>**[sql/core/src/main/scala/org/apache/spark/sql/execution/GeneratedAggregate.scala,
line 306
\[r16\]](https://reviewable.io:443/reviews/apache/spark/7770#-Jvl6qWYX_EY-T3clcvi)**
([raw
file](https://github.com/apache/spark/blob/c00a197a2f78844f4f18e52364d55116cb0a374a/sql/core/src/main/scala/org/apache/spark/sql/execution/GeneratedAggregate.scala#L306)):</sup>
I think that we should move this call up slightly so that it takes place right before we return this result iterator. That will keep the memory accounting from under-reporting peak memory if we later make `aggregationMap.iterator()` a destructive iterator that frees pages as it returns records.
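To make the failure mode concrete, here's a self-contained toy (all names invented) where the map reports current usage and the iterator frees pages as it goes:

```scala
object PeakSnapshotOrdering {
  // Toy map whose destructive iterator releases a "page" per record.
  final class ToyMap(var usedBytes: Long, records: Int) {
    def destructiveIterator: Iterator[Int] =
      Iterator.tabulate(records) { i => usedBytes -= 1024; i }
  }

  def main(args: Array[String]): Unit = {
    val map = new ToyMap(usedBytes = 4096, records = 4)
    val recordedUpFront = map.usedBytes        // 4096: what we want to report
    map.destructiveIterator.foreach(_ => ())   // consume; pages get freed
    val recordedAfterwards = map.usedBytes     // 0: under-reported
    println(s"up front: $recordedUpFront, afterwards: $recordedAfterwards")
  }
}
```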
---
<sup>**[sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala,
line 78
\[r16\]](https://reviewable.io:443/reviews/apache/spark/7770#-Jvl8u8dGfGCUytrYyfN)**
([raw
file](https://github.com/apache/spark/blob/c00a197a2f78844f4f18e52364d55116cb0a374a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala#L78)):</sup>
One thought: a broadcast variable is used by multiple tasks but is only created once per JVM, so I think that this will overcount the broadcast variable's resource usage. For instance, with the current accounting: if you have a four-core machine with 1 gigabyte of RAM per core and a 1.5 gigabyte broadcast variable, then you might see a peak execution memory of 4 * 1.5 GB = 6 GB, which is more than the machine's total RAM even though actual consumption was never that high.
One way to deal with this would be to split a task's memory consumption into two categories, private and shared, similar to how operating systems' `top` tools (or Activity Monitor on OS X) report shared memory separately from applications' private memory. This change should be fairly easy to make, but I don't think that it needs to be done as part of this patch.
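Toy sketch of that accounting (all names invented):

```scala
object PrivateSharedAccounting {
  // A task's private allocations plus its view of shared resources, keyed by id.
  case class TaskMemory(privateBytes: Long, sharedBytes: Map[String, Long])

  def executorPeak(tasks: Seq[TaskMemory]): Long = {
    val privateTotal = tasks.map(_.privateBytes).sum
    // Deduplicate shared resources by id so each one is counted once per JVM,
    // not once per task.
    val sharedTotal = tasks.flatMap(_.sharedBytes).toMap.values.sum
    privateTotal + sharedTotal
  }

  def main(args: Array[String]): Unit = {
    val broadcast = Map("broadcast-0" -> (1500L << 20))  // the 1.5 GB relation
    val tasks = Seq.fill(4)(TaskMemory(100L << 20, broadcast))
    // Prints ~1900 MB instead of the 6 GB that a naive per-task sum reports.
    println(executorPeak(tasks) >> 20)
  }
}
```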
---
<sup>**[sql/core/src/main/scala/org/apache/spark/sql/execution/sort.scala,
line 81
\[r1\]](https://reviewable.io:443/reviews/apache/spark/7770#-JvVHinQX4wFdRHT7yOV-r1-83)**
([raw
file](https://github.com/apache/spark/blob/5b5e6f36b8a0e37f1953e12c438e01c58872e5fa/sql/core/src/main/scala/org/apache/spark/sql/execution/sort.scala#L81)):</sup>
I'm pretty sure that `.iterator()` can only be called once (given that it ends up calling methods with names like `destructiveSorted...`), so I think we should add some asserts to enforce this implicit contract and then move this metrics-updating code into `ExternalSorter` itself. Pushing the spill-metrics updates closer to the site of the spilling will make it easier to keep the metrics in sync with the underlying writes.
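Something like this shape for the assert (sketch only; names invented):

```scala
// Hands out the destructive iterator exactly once; a second call fails
// loudly instead of silently re-sorting data whose pages were freed.
class OnceIterable[T](produce: () => Iterator[T]) {
  private var consumed = false
  def iterator(): Iterator[T] = {
    assert(!consumed, "iterator() may only be called once: the sort is destructive")
    consumed = true
    produce()
  }
}
```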
---
<sup>**[sql/core/src/test/scala/org/apache/spark/sql/BroadcastJoinSuite.scala,
line 18
\[r16\]](https://reviewable.io:443/reviews/apache/spark/7770#-JvlAdjyYunIYwbkUn9y)**
([raw
file](https://github.com/apache/spark/blob/c00a197a2f78844f4f18e52364d55116cb0a374a/sql/core/src/test/scala/org/apache/spark/sql/BroadcastJoinSuite.scala#L18)):</sup>
Minor nit: since this is testing something in the `sql.execution.joins` package, the test should probably live in that package as well.
---
<sup>**[sql/core/src/test/scala/org/apache/spark/sql/BroadcastJoinSuite.scala,
line 49
\[r16\]](https://reviewable.io:443/reviews/apache/spark/7770#-JvlBEPnOWNFyn4RzkCF)**
([raw
file](https://github.com/apache/spark/blob/c00a197a2f78844f4f18e52364d55116cb0a374a/sql/core/src/test/scala/org/apache/spark/sql/BroadcastJoinSuite.scala#L49)):</sup>
I'm fairly sure that this is going to lead to a "multiple active SparkContexts in the same JVM" failure, since TestSQLContext might have already created an active global context before this runs. Fixing this properly might require some fairly large refactoring of TestSQLContext and our other testing code, so for now we may end up having to disable this test. If we need to do that, let's file a 1.5.0 test-blocker JIRA.
---
Comments from the [review on
Reviewable.io](https://reviewable.io:443/reviews/apache/spark/7770)