Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/7770#issuecomment-127066305
I took a quick pass through the current diff.
One high-level question:
The tooltip comment says that this will only be used if Tungsten is
enabled, but I noticed that there are also peak memory consumption tests for
several non-Tungsten operators, including the existing external sorter,
ExternalAppendOnlyMap, etc. Is the concern that, for non-Tungsten jobs, showing
memory usage for only those operators would be confusing? The same confusion
could arise when a user has Tungsten enabled but is running non-SQL jobs. Given
that, I wonder whether it makes sense to always show the metric, irrespective
of whether Tungsten is used.
---
Reviewed 7 of 35 files at r1, 8 of 11 files at r2, 2 of 4 files at r3, 4 of
24 files at r4, 1 of 1 files at r5, 4 of 10 files at r6, 1 of 13 files at r7, 3
of 4 files at r9, 1 of 3 files at r10.
Review status: 31 of 49 files reviewed at latest revision, 17 unresolved
discussions, some commit checks failed.
---
<sup>**[core/src/main/java/org/apache/spark/shuffle/unsafe/UnsafeShuffleWriter.java,
line 455
\[r11\]](https://reviewable.io:443/reviews/apache/spark/7770#-JvjyFnipw8r_uM6llmJ)**
([raw
file](https://github.com/apache/spark/blob/6aa2f7a8c2f4eb1de6281593326dce5a92d5c1e3/core/src/main/java/org/apache/spark/shuffle/unsafe/UnsafeShuffleWriter.java#L455)):</sup>
This could possibly be null due to mocking. Do you remember which tests
this was null in?
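For context, a minimal sketch of how mocking yields a null here (the class is
hypothetical, standing in for whatever the test mocks): unstubbed methods with
reference return types on a Mockito mock return null.
```scala
import org.mockito.Mockito.mock

// Hypothetical dependency, standing in for whatever the test mocks.
class TaskDeps { def metrics(): Object = new Object }

object MockNullDemo extends App {
  val deps = mock(classOf[TaskDeps])
  // No stubbing, so Mockito's default answer kicks in: null for objects.
  assert(deps.metrics() == null)
}
```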
---
<sup>**[core/src/main/java/org/apache/spark/shuffle/unsafe/UnsafeShuffleWriter.java,
line 459
\[r11\]](https://reviewable.io:443/reviews/apache/spark/7770#-JvjySxUYG3yq_glTmT-)**
([raw
file](https://github.com/apache/spark/blob/6aa2f7a8c2f4eb1de6281593326dce5a92d5c1e3/core/src/main/java/org/apache/spark/shuffle/unsafe/UnsafeShuffleWriter.java#L459)):</sup>
Why is this Java conversion necessary? As far as I know, you can still call
methods on Scala maps from Java, although you might end up with some
weird-looking imports.
---
<sup>**[core/src/main/scala/org/apache/spark/Accumulators.scala, line 157
\[r11\]](https://reviewable.io:443/reviews/apache/spark/7770#-Jvjz0QAGo9oQR2jYhzP)**
([raw
file](https://github.com/apache/spark/blob/6aa2f7a8c2f4eb1de6281593326dce5a92d5c1e3/core/src/main/scala/org/apache/spark/Accumulators.scala#L157)):</sup>
I fear that this could mask bugs if TaskContext is null when we're trying
to deserialize an external accumulator. Instead of doing a null check here,
could you write out the `isInternal` flag and check it to decide whether to
register? If you do that, could you also add a comment cross-referencing the
place where internal accumulators are registered?
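Roughly what I have in mind (the field and hook names below are a sketch, not
`Accumulable`'s actual internals): serialize the flag and branch on it during
deserialization instead of null-checking `TaskContext`.
```scala
import java.io.ObjectInputStream

// Sketch: `isInternal` is a plain (non-transient) field, so it survives
// serialization and can drive the registration decision on the read side.
class AccumulableSketch(val isInternal: Boolean) extends Serializable {
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    if (!isInternal) {
      // External accumulators register with the running task here.
      // Internal accumulators are registered elsewhere -- the real code
      // should carry a comment cross-referencing that registration site.
    }
  }
}
```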
---
<sup>**[core/src/main/scala/org/apache/spark/Accumulators.scala, line 264
\[r11\]](https://reviewable.io:443/reviews/apache/spark/7770#-JvjzLl6KJl6UVVA1CmA)**
([raw
file](https://github.com/apache/spark/blob/6aa2f7a8c2f4eb1de6281593326dce5a92d5c1e3/core/src/main/scala/org/apache/spark/Accumulators.scala#L264)):</sup>
Name boolean parameters? IntelliJ likes to complain about this.
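For illustration (the helper below is hypothetical, just to show the call-site
difference):
```scala
object RegistrationSketch {
  // Hypothetical helper with a boolean flag, like the call sites here.
  def register(name: String, internal: Boolean): Unit = ()

  register("peakExecutionMemory", true)             // opaque at the call site
  register("peakExecutionMemory", internal = true)  // self-documenting
}
```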
---
<sup>**[core/src/main/scala/org/apache/spark/Accumulators.scala, line 268
\[r11\]](https://reviewable.io:443/reviews/apache/spark/7770#-JvjzK9DHuqJ_ogH2Wg4)**
([raw
file](https://github.com/apache/spark/blob/6aa2f7a8c2f4eb1de6281593326dce5a92d5c1e3/core/src/main/scala/org/apache/spark/Accumulators.scala#L268)):</sup>
Name boolean parameters?
---
<sup>**[core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala,
line 791
\[r1\]](https://reviewable.io:443/reviews/apache/spark/7770#-JvVHinFI4nT0TEsx8sk-r1-791)**
([raw
file](https://github.com/apache/spark/blob/5b5e6f36b8a0e37f1953e12c438e01c58872e5fa/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L791)):</sup>
Will this change break any user programs that may have relied on the old
behavior? Was the old behavior specified?
---
<sup>**[core/src/main/scala/org/apache/spark/scheduler/Stage.scala, line 78
\[r11\]](https://reviewable.io:443/reviews/apache/spark/7770#-Jvk-IpC7oolKQcwsJeO)**
([raw
file](https://github.com/apache/spark/blob/6aa2f7a8c2f4eb1de6281593326dce5a92d5c1e3/core/src/main/scala/org/apache/spark/scheduler/Stage.scala#L78)):</sup>
Should this comment describe what happens during partial stage
recomputations?
---
<sup>**[core/src/main/scala/org/apache/spark/TaskContext.scala, line 65
\[r11\]](https://reviewable.io:443/reviews/apache/spark/7770#-Jvk-YWmoTLAUs2KfDN3)**
([raw
file](https://github.com/apache/spark/blob/6aa2f7a8c2f4eb1de6281593326dce5a92d5c1e3/core/src/main/scala/org/apache/spark/TaskContext.scala#L65)):</sup>
Could use a `@VisibleForTesting` annotation here.
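For example (the member name below is hypothetical), Guava's annotation
documents that the wider visibility exists only for tests:
```scala
package org.apache.spark

import com.google.common.annotations.VisibleForTesting

object TaskContextSketch {
  // Signals that this is exposed only so tests can call it.
  @VisibleForTesting
  private[spark] def unset(): Unit = ()
}
```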
---
<sup>**[core/src/main/scala/org/apache/spark/TaskContext.scala, line 67
\[r11\]](https://reviewable.io:443/reviews/apache/spark/7770#-Jvk-blSJZ0ZULb49BEx)**
([raw
file](https://github.com/apache/spark/blob/6aa2f7a8c2f4eb1de6281593326dce5a92d5c1e3/core/src/main/scala/org/apache/spark/TaskContext.scala#L67)):</sup>
I thought that you could call `private[spark]` and `protected[spark]`
methods from Java?
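For reference, the reason this generally works (names in the sketch are
hypothetical): the JVM has no equivalent of Scala's package-qualified
modifiers, so scalac emits these members as public bytecode, which Java
callers can reach.
```scala
package org.apache.spark

// Both the class and the method compile to *public* bytecode, because the
// JVM cannot express `private[spark]`. Java code in the project can call
// `new InteropHelper().reset()` directly; only Scala sources outside
// org.apache.spark are stopped, and that check happens at compile time.
private[spark] class InteropHelper {
  private[spark] def reset(): Unit = ()
}
```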
---
<sup>**[core/src/test/scala/org/apache/spark/CacheManagerSuite.scala, line
89
\[r11\]](https://reviewable.io:443/reviews/apache/spark/7770#-Jvk1DKhxW_LXgEZcwle)**
([raw
file](https://github.com/apache/spark/blob/6aa2f7a8c2f4eb1de6281593326dce5a92d5c1e3/core/src/test/scala/org/apache/spark/CacheManagerSuite.scala#L89)):</sup>
Given that we removed local execution in 1.5, we might be able to remove
this code as well. That doesn't need to happen in this PR; I just wanted to
note it since I happened to spot it here.
---
<sup>**[sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala,
line 1621
\[r11\]](https://reviewable.io:443/reviews/apache/spark/7770#-Jvk1l80_3dvUKbI95U6)**
([raw
file](https://github.com/apache/spark/blob/6aa2f7a8c2f4eb1de6281593326dce5a92d5c1e3/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L1621)):</sup>
Instead of using `originalValue` and a `finally` block, this can be
slightly simplified by using the new `withSQLConf` helper method from
`SQLTestUtils` (which is mixed into this suite). Take a look at the other uses
in this file; it should be a straightforward cleanup.
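Sketch of the suggested cleanup (the config key is a placeholder for whichever
one this test toggles); `withSQLConf` applies the override and restores the
previous value in a `finally` for you:
```scala
// Inside a suite that mixes in SQLTestUtils:
test("query under the overridden conf") {
  withSQLConf("spark.sql.someFlag" -> "true") {
    // ... run the query and assertions; no manual originalValue bookkeeping.
  }
}
```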
---
<sup>**[unsafe/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java,
line 343
\[r11\]](https://reviewable.io:443/reviews/apache/spark/7770#-Jvk20Ae8fgwebtW_UOT)**
([raw
file](https://github.com/apache/spark/blob/82f47b811607a1eeeecba437fe0ffc15d4e5f9ec/unsafe/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java#L343)):</sup>
This change now conflicts with the refactoring in Reynold's latest patch.
---
Comments from the [review on
Reviewable.io](https://reviewable.io:443/reviews/apache/spark/7770)