Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/7770#issuecomment-127097979
Reviewed 1 of 24 files at r4, 2 of 10 files at r6, 5 of 13 files at r7, 1 of 1 file at r8, 1 of 4 files at r9, 1 of 3 files at r10, 16 of 20 files at r12, 2 of 2 files at r15, 3 of 3 files at r16.
Review status: all files reviewed at latest revision, 12 unresolved
discussions, all commit checks successful.
---
<sup>**[core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala,
line 791
\[r1\]](https://reviewable.io:443/reviews/apache/spark/7770#-JvVHinFI4nT0TEsx8sk-r1-791)**
([raw
file](https://github.com/apache/spark/blob/5b5e6f36b8a0e37f1953e12c438e01c58872e5fa/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L791)):</sup>
Ah, right. SGTM.
---
<sup>**[core/src/test/scala/org/apache/spark/AccumulatorSuite.scala, line
275
\[r16\]](https://reviewable.io:443/reviews/apache/spark/7770#-JvlEDWoM1qKnCpUfRLP)**
([raw
file](https://github.com/apache/spark/blob/c00a197a2f78844f4f18e52364d55116cb0a374a/core/src/test/scala/org/apache/spark/AccumulatorSuite.scala#L275)):</sup>
Could also just throw an exception here.
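For example, something like this (just a sketch; the exception type and message are up to you, not taken from the suite):

```scala
// Fail fast if this code path is ever reached, instead of returning a
// sentinel value that a later assertion has to catch.
throw new UnsupportedOperationException("this branch should never execute")
```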
---
<sup>**[core/src/test/scala/org/apache/spark/ui/StagePageSuite.scala, line
53
\[r13\]](https://reviewable.io:443/reviews/apache/spark/7770#-JvkmpXznXanlsxkO4pa)**
([raw
file](https://github.com/apache/spark/blob/a87b4d0e06e407444b746619e1dde96ae0bc4daa/core/src/test/scala/org/apache/spark/ui/StagePageSuite.scala#L53)):</sup>
Minor Mockito trick that I like using: the second argument to `mock()` specifies the default answer returned for non-stubbed methods. If you pass `Mockito.RETURNS_SMART_NULLS` as that second argument, then you'll get much more detailed error messages when an NPE occurs due to a non-mocked/stubbed method.
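A self-contained toy showing the idea (the `JobStore`/`JobInfo` types here are invented for illustration; the real suite would pass the same second argument when mocking its Spark classes):

```scala
import org.mockito.Mockito
import org.mockito.Mockito.mock

// Invented interface, purely for illustration.
trait JobStore { def lookup(id: Int): JobInfo }
class JobInfo(val name: String)

object SmartNullsDemo {
  def main(args: Array[String]): Unit = {
    // The second argument to mock() is the default Answer for un-stubbed calls.
    val store = mock(classOf[JobStore], Mockito.RETURNS_SMART_NULLS)
    val info = store.lookup(42)  // un-stubbed: returns a "smart null" proxy
    // Dereferencing the smart null throws SmartNullPointerException, whose
    // message names the un-stubbed lookup(42) call instead of a bare NPE.
    println(info.name)
  }
}
```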
---
<sup>**[core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala,
line 408
\[r16\]](https://reviewable.io:443/reviews/apache/spark/7770#-Jvl5D7f5rgRjCB303Iq)**
([raw
file](https://github.com/apache/spark/blob/c00a197a2f78844f4f18e52364d55116cb0a374a/core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala#L408)):</sup>
This test seems like it could become brittle in the future. To guard against that, it might be a good idea to integrate with the listener / metrics systems to verify whether spilling occurred. It's fine to leave this to a followup patch, though, given that this is not a huge concern.
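Rough sketch of what I mean (`SpillCountingListener` is an invented name; the spill counters on `TaskMetrics` are the existing ones):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Accumulates the spill counters that TaskMetrics already tracks, so a test
// can assert directly on whether spilling happened.
class SpillCountingListener extends SparkListener {
  @volatile var bytesSpilled = 0L
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      bytesSpilled += metrics.memoryBytesSpilled + metrics.diskBytesSpilled
    }
  }
}
```

Register it with `sc.addSparkListener(...)` before running the job, then assert `bytesSpilled > 0` (or `== 0`) afterwards.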
---
<sup>**[sql/core/src/main/scala/org/apache/spark/sql/execution/GeneratedAggregate.scala,
line 306
\[r16\]](https://reviewable.io:443/reviews/apache/spark/7770#-Jvl6qWYX_EY-T3clcvi)**
([raw
file](https://github.com/apache/spark/blob/c00a197a2f78844f4f18e52364d55116cb0a374a/sql/core/src/main/scala/org/apache/spark/sql/execution/GeneratedAggregate.scala#L306)):</sup>
I think that we should move this call up slightly so that it takes place right before we return this result iterator. That will keep the memory accounting from under-reporting peak memory if we later make `aggregationMap.iterator()` a destructive iterator that frees pages as it returns records.
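To make the failure mode concrete, here's a self-contained toy (all names invented) where the map reports current usage and the iterator frees pages as it goes:

```scala
object PeakSnapshotOrdering {
  // Toy map whose destructive iterator releases a "page" per record.
  final class ToyMap(var usedBytes: Long, records: Int) {
    def destructiveIterator: Iterator[Int] =
      Iterator.tabulate(records) { i => usedBytes -= 1024; i }
  }

  def main(args: Array[String]): Unit = {
    val map = new ToyMap(usedBytes = 4096, records = 4)
    val recordedUpFront = map.usedBytes        // 4096: what we want to report
    map.destructiveIterator.foreach(_ => ())   // consume; pages get freed
    val recordedAfterwards = map.usedBytes     // 0: under-reported
    println(s"up front: $recordedUpFront, afterwards: $recordedAfterwards")
  }
}
```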
---
<sup>**[sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala,
line 78
\[r16\]](https://reviewable.io:443/reviews/apache/spark/7770#-Jvl8u8dGfGCUytrYyfN)**
([raw
file](https://github.com/apache/spark/blob/c00a197a2f78844f4f18e52364d55116cb0a374a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala#L78)):</sup>
One thought: a broadcast variable is used by multiple tasks but is only created once per JVM, so I think that this will overcount the broadcast variable's resource usage. For instance, with the current accounting: if you have a four-core machine with 1 gigabyte of RAM per core and a 1.5 gigabyte broadcast variable, then you might see a peak execution memory of 4 * 1.5 GB = 6 GB, which is more than the machine's total RAM even though actual consumption was never that high.
One way to deal with this would be to split a task's memory consumption into two categories, private and shared, similar to how operating systems' `top` tools (or Activity Monitor on OS X) report shared memory separately from applications' private memory. This change should be fairly easy to make, but I don't think that it needs to be done as part of this patch.
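Toy sketch of that accounting (all names invented):

```scala
object PrivateSharedAccounting {
  // A task's private allocations plus its view of shared resources, keyed by id.
  case class TaskMemory(privateBytes: Long, sharedBytes: Map[String, Long])

  def executorPeak(tasks: Seq[TaskMemory]): Long = {
    val privateTotal = tasks.map(_.privateBytes).sum
    // Deduplicate shared resources by id so each one is counted once per JVM,
    // not once per task.
    val sharedTotal = tasks.flatMap(_.sharedBytes).toMap.values.sum
    privateTotal + sharedTotal
  }

  def main(args: Array[String]): Unit = {
    val broadcast = Map("broadcast-0" -> (1500L << 20))  // the 1.5 GB relation
    val tasks = Seq.fill(4)(TaskMemory(100L << 20, broadcast))
    // Prints ~1900 MB instead of the 6 GB that a naive per-task sum reports.
    println(executorPeak(tasks) >> 20)
  }
}
```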
---
<sup>**[sql/core/src/main/scala/org/apache/spark/sql/execution/sort.scala,
line 81
\[r1\]](https://reviewable.io:443/reviews/apache/spark/7770#-JvVHinQX4wFdRHT7yOV-r1-83)**
([raw
file](https://github.com/apache/spark/blob/5b5e6f36b8a0e37f1953e12c438e01c58872e5fa/sql/core/src/main/scala/org/apache/spark/sql/execution/sort.scala#L81)):</sup>
I'm pretty sure that `.iterator()` can only be called once (given that it ends up calling methods with names like `destructiveSorted...`), so I think we should add some asserts to enforce this implicit contract and then move this metrics-updating code into `ExternalSorter` itself. Pushing the spill-metrics updates closer to the site of the spilling will make it easier to keep the metrics in sync with the underlying writes.
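Something like this shape for the assert (sketch only; names invented):

```scala
// Hands out the destructive iterator exactly once; a second call fails
// loudly instead of silently re-sorting data whose pages were freed.
class OnceIterable[T](produce: () => Iterator[T]) {
  private var consumed = false
  def iterator(): Iterator[T] = {
    assert(!consumed, "iterator() may only be called once: the sort is destructive")
    consumed = true
    produce()
  }
}
```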
---
<sup>**[sql/core/src/test/scala/org/apache/spark/sql/BroadcastJoinSuite.scala,
line 18
\[r16\]](https://reviewable.io:443/reviews/apache/spark/7770#-JvlAdjyYunIYwbkUn9y)**
([raw
file](https://github.com/apache/spark/blob/c00a197a2f78844f4f18e52364d55116cb0a374a/sql/core/src/test/scala/org/apache/spark/sql/BroadcastJoinSuite.scala#L18)):</sup>
Minor nit: since this is testing something in the `sql.execution.joins` package, the test should probably live in that package as well.
---
<sup>**[sql/core/src/test/scala/org/apache/spark/sql/BroadcastJoinSuite.scala,
line 49
\[r16\]](https://reviewable.io:443/reviews/apache/spark/7770#-JvlBEPnOWNFyn4RzkCF)**
([raw
file](https://github.com/apache/spark/blob/c00a197a2f78844f4f18e52364d55116cb0a374a/sql/core/src/test/scala/org/apache/spark/sql/BroadcastJoinSuite.scala#L49)):</sup>
I'm fairly sure that this is going to lead to a "multiple active SparkContexts in the same JVM" failure, since TestSQLContext might have already created an active global context before this runs. Fixing this properly might require some fairly large refactoring of TestSQLContext and our other testing code, so for now we may end up having to disable this test. If we need to do that, let's file a 1.5.0 test-blocker JIRA.
---
Comments from the [review on
Reviewable.io](https://reviewable.io:443/reviews/apache/spark/7770)