Github user rednaxelafx commented on the issue:
https://github.com/apache/spark/pull/21039
I just checked the same test in Build 4695, which still has this change,
and the test passed:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/4695/testReport/junit/org.apache.spark.sql/TPCDSQuerySuite/q61/
Something weird must have happened when running this test in Build 4694...
re:
> shall we just don't do the nulling out? It wouldn't help the GC a lot.
@cloud-fan I can certainly do that -- just mark the transient, removing the
nulling code.
But what I'm concerned about is that the test failure was exposing some
other problem that we might not have anticipated before.
The whole-stage codegen logic in most physical operators assume that
codegen happens on a single thread. As such, it might use instance fields on
the operator to pass state between the `doProduce` and `doConsume` methods, as
is the case with `bufVars` in `HashAggregateExec`.
Now, imagine a scenario where I take a DataFrame `df`:
```scala
val plan = df.queryExecution.executedPlan
// then on Thread1
plan.execute // triggers whole-stage codegen
// at around the same time, on Thread2
plan.execute // also triggers whole-stage codegen
```
Now we're going to be performing whole-stage codegen on 2 different
threads, at around the same time, on the exact same plan (if there are
`HashAggregateExec`s involved, they're gonna be the same instances).
In this case, the two threads might accidentally "cross-talk" because of
the way whole-stage codegen uses instance fields to pass state between
`doProduce` and `doConsume`. It's basically "pure luck" that both threads
finish the codegen correctly (despite one thread might be accidentally using
the state from (overwritten by) another thread).
If this theory holds, nulling out the state introduces a "leak" of the
"cross-talk", so it's now possible to see an NPE if the timing is just right.
But it's fairly scary already even without the nulling...
Anyway, I'll resend this PR with only the `@transient` part of the change,
since that won't cause any regressions for sure.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]