Github user rednaxelafx commented on the issue: https://github.com/apache/spark/pull/21039 I just checked the same test in Build 4695, which still has this change, and the test passed: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/4695/testReport/junit/org.apache.spark.sql/TPCDSQuerySuite/q61/ Something weird must have happened when running this test in Build 4694... re: > shall we just don't do the nulling out? It wouldn't help the GC a lot. @cloud-fan I can certainly do that -- just mark the transient, removing the nulling code. But what I'm concerned about is that the test failure was exposing some other problem that we might not have anticipated before. The whole-stage codegen logic in most physical operators assume that codegen happens on a single thread. As such, it might use instance fields on the operator to pass state between the `doProduce` and `doConsume` methods, as is the case with `bufVars` in `HashAggregateExec`. Now, imagine a scenario where I take a DataFrame `df`: ```scala val plan = df.queryExecution.executedPlan // then on Thread1 plan.execute // triggers whole-stage codegen // at around the same time, on Thread2 plan.execute // also triggers whole-stage codegen ``` Now we're going to be performing whole-stage codegen on 2 different threads, at around the same time, on the exact same plan (if there are `HashAggregateExec`s involved, they're gonna be the same instances). In this case, the two threads might accidentally "cross-talk" because of the way whole-stage codegen uses instance fields to pass state between `doProduce` and `doConsume`. It's basically "pure luck" that both threads finish the codegen correctly (despite one thread might be accidentally using the state from (overwritten by) another thread). If this theory holds, nulling out the state introduces a "leak" of the "cross-talk", so it's now possible to see an NPE if the timing is just right. But it's fairly scary already even without the nulling... Anyway, I'll resend this PR with only the `@transient` part of the change, since that won't cause any regressions for sure.
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org