Github user rednaxelafx commented on the issue:

    https://github.com/apache/spark/pull/21039
  
    I just checked the same test in Build 4695, which still has this change, 
and the test passed:
    
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/4695/testReport/junit/org.apache.spark.sql/TPCDSQuerySuite/q61/
    Something weird must have happened when running this test in Build 4694...
    
    re:
    > shall we just don't do the nulling out? It wouldn't help the GC a lot.
    @cloud-fan I can certainly do that -- just mark the transient, removing the 
nulling code.
    But what I'm concerned about is that the test failure was exposing some 
other problem that we might not have anticipated before.
    
    The whole-stage codegen logic in most physical operators assume that 
codegen happens on a single thread. As such, it might use instance fields on 
the operator to pass state between the `doProduce` and `doConsume` methods, as 
is the case with `bufVars` in `HashAggregateExec`.
    Now, imagine a scenario where I take a DataFrame `df`:
    ```scala
    val plan = df.queryExecution.executedPlan
    
    // then on Thread1
    plan.execute // triggers whole-stage codegen
    
    // at around the same time, on Thread2
    plan.execute // also triggers whole-stage codegen
    ```
    Now we're going to be performing whole-stage codegen on 2 different 
threads, at around the same time, on the exact same plan (if there are 
`HashAggregateExec`s involved, they're gonna be the same instances).
    In this case, the two threads might accidentally "cross-talk" because of 
the way whole-stage codegen uses instance fields to pass state between 
`doProduce` and `doConsume`. It's basically "pure luck" that both threads 
finish the codegen correctly (despite one thread might be accidentally using 
the state from (overwritten by) another thread).
    
    If this theory holds, nulling out the state introduces a "leak" of the 
"cross-talk", so it's now possible to see an NPE if the timing is just right. 
But it's fairly scary already even without the nulling...
    
    Anyway, I'll resend this PR with only the `@transient` part of the change, 
since that won't cause any regressions for sure.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to