shrirangmhalgi commented on PR #56620:
URL: https://github.com/apache/spark/pull/56620#issuecomment-4754884173

   **0 blocking, 1 non-blocking, 1 nit**
   
   This is a well-motivated fix for a real AQE correctness issue; the 
partition-keyed accumulator design is sound. Two items below.
   
   ### Verification
   Traced the race window end-to-end: 
`TableCacheQueryStageExec.doMaterialize()` submits a `submitJob` per reference 
→ duplicate partition computes land on distinct executors → the old 
`materializedPartitions.add(1L)` inflates past `numPartitions` while the 
row-producing partition is still building → `isCachedRDDLoaded` latches true → 
`computeStats()` enters the materialized branch with `rowCountStats.value == 0` 
→ `AQEPropagateEmptyRelation.getEstimatedRowCount` reads `rowCount == Some(0)` 
→ empty relation.
   
   **With the fix**: `ConcurrentHashMap.put` is last-write-wins per partition 
ID, `size()` is linearizable, so `accumulatedNumPartitions` cannot exceed the 
number of distinct completed partitions. The `synchronized` on 
`isCachedRDDLoaded` and `clearCache` share the same monitor — the one-way latch 
cannot be poisoned by a concurrent reset. DAGScheduler event loop serializes 
`merge` calls. 
   
   ### Design / architecture (1)
   `InMemoryRelation.scala` – `Either[LegacyAccumulators, 
PartitionKeyedAccumulator]`: the left branch re-enables the exact bug this PR 
fixes. If the intent is a safety switch, a conf that holds off the loaded-latch 
entirely (always report static plan stats until the RDD action returns) would 
be safer than re-enabling broken counting. Otherwise, always using the keyed 
accumulator and dropping the dual-path branching simplifies every read site 
significantly. - see inline
   
   ### Suggestions (1)
   `InMemoryRelation.scala` – "exact, de-duplicated row count" comment: 
overstates precision for indeterminate stages - see inline
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to