shrirangmhalgi commented on PR #56620: URL: https://github.com/apache/spark/pull/56620#issuecomment-4754884173
**0 blocking, 1 non-blocking, 1 nit** This is a well-motivated fix for a real AQE correctness issue; the partition-keyed accumulator design is sound. Two items below. ### Verification Traced the race window end-to-end: `TableCacheQueryStageExec.doMaterialize()` submits a `submitJob` per reference → duplicate partition computes land on distinct executors → the old `materializedPartitions.add(1L)` inflates past `numPartitions` while the row-producing partition is still building → `isCachedRDDLoaded` latches true → `computeStats()` enters the materialized branch with `rowCountStats.value == 0` → `AQEPropagateEmptyRelation.getEstimatedRowCount` reads `rowCount == Some(0)` → empty relation. **With the fix**: `ConcurrentHashMap.put` is last-write-wins per partition ID, `size()` is linearizable, so `accumulatedNumPartitions` cannot exceed the number of distinct completed partitions. The `synchronized` on `isCachedRDDLoaded` and `clearCache` share the same monitor — the one-way latch cannot be poisoned by a concurrent reset. DAGScheduler event loop serializes `merge` calls. ### Design / architecture (1) `InMemoryRelation.scala` – `Either[LegacyAccumulators, PartitionKeyedAccumulator]`: the left branch re-enables the exact bug this PR fixes. If the intent is a safety switch, a conf that holds off the loaded-latch entirely (always report static plan stats until the RDD action returns) would be safer than re-enabling broken counting. Otherwise, always using the keyed accumulator and dropping the dual-path branching simplifies every read site significantly. - see inline ### Suggestions (1) `InMemoryRelation.scala` – "exact, de-duplicated row count" comment: overstates precision for indeterminate stages - see inline -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
