xinyuezg commented on PR #8965: URL: https://github.com/apache/incubator-gluten/pull/8965#issuecomment-3353116400
> > mismatches
>
> @xinyuezg Hi, If it is FULL OUTER JOIN and empty join condition, there should be NO mismatched rows in build side, right?

Hi @WangGuangxin - that's correct: with FULL OUTER and an empty join condition, there should be no build-side mismatches **when the probe side is non-empty**.

The corner case we're hitting is about global vs. per-executor emptiness on Spark:

* If the probe side is globally empty (i.e. no executor receives any probe rows), the correct result is all build rows padded with NULLs on the probe side.
* If the probe side is non-empty overall, but some executors happen to get empty probe partitions, those executors **must not** emit build-mismatch rows. Only the cross-product rows from executors that have probe data should be produced; there are no mismatches at all in this case.

In current Velox code, a single Task/driver can decide this correctly because probe and build are co-located. On Spark, the probe side is distributed across executors, and Velox lacks a **global "did any executor see probe rows?"** signal. With PR #8965, a driver that saw an empty probe partition may incorrectly emit build-mismatch rows even though other executors did see probe data, which produces extra rows.

Concretely, consider two executors:

* Executor X: probe partition non-empty -> emits cross-join rows (correct).
* Executor Y: probe partition empty -> should emit nothing in this scenario (because the probe side is not globally empty). Today Y can still emit build-side NULL rows, which is wrong.

Because Velox doesn't have a task-global aggregation for this today, we recommend falling back to Spark for all FULL OUTER BNLJ cases (with or without a condition). To support BNLJ without a condition in Velox, we probably need to add global coordination (an OR of a `sawProbeRows` flag across executors) so that build-mismatch rows are emitted only when the probe side is globally empty, and are emitted only once.
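To make the two-executor scenario concrete, here is a minimal Python simulation. It is only a sketch under stated assumptions: the function names (`executor_output`, `naive_join`, `coordinated_join`) are hypothetical and are not Velox or Gluten APIs, each list in `probe_partitions` stands for one executor's probe partition, and the build side is assumed to be broadcast in full to every executor.

```python
from typing import List, Optional, Tuple

# A joined row: (probe value, build value); None stands for SQL NULL.
Row = Tuple[Optional[str], Optional[str]]


def executor_output(probe_partition: List[str],
                    build_rows: List[str]) -> Tuple[List[Row], bool]:
    """Per-executor work for FULL OUTER BNLJ without a condition.

    Returns the rows this executor can emit from local data alone, plus a
    local sawProbeRows flag (hypothetical, named after the comment above).
    """
    if build_rows:
        # No condition: every probe row matches every build row.
        rows: List[Row] = [(p, b) for p in probe_partition for b in build_rows]
    else:
        # Empty build side: probe rows are unmatched and padded with NULL.
        rows = [(p, None) for p in probe_partition]
    return rows, len(probe_partition) > 0


def naive_join(probe_partitions: List[List[str]],
               build_rows: List[str]) -> List[Row]:
    """Each executor decides from its own partition only (the problem case)."""
    out: List[Row] = []
    for partition in probe_partitions:
        rows, saw_probe_rows = executor_output(partition, build_rows)
        out.extend(rows)
        if not saw_probe_rows:
            # Wrong when any other executor did see probe rows.
            out.extend((None, b) for b in build_rows)
    return out


def coordinated_join(probe_partitions: List[List[str]],
                     build_rows: List[str]) -> List[Row]:
    """OR the per-executor flags, then emit build mismatches at most once."""
    out: List[Row] = []
    any_probe_rows = False
    for partition in probe_partitions:
        rows, saw_probe_rows = executor_output(partition, build_rows)
        out.extend(rows)
        any_probe_rows = any_probe_rows or saw_probe_rows
    if not any_probe_rows:
        # Probe side is globally empty: pad each build row with NULL once.
        out.extend((None, b) for b in build_rows)
    return out


if __name__ == "__main__":
    build = ["b1", "b2"]
    # Executor X has probe rows, executor Y has an empty probe partition.
    print(len(naive_join([["p1", "p2"], []], build)))
    print(len(coordinated_join([["p1", "p2"], []], build)))
```

In this simulation the naive variant adds two NULL-padded build rows from the empty partition on top of the four cross-product rows, while the coordinated variant returns only the four cross-product rows. When every partition is empty, the naive variant also duplicates the NULL-padded rows once per executor, which is why the flag has to gate a single emission.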
