LuciferYang edited a comment on pull request #29638:
URL: https://github.com/apache/spark/pull/29638#issuecomment-686890566
> Hmm, why this is needed? Firstly I thought CostBasedJoinReorder will
produce non-deterministic for same query. But I looked at the JIRA description,
seems for different input, the rule will produce different output. Doesn't it
sound reasonable? Different input causes different output.
@viirya Sorry, I didn't describe it clearly. Actually, we found 2
problems in SPARK-32526:
1. With the same Scala version, different inputs produce different outputs, as I
described in SPARK-32687. For example:
```
d1.join(t3).join(t4).join(f1).join(d3).join(d2)
.where((nameToAttr("d1_c2") === nameToAttr("t3_c1")) &&
(nameToAttr("t3_c2") === nameToAttr("t4_c2")) &&
(nameToAttr("d1_pk") === nameToAttr("f1_fk1")) &&
(nameToAttr("f1_fk2") === nameToAttr("d2_pk")) &&
(nameToAttr("f1_fk3") === nameToAttr("d3_pk")))
```
and
```
d1.join(t3).join(f1).join(d2).join(t4).join(d3)
.where((nameToAttr("d1_c2") === nameToAttr("t3_c1")) &&
(nameToAttr("t3_c2") === nameToAttr("t4_c2")) &&
(nameToAttr("d1_pk") === nameToAttr("f1_fk1")) &&
(nameToAttr("f1_fk2") === nameToAttr("d2_pk")) &&
(nameToAttr("f1_fk3") === nameToAttr("d3_pk")))
```
produce different optimization results. I think this is acceptable if the
candidates have the same cost, but @cloud-fan has a different view in
https://github.com/apache/spark/pull/29434, and I'm not sure I understand it
correctly.
2. With different Scala versions (2.12 vs. 2.13), the same input may produce
different outputs. For example:
```
d1.join(t3).join(t4).join(f1).join(d2).join(t5).join(t6).join(d3).join(t1).join(t2)
.where((nameToAttr("d1_c2") === nameToAttr("t3_c1")) &&
(nameToAttr("t3_c2") === nameToAttr("t4_c2")) &&
(nameToAttr("d1_pk") === nameToAttr("f1_fk1")) &&
(nameToAttr("f1_fk2") === nameToAttr("d2_pk")) &&
(nameToAttr("d2_c2") === nameToAttr("t5_c1")) &&
(nameToAttr("t5_c2") === nameToAttr("t6_c2")) &&
(nameToAttr("f1_fk3") === nameToAttr("d3_pk")) &&
(nameToAttr("d3_c2") === nameToAttr("t1_c1")) &&
(nameToAttr("t1_c2") === nameToAttr("t2_c2")))
```
produces different optimization results in Scala 2.12 and Scala 2.13. This PR
can also fix this problem. If everyone thinks that `different input causes
different output` is reasonable, I will close this PR first. But we may still
need to resolve problem 2, so I will describe it in a separate JIRA and try
to fix it there.
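
For illustration only, here is a minimal sketch of how equal-cost candidates can make the chosen plan depend on the order in which candidates are seen. It is not the `CostBasedJoinReorder` implementation: `Candidate`, `pickFirstCheapest`, and `pickStable` are hypothetical names, and the sketch assumes ties are resolved by keeping the first cheapest candidate, which may or may not match what the rule does.

```scala
// Hypothetical model, NOT Spark's CostBasedJoinReorder code.
object TieBreakSketch {
  // A candidate join order together with its estimated cost.
  final case class Candidate(order: Seq[String], cost: Long)

  // Keeps the first candidate with the lowest cost; a later candidate with
  // an equal cost never replaces it, so the result depends on input order.
  def pickFirstCheapest(candidates: Seq[Candidate]): Candidate =
    candidates.reduceLeft((best, next) => if (next.cost < best.cost) next else best)

  // Breaks ties on a stable key (the join order rendered as a string), so the
  // result no longer depends on the order in which candidates are enumerated.
  def pickStable(candidates: Seq[Candidate]): Candidate =
    candidates.minBy(c => (c.cost, c.order.mkString(",")))

  // Two made-up candidates with the same cost.
  val a: Candidate = Candidate(Seq("d1", "t3", "t4"), 100L)
  val b: Candidate = Candidate(Seq("d1", "f1", "d2"), 100L)

  def main(args: Array[String]): Unit = {
    // Order-dependent: swapping the inputs changes the winner.
    assert(pickFirstCheapest(Seq(a, b)) == a)
    assert(pickFirstCheapest(Seq(b, a)) == b)
    // Order-independent: both enumerations pick the same candidate.
    assert(pickStable(Seq(a, b)) == pickStable(Seq(b, a)))
  }
}
```

Under this assumption, a deterministic tie-break is what would make both the reordered-input case and any enumeration-order difference between Scala versions yield the same plan.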