[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22318 In the example @viirya described above (https://github.com/apache/spark/pull/22318#issuecomment-426317617), I think the interpretation is unclear to most users and I'm fairly concerned that it could be error-prone... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user peter-toth commented on the issue: https://github.com/apache/spark/pull/22318 @maropu, I don't think we can. Actually this is how we deal with [simpler joins](https://github.com/apache/spark/pull/22318#issuecomment-427080091) Do you think changing the behaviour is unacceptable? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22318 @peter-toth we cannot fix the issue of the description without changing [the existing behaviour](https://github.com/apache/spark/pull/22318#issuecomment-426317617)? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22318 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22318 @srowen the change seems fine to me as I think this does improve the behavior of attributes deduplication (I think it was a bug not rewriting the join condition with the new references). The point is that in order to have a really correct and expected behavior in all the conditions we need to keep a reference of the dataset an attribute is from. That is why I created #21449 and I know @cloud-fan mentioned he had a similar PR for the same reason. But that is another story. In general I think both this and something like #21449 is needed in order to handle properly all the cases. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22318 Oh I see there was indeed more discussion on this, and it does relate to resolving columns to joined DataFrames. I don't know enough to bless this change, but it seems reasonable. @maropu approves, I think, and so does @mgaido91 ? what about @cloud-fan @viirya ? It seems like it's behavior that can be improved but isn't strictly a correctness issue? because it's not quite valid to reference columns from dataframes not involved in the join as a join condition, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user peter-toth commented on the issue: https://github.com/apache/spark/pull/22318 @srowen, I saw your last comment on https://github.com/peter-toth/spark/tree/SPARK-25150. I submitted this PR to solve that ticket and I believe the description here explains what is the real issue there. I would appreciate your thoughts on this PR, unfortunately it got stuck a bit lately. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user peter-toth commented on the issue: https://github.com/apache/spark/pull/22318 Also please consider that currently (and also after this PR) using `b` and `c` from the description: ``` b.join(c, b("id") === b("id"), "inner") ``` is interpreted as ``` b.join(c, b("id") === c("id"), "inner") ``` so my PR just extends this feature to the case with subsequent joins. @cloud-fan , @viirya what do you think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user peter-toth commented on the issue: https://github.com/apache/spark/pull/22318 Thanks @viirya, your analysis is correct. Unfortunately an attribute doesn't have a reference to its dataset so I don't think this scenario can be solved easily. I believe the good solution would be something like https://github.com/apache/spark/pull/21449 But I would argue that my PR still does make sense and an `a("id") === b("id")` condition in a join where `c` is joined is not expected, and actually very likely a typo. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/22318 For a query similar to the one in the PR description: `a.join(b, a("id") === b("id"), "inner").join(c, a("id") === b("id"), "inner"` It is interpreted as a cartesian products now. But after this change, it is as the same as the query: `a.join(b, a("id") === b("id"), "inner").join(c, a("id") === c("id"), "inner")` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user peter-toth commented on the issue: https://github.com/apache/spark/pull/22318 @cloud-fan could you please help me with this PR and take it one step forward? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user peter-toth commented on the issue: https://github.com/apache/spark/pull/22318 @cloud-fan, does the new description defines the scope as you suggested? Is there anything I can add to this PR? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user peter-toth commented on the issue: https://github.com/apache/spark/pull/22318 @cloud-fan , I added some explanation to the description in which cases this PR helps and also where it doesn't. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22318 Can you define the scope of this PR? In which case we should change the references in the join condition? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22318 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22318 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95854/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22318 **[Test build #95854 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95854/testReport)** for PR 22318 at commit [`938bd7f`](https://github.com/apache/spark/commit/938bd7ffab2796a298548f8e9a1a8d2b3f1aa901). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user peter-toth commented on the issue: https://github.com/apache/spark/pull/22318 @cloud-fan this PR doesn't solve that question. There are some hacks in `Dataset.join` to handle `EqualTo` and `EqualNullSafe` with duplicated attributes and those hacks are still required with this PR. But I believe there are other initiatives to solve that long-standing issue like https://github.com/apache/spark/pull/21449 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22318 How does this work? When we have duplicated attributes in the join condition, how can we know which attribute comes from which side? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22318 **[Test build #95854 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95854/testReport)** for PR 22318 at commit [`938bd7f`](https://github.com/apache/spark/commit/938bd7ffab2796a298548f8e9a1a8d2b3f1aa901). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22318 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22318 To make sure we have no regression by this change, I checked the `Analyzer$ResolveReferences`time in TPCDS queries. But, I didn't find actual performance regression. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22318 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22318 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95757/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22318 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22318 **[Test build #95757 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95757/testReport)** for PR 22318 at commit [`938bd7f`](https://github.com/apache/spark/commit/938bd7ffab2796a298548f8e9a1a8d2b3f1aa901). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22318 **[Test build #95757 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95757/testReport)** for PR 22318 at commit [`938bd7f`](https://github.com/apache/spark/commit/938bd7ffab2796a298548f8e9a1a8d2b3f1aa901). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22318 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22318 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95750/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22318 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22318 **[Test build #95750 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95750/testReport)** for PR 22318 at commit [`938bd7f`](https://github.com/apache/spark/commit/938bd7ffab2796a298548f8e9a1a8d2b3f1aa901). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22318 **[Test build #95750 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95750/testReport)** for PR 22318 at commit [`938bd7f`](https://github.com/apache/spark/commit/938bd7ffab2796a298548f8e9a1a8d2b3f1aa901). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22318 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22318 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95723/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22318 **[Test build #95723 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95723/testReport)** for PR 22318 at commit [`809b8a8`](https://github.com/apache/spark/commit/809b8a83b7ec3d62ba6d65f6aff6a7d3175bd3e3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22318 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95720/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22318 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22318 **[Test build #95720 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95720/testReport)** for PR 22318 at commit [`809b8a8`](https://github.com/apache/spark/commit/809b8a83b7ec3d62ba6d65f6aff6a7d3175bd3e3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22318 **[Test build #95723 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95723/testReport)** for PR 22318 at commit [`809b8a8`](https://github.com/apache/spark/commit/809b8a83b7ec3d62ba6d65f6aff6a7d3175bd3e3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22318 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22318 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95714/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22318 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22318 **[Test build #95714 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95714/testReport)** for PR 22318 at commit [`d56f33c`](https://github.com/apache/spark/commit/d56f33cea56f5ad000e6638c7645a1d13e85eb55). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AttributeMap[+A](val baseMap: Map[ExprId, (Attribute, A)])` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22318 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22318 **[Test build #95720 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95720/testReport)** for PR 22318 at commit [`809b8a8`](https://github.com/apache/spark/commit/809b8a83b7ec3d62ba6d65f6aff6a7d3175bd3e3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22318: [SPARK-25150][SQL] Rewrite condition when deduplicate Jo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22318 **[Test build #95714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95714/testReport)** for PR 22318 at commit [`d56f33c`](https://github.com/apache/spark/commit/d56f33cea56f5ad000e6638c7645a1d13e85eb55). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org