[ https://issues.apache.org/jira/browse/SPARK-20313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15972142#comment-15972142 ]
Takeshi Yamamuro commented on SPARK-20313: ------------------------------------------ What's the issue that you'd like to point out? I think the description is ambiguous. > Possible lack of join optimization when partitions are in the join condition > ---------------------------------------------------------------------------- > > Key: SPARK-20313 > URL: https://issues.apache.org/jira/browse/SPARK-20313 > Project: Spark > Issue Type: Improvement > Components: Optimizer > Affects Versions: 2.1.0 > Reporter: Albert Meltzer > > Given two tables T1 and T2, partitioned on column part1, the following have > vastly different execution performance: > // initial, slow > {noformat} > val df1 = // load data from T1 > .filter(functions.col("part1").between("val1", "val2") > val df2 = // load data from T2 > .filter(functions.col("part1").between("val1", "val2") > val df3 = df1.join(df2, Seq("part1", "col1")) > {noformat} > // manually optimized, considerably faster > {noformat} > val df1 = // load data from T1 > val df2 = // load data from T2 > val part1values = Seq(...) // a collection of values between val1 and val2 > val df3 = part1values > .map(part1value => { > val df1filtered = df1.filter(functions.col("part1") === part1value) > val df2filtered = df2.filter(functions.col("part1") === part1value) > df1filtered.join(df2filtered, Seq("col1")) // part1 removed from join > }) > .reduce(_ union _) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org