[ https://issues.apache.org/jira/browse/SPARK-17891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun reopened SPARK-17891: ----------------------------------- > SQL-based three column join loses first column > ---------------------------------------------- > > Key: SPARK-17891 > URL: https://issues.apache.org/jira/browse/SPARK-17891 > Project: Spark > Issue Type: Bug > Affects Versions: 2.0.1 > Reporter: Eli Miller > Attachments: test.tgz > > > Hi all, > I hope that this is not a known issue (I haven't had any luck finding > anything similar in Jira or the mailing lists but I could be searching with > the wrong terms). I just started to experiment with Spark SQL and am seeing > what appears to be a bug. When using Spark SQL to join two tables with a > three column inner join, the first column join is ignored. The example code > that I have starts with two tables *T1*: > {noformat} > +---+---+---+---+ > | a| b| c| d| > +---+---+---+---+ > | 1| 2| 3| 4| > +---+---+---+---+ > {noformat} > and *T2*: > {noformat} > +---+---+---+---+ > | b| c| d| e| > +---+---+---+---+ > | 2| 3| 4| 5| > | -2| 3| 4| 6| > | 2| -3| 4| 7| > +---+---+---+---+ > {noformat} > Joining *T1* to *T2* on *b*, *c* and *d* (in that order): > {code:sql} > SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e > FROM t1, t2 > WHERE t1.b = t2.b AND t1.c = t2.c AND t1.d = t2.d > {code} > results in the following (note that *T1.b* != *T2.b* in the first row): > {noformat} > +---+---+---+---+---+---+---+---+ > | a| b| b| c| c| d| d| e| > +---+---+---+---+---+---+---+---+ > | 1| 2| -2| 3| 3| 4| 4| 6| > | 1| 2| 2| 3| 3| 4| 4| 5| > +---+---+---+---+---+---+---+---+ > {noformat} > Switching the predicate order to *c*, *b* and *d*: > {code:sql} > SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e > FROM t1, t2 > WHERE t1.c = t2.c AND t1.b = t2.b AND t1.d = t2.d > {code} > yields different results (now *T1.c* != *T2.c* in the first row): > {noformat} > +---+---+---+---+---+---+---+---+ > | a| b| b| c| c| d| d| e| > +---+---+---+---+---+---+---+---+ > | 1| 2| 2| 3| -3| 4| 4| 7| > | 1| 2| 2| 3| 3| 4| 4| 5| > +---+---+---+---+---+---+---+---+ > {noformat} > Is this expected? > I started to research this a bit and one thing that jumped out at me was the > ordering of the HashedRelationBroadcastMode concatenation in the plan (this > is from the *b*, *c*, *d* ordering): > {noformat} > ... > *Project [a#0, b#1, b#9, c#2, c#10, d#3, d#11, e#12] > +- *BroadcastHashJoin [b#1, c#2, d#3], [b#9, c#10, d#11], Inner, BuildRight > :- *Project [a#0, b#1, c#2, d#3] > : +- *Filter ((isnotnull(b#1) && isnotnull(c#2)) && isnotnull(d#3)) > : +- *Scan csv [a#0,b#1,c#2,d#3] Format: CSV, InputPaths: > file:/home/eli/git/IENG/what/target/classes/t1.csv, PartitionFilters: [], > PushedFilters: [IsNotNull(b), IsNotNull(c), IsNotNull(d)], ReadSchema: > struct<a:int,b:int,c:int,d:int> > +- BroadcastExchange > HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, > true] as bigint), 32) | (cast(input[1, int, true] as bigint) & 4294967295)), > 32) | (cast(input[2, int, true] as bigint) & 4294967295)))) > +- *Project [b#9, c#10, d#11, e#12] > +- *Filter ((isnotnull(c#10) && isnotnull(b#9)) && isnotnull(d#11)) > +- *Scan csv [b#9,c#10,d#11,e#12] Format: CSV, InputPaths: > file:/home/eli/git/IENG/what/target/classes/t2.csv, PartitionFilters: [], > PushedFilters: [IsNotNull(c), IsNotNull(b), IsNotNull(d)], ReadSchema: > struct<b:int,c:int,d:int,e:int>] > {noformat} > If this concatenated byte array is ever truncated to 64 bits in a comparison, > the leading column will be lost and could result in this behavior. > I will attach my example code and data. Please let me know if I can provide > any other details. > Best regards, > Eli -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org