Hi Kevin,

Having given it a first look, I do think that you have hit on something here, and this does not look quite right. I need to look at the multiple AND conditions in the ON clause and see whether they are causing any issues.
Regards,
Gourav Sengupta

On Tue, May 3, 2016 at 8:28 AM, Kevin Peng <kpe...@gmail.com> wrote:
> Davies,
>
> Here is the code that I am typing into the spark-shell along with the
> results (my question is at the bottom):
>
> val dps = sqlContext.read.format("com.databricks.spark.csv")
>   .option("header", "true").load("file:///home/ltu/dps_csv/")
> val swig = sqlContext.read.format("com.databricks.spark.csv")
>   .option("header", "true").load("file:///home/ltu/swig_csv/")
>
> dps.count
> res0: Long = 42694
>
> swig.count
> res1: Long = 42034
>
> dps.registerTempTable("dps_pin_promo_lt")
> swig.registerTempTable("swig_pin_promo_lt")
>
> sqlContext.sql("select * from dps_pin_promo_lt where date > '2016-01-03'").count
> res4: Long = 42666
>
> sqlContext.sql("select * from swig_pin_promo_lt where date > '2016-01-03'").count
> res5: Long = 34131
>
> sqlContext.sql("select distinct date, account, ad from dps_pin_promo_lt where date > '2016-01-03'").count
> res6: Long = 42533
>
> sqlContext.sql("select distinct date, account, ad from swig_pin_promo_lt where date > '2016-01-03'").count
> res7: Long = 34131
>
> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> res9: Long = 23809
>
> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> res10: Long = 23809
>
> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> res11: Long = 23809
>
> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> res12: Long = 23809
>
> From my results above, the counts of distinct values on the join keys and
> filter criteria for each individual table are at res6 and res7. My question
> is: why is the outer join producing fewer rows than the smaller table? If
> there is no match, the outer join should still bring in that row. For the
> full and right outer joins I expect to see a minimum of res6 rows, but I
> get fewer. Is there something specific that I am missing here? I expect the
> full outer join to give me the union of the two table sets, so at least
> 42533 rows, not 23809.
>
> Gourav,
>
> I just ran this result set in a new session with slightly newer data and am
> still seeing those results.
>
> Thanks,
>
> KP
>
> On Mon, May 2, 2016 at 11:16 PM, Davies Liu <dav...@databricks.com> wrote:
>> As @Gourav said, all the joins with different join types show the same
>> results, which means that every row from the left matches at least one
>> row from the right, and every row from the right matches at least one
>> row from the left, even though the number of rows on the left does not
>> equal that on the right.
>>
>> This is a correct result.
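One standard SQL effect is worth noting for the queries above (this is a self-contained sketch using plain SQL via Python's sqlite3, not the original Spark session, and the tiny tables are made-up stand-ins): a WHERE predicate on a column of the null-extended side of an outer join evaluates to NULL for unmatched rows and discards them, so the outer join degenerates to an inner join. The queries above filter on both `s.date` and `d.date` in WHERE, which would make all four join types return the same count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE s (date TEXT, account TEXT)")
c.execute("CREATE TABLE d (date TEXT, account TEXT)")
# One row matches across the tables; one row exists only in s.
c.executemany("INSERT INTO s VALUES (?, ?)",
              [("2016-01-04", "a1"), ("2016-01-05", "only_in_s")])
c.execute("INSERT INTO d VALUES ('2016-01-04', 'a1')")

# LEFT OUTER JOIN alone keeps the unmatched s row (d columns are NULL).
outer = c.execute(
    "SELECT COUNT(*) FROM s LEFT OUTER JOIN d "
    "ON s.date = d.date AND s.account = d.account").fetchone()[0]

# A WHERE filter on d's column drops the NULL-extended row,
# so the outer join now behaves like an inner join.
filtered = c.execute(
    "SELECT COUNT(*) FROM s LEFT OUTER JOIN d "
    "ON s.date = d.date AND s.account = d.account "
    "WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").fetchone()[0]

inner = c.execute(
    "SELECT COUNT(*) FROM s INNER JOIN d "
    "ON s.date = d.date AND s.account = d.account "
    "WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").fetchone()[0]

print(outer, filtered, inner)  # 2 1 1
```

If this effect is what is happening in the Spark session, moving the date predicates into the ON clause, or pre-filtering each table before joining, would let the outer joins keep their unmatched rows.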
>> On Mon, May 2, 2016 at 2:13 PM, Kevin Peng <kpe...@gmail.com> wrote:
>> > Yong,
>> >
>> > Sorry, let me explain my deduction; it is going to be difficult to get
>> > sample data out since the dataset I am using is proprietary.
>> >
>> > In the queries above (the ones mentioned in earlier comments), both the
>> > inner and outer joins produce the same counts. They are basically
>> > pulling selected columns out of the query; there is no roll-up
>> > happening, or anything else that would possibly make it suspicious that
>> > there is any difference besides the join type. The tables are matched
>> > on three keys present in both tables (ad, account, and date), and on
>> > top of this they are filtered to dates after 2016-01-03. Since all the
>> > joins produce the same counts, the natural suspicion is that the tables
>> > are identical, but when I run the following two queries:
>> >
>> > scala> sqlContext.sql("select * from swig_pin_promo_lt where date >= '2016-01-03'").count
>> > res14: Long = 34158
>> >
>> > scala> sqlContext.sql("select * from dps_pin_promo_lt where date >= '2016-01-03'").count
>> > res15: Long = 42693
>> >
>> > The above two queries filter the data on the same date bound the joins
>> > use, 2016-01-03, and you can see that the row counts of the two tables
>> > differ. That is why I suspect something is wrong with the outer joins
>> > in Spark SQL: in this situation the right and full outer joins may
>> > produce the same results, but they should not be equal to the left
>> > join, and definitely not to the inner join; unless I am missing
>> > something.
>> > Side note: in my hasty response above I posted the wrong count for
>> > dps.count; the real value is res16: Long = 42694.
>> >
>> > Thanks,
>> >
>> > KP
>> >
>> > On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <java8...@hotmail.com> wrote:
>> >>
>> >> We are still not sure what the problem is if you cannot show us some
>> >> example data.
>> >>
>> >> For dps with 42632 rows and swig with 42034 rows: dps full outer join
>> >> with swig on 3 columns, with additional filters, gets the same result
>> >> set row count as dps left outer join with swig on 3 columns with the
>> >> additional filters, which again gets the same result set row count as
>> >> dps right outer join with swig on 3 columns with the same additional
>> >> filters.
>> >>
>> >> Without knowing your data, I cannot see a reason why this has to be a
>> >> bug in Spark.
>> >>
>> >> Am I misunderstanding your bug?
>> >>
>> >> Yong
>> >>
>> >> ________________________________
>> >> From: kpe...@gmail.com
>> >> Date: Mon, 2 May 2016 12:11:18 -0700
>> >> Subject: Re: Weird results with Spark SQL Outer joins
>> >> To: gourav.sengu...@gmail.com
>> >> CC: user@spark.apache.org
>> >>
>> >> Gourav,
>> >>
>> >> I wish that were the case, but I have done a select count on each of
>> >> the two tables individually and they return different numbers of rows:
>> >>
>> >> dps.registerTempTable("dps_pin_promo_lt")
>> >> swig.registerTempTable("swig_pin_promo_lt")
>> >>
>> >> dps.count()
>> >> RESULT: 42632
>> >>
>> >> swig.count()
>> >> RESULT: 42034
>> >>
>> >> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta
>> >> <gourav.sengu...@gmail.com> wrote:
>> >>
>> >> This shows that both the tables have matching records and no
>> >> mismatches. Therefore you obviously get the same results irrespective
>> >> of whether you use a right or left join.
>> >>
>> >> I think that there is no problem here, unless I am missing something.
>> >> Regards,
>> >> Gourav
>> >>
>> >> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kpe...@gmail.com> wrote:
>> >>
>> >> Also, the inner query produced the same result:
>> >>
>> >> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>> >> RESULT: 23747
>> >>
>> >> --
>> >> View this message in context:
>> >> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
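The difference the thread is circling around can be reproduced outside Spark. In the following sketch (plain SQL via Python's sqlite3, not the original Spark session; the table and column names are simplified stand-ins for `swig_pin_promo_lt` and `dps_pin_promo_lt`), pre-filtering each table in a subquery before an outer join keeps the unmatched rows, whereas applying the same filters after the join in WHERE removes the null-extended rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE swig (date TEXT, account TEXT, ad TEXT, spend REAL)")
c.execute("CREATE TABLE dps (date TEXT, account TEXT, ad TEXT, spend_in_dollar REAL)")
c.executemany("INSERT INTO swig VALUES (?, ?, ?, ?)", [
    ("2016-01-04", "acc1", "ad1", 1.0),   # matches a dps row
    ("2016-01-04", "acc2", "ad2", 2.0),   # no match in dps
])
c.executemany("INSERT INTO dps VALUES (?, ?, ?, ?)", [
    ("2016-01-04", "acc1", "ad1", 1.5),
    ("2016-01-04", "acc3", "ad3", 3.0),   # no match in swig
])

# Filter each side *before* joining: the unmatched swig row survives
# the outer join (its dps columns are just NULL).
prefiltered = c.execute("""
    SELECT COUNT(*)
    FROM (SELECT * FROM swig WHERE date >= '2016-01-03') s
    LEFT OUTER JOIN (SELECT * FROM dps WHERE date >= '2016-01-03') d
    ON s.date = d.date AND s.account = d.account AND s.ad = d.ad
""").fetchone()[0]

# Filter *after* joining in WHERE: d.date is NULL for the unmatched
# row, the predicate fails, and the row is dropped.
postfiltered = c.execute("""
    SELECT COUNT(*)
    FROM swig s LEFT OUTER JOIN dps d
    ON s.date = d.date AND s.account = d.account AND s.ad = d.ad
    WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'
""").fetchone()[0]

print(prefiltered, postfiltered)  # 2 1
```

Under the assumption that the Spark queries in this thread hit the same semantics, rewriting them to filter in subqueries (or in the ON clause) before the outer join would be the way to get the expected union-like row counts.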