Hi Kevin,

Having given it a first look, I do think that you have hit on something here, and this does not look quite right. I need to look at the multiple AND conditions in the ON clause and see whether they are causing any issues.
Regards,
Gourav Sengupta

On Tue, May 3, 2016 at 8:28 AM, Kevin Peng <kpe...@gmail.com> wrote:
> Davies,
>
> Here is the code that I am typing into the spark-shell along with the
> results (my question is at the bottom):
>
> val dps = sqlContext.read.format("com.databricks.spark.csv")
>   .option("header", "true").load("file:///home/ltu/dps_csv/")
> val swig = sqlContext.read.format("com.databricks.spark.csv")
>   .option("header", "true").load("file:///home/ltu/swig_csv/")
>
> dps.count
> res0: Long = 42694
>
> swig.count
> res1: Long = 42034
>
> dps.registerTempTable("dps_pin_promo_lt")
> swig.registerTempTable("swig_pin_promo_lt")
>
> sqlContext.sql("select * from dps_pin_promo_lt where date > '2016-01-03'").count
> res4: Long = 42666
>
> sqlContext.sql("select * from swig_pin_promo_lt where date > '2016-01-03'").count
> res5: Long = 34131
>
> sqlContext.sql("select distinct date, account, ad from dps_pin_promo_lt where date > '2016-01-03'").count
> res6: Long = 42533
>
> sqlContext.sql("select distinct date, account, ad from swig_pin_promo_lt where date > '2016-01-03'").count
> res7: Long = 34131
>
> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> res9: Long = 23809
>
> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> res10: Long = 23809
>
> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> res11: Long = 23809
>
> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> res12: Long = 23809
>
> From my results above, the counts of distinct values on the join keys and
> filter criteria for each individual table are at res6 and res7. My question
> is: why is the outer join producing fewer rows than the smaller table? If
> there is no match, the outer join should still bring in that row. For the
> full and right outer joins I expect to see a minimum of res6 rows, but I
> get fewer. Is there something specific that I am missing here? I expect the
> full outer join to give me the union of the two table sets, so at least
> 42533 rows, not 23809.
>
> Gourav,
>
> I just ran this result set in a new session with slightly newer data and am
> still seeing those results.
>
> Thanks,
>
> KP
>
> On Mon, May 2, 2016 at 11:16 PM, Davies Liu <dav...@databricks.com> wrote:
>> As @Gourav said, all the joins with different join types show the same
>> results, which means that every row from the left matches at least one
>> row from the right, and every row from the right matches at least one
>> row from the left, even though the number of rows on the left does not
>> equal that on the right.
>>
>> This is a correct result.
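One standard SQL effect is worth noting for the queries above (this is a self-contained sketch using plain SQL via Python's sqlite3, not the original Spark session, and the tiny tables are made-up stand-ins): a WHERE predicate on a column of the null-extended side of an outer join evaluates to NULL for unmatched rows and discards them, so the outer join degenerates to an inner join. The queries above filter on both `s.date` and `d.date` in WHERE, which would make all four join types return the same count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE s (date TEXT, account TEXT)")
c.execute("CREATE TABLE d (date TEXT, account TEXT)")
# One row matches across the tables; one row exists only in s.
c.executemany("INSERT INTO s VALUES (?, ?)",
              [("2016-01-04", "a1"), ("2016-01-05", "only_in_s")])
c.execute("INSERT INTO d VALUES ('2016-01-04', 'a1')")

# LEFT OUTER JOIN alone keeps the unmatched s row (d columns are NULL).
outer = c.execute(
    "SELECT COUNT(*) FROM s LEFT OUTER JOIN d "
    "ON s.date = d.date AND s.account = d.account").fetchone()[0]

# A WHERE filter on d's column drops the NULL-extended row,
# so the outer join now behaves like an inner join.
filtered = c.execute(
    "SELECT COUNT(*) FROM s LEFT OUTER JOIN d "
    "ON s.date = d.date AND s.account = d.account "
    "WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").fetchone()[0]

inner = c.execute(
    "SELECT COUNT(*) FROM s INNER JOIN d "
    "ON s.date = d.date AND s.account = d.account "
    "WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").fetchone()[0]

print(outer, filtered, inner)  # 2 1 1
```

If this effect is what is happening in the Spark session, moving the date predicates into the ON clause, or pre-filtering each table before joining, would let the outer joins keep their unmatched rows.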
>> On Mon, May 2, 2016 at 2:13 PM, Kevin Peng <kpe...@gmail.com> wrote:
>> > Yong,
>> >
>> > Sorry, let me explain my deduction; it is going to be difficult to get
>> > sample data out since the dataset I am using is proprietary.
>> >
>> > In the queries above (the ones mentioned in earlier comments), both the
>> > inner and outer joins produce the same counts. They are basically
>> > pulling selected columns out of the query; there is no roll-up
>> > happening, or anything else that would possibly make it suspicious that
>> > there is any difference besides the join type. The tables are matched
>> > on three keys present in both tables (ad, account, and date), and on
>> > top of this they are filtered to dates after 2016-01-03. Since all the
>> > joins produce the same counts, the natural suspicion is that the tables
>> > are identical, but when I run the following two queries:
>> >
>> > scala> sqlContext.sql("select * from swig_pin_promo_lt where date >= '2016-01-03'").count
>> > res14: Long = 34158
>> >
>> > scala> sqlContext.sql("select * from dps_pin_promo_lt where date >= '2016-01-03'").count
>> > res15: Long = 42693
>> >
>> > The above two queries filter the data on the same date bound the joins
>> > use, 2016-01-03, and you can see that the row counts of the two tables
>> > differ. That is why I suspect something is wrong with the outer joins
>> > in Spark SQL: in this situation the right and full outer joins may
>> > produce the same results, but they should not be equal to the left
>> > join, and definitely not to the inner join; unless I am missing
>> > something.
>> > Side note: in my hasty response above I posted the wrong count for
>> > dps.count; the real value is res16: Long = 42694.
>> >
>> > Thanks,
>> >
>> > KP
>> >
>> > On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <java8...@hotmail.com> wrote:
>> >>
>> >> We are still not sure what the problem is if you cannot show us some
>> >> example data.
>> >>
>> >> For dps with 42632 rows and swig with 42034 rows: dps full outer join
>> >> with swig on 3 columns, with additional filters, gets the same result
>> >> set row count as dps left outer join with swig on 3 columns with the
>> >> additional filters, which again gets the same result set row count as
>> >> dps right outer join with swig on 3 columns with the same additional
>> >> filters.
>> >>
>> >> Without knowing your data, I cannot see a reason why this has to be a
>> >> bug in Spark.
>> >>
>> >> Am I misunderstanding your bug?
>> >>
>> >> Yong
>> >>
>> >> ________________________________
>> >> From: kpe...@gmail.com
>> >> Date: Mon, 2 May 2016 12:11:18 -0700
>> >> Subject: Re: Weird results with Spark SQL Outer joins
>> >> To: gourav.sengu...@gmail.com
>> >> CC: user@spark.apache.org
>> >>
>> >> Gourav,
>> >>
>> >> I wish that were the case, but I have done a select count on each of
>> >> the two tables individually and they return different numbers of rows:
>> >>
>> >> dps.registerTempTable("dps_pin_promo_lt")
>> >> swig.registerTempTable("swig_pin_promo_lt")
>> >>
>> >> dps.count()
>> >> RESULT: 42632
>> >>
>> >> swig.count()
>> >> RESULT: 42034
>> >>
>> >> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta
>> >> <gourav.sengu...@gmail.com> wrote:
>> >>
>> >> This shows that both the tables have matching records and no
>> >> mismatches. Therefore you obviously get the same results irrespective
>> >> of whether you use a right or left join.
>> >>
>> >> I think that there is no problem here, unless I am missing something.
>> >> Regards,
>> >> Gourav
>> >>
>> >> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kpe...@gmail.com> wrote:
>> >>
>> >> Also, the inner query produced the same result:
>> >>
>> >> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>> >> RESULT: 23747
>> >>
>> >> --
>> >> View this message in context:
>> >> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
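The difference the thread is circling around can be reproduced outside Spark. In the following sketch (plain SQL via Python's sqlite3, not the original Spark session; the table and column names are simplified stand-ins for `swig_pin_promo_lt` and `dps_pin_promo_lt`), pre-filtering each table in a subquery before an outer join keeps the unmatched rows, whereas applying the same filters after the join in WHERE removes the null-extended rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE swig (date TEXT, account TEXT, ad TEXT, spend REAL)")
c.execute("CREATE TABLE dps (date TEXT, account TEXT, ad TEXT, spend_in_dollar REAL)")
c.executemany("INSERT INTO swig VALUES (?, ?, ?, ?)", [
    ("2016-01-04", "acc1", "ad1", 1.0),   # matches a dps row
    ("2016-01-04", "acc2", "ad2", 2.0),   # no match in dps
])
c.executemany("INSERT INTO dps VALUES (?, ?, ?, ?)", [
    ("2016-01-04", "acc1", "ad1", 1.5),
    ("2016-01-04", "acc3", "ad3", 3.0),   # no match in swig
])

# Filter each side *before* joining: the unmatched swig row survives
# the outer join (its dps columns are just NULL).
prefiltered = c.execute("""
    SELECT COUNT(*)
    FROM (SELECT * FROM swig WHERE date >= '2016-01-03') s
    LEFT OUTER JOIN (SELECT * FROM dps WHERE date >= '2016-01-03') d
    ON s.date = d.date AND s.account = d.account AND s.ad = d.ad
""").fetchone()[0]

# Filter *after* joining in WHERE: d.date is NULL for the unmatched
# row, the predicate fails, and the row is dropped.
postfiltered = c.execute("""
    SELECT COUNT(*)
    FROM swig s LEFT OUTER JOIN dps d
    ON s.date = d.date AND s.account = d.account AND s.ad = d.ad
    WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'
""").fetchone()[0]

print(prefiltered, postfiltered)  # 2 1
```

Under the assumption that the Spark queries in this thread hit the same semantics, rewriting them to filter in subqueries (or in the ON clause) before the outer join would be the way to get the expected union-like row counts.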