Sorry, I didn't express myself clearly. I think the evaluation order doesn't matter in the context of the join implementation (sort-based or hash-based); it should only refer to the join key.

Thanks
Chang
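(For concreteness, a minimal sketch of the "evaluate the join key exactly once per row" idea at the SQL level. It reuses tbl1/tbl2 and rand(9) from the example quoted below; the join_key alias is invented for illustration. Each side materializes its non-deterministic key in a projection, so the join itself, sort-based or hash-based, only compares precomputed columns:)

SELECT a.col1
FROM (
  SELECT col1,
         CASE WHEN col2 IS NULL
              -- salt NULL keys; the large negative offset is meant to keep
              -- the salted values from colliding with real col2/col3 values
              THEN cast(rand(9) * 1000 - 9999999999 as string)
              ELSE col2
         END AS join_key
  FROM tbl1) a
LEFT OUTER JOIN (
  SELECT col3,
         CASE WHEN col3 IS NULL
              THEN cast(rand(9) * 1000 - 9999999999 as string)
              ELSE col3
         END AS join_key
  FROM tbl2) b
ON a.join_key = b.join_key;

(Under this rewrite the key is computed once in the Project, so however the physical join redistributes or reorders rows afterwards, each row keeps the single key value it was assigned.)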
On Tue, Jul 18, 2017 at 7:57 AM, Liang-Chi Hsieh <vii...@gmail.com> wrote:

> Evaluation order does matter. A non-deterministic expression can change its
> output due to internal state, which may depend on input order.
>
> MonotonicallyIncreasingID is an example of a stateful expression. Once you
> change the row order, the evaluation results are different.
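(A small illustration of that statefulness; the table name t is hypothetical, while monotonically_increasing_id() and rand(seed) are the built-ins discussed here:)

-- Both expressions are non-deterministic: the value a given row receives
-- depends on which partition the row lands in and on its position within
-- that partition. monotonically_increasing_id() packs the partition id into
-- the high bits of its result, and rand(9) seeds a separate per-partition
-- RNG with (seed + partition index), so repartitioning or reordering t can
-- assign different values to the same row.
SELECT *,
       monotonically_increasing_id() AS row_id,
       rand(9) AS salt
FROM t;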
>
> Chang Chen wrote
>> I see.
>>
>> Actually, it isn't about evaluation order, which the user can't specify;
>> it's about how many times we evaluate the non-deterministic expression
>> for the same row.
>>
>> For example, given the SQL:
>>
>> SELECT a.col1
>> FROM tbl1 a
>> LEFT OUTER JOIN tbl2 b
>> ON
>>   CASE WHEN a.col2 IS NULL THEN cast(rand(9)*1000 - 9999999999 as string)
>>        ELSE a.col2 END
>>   =
>>   CASE WHEN b.col3 IS NULL THEN cast(rand(9)*1000 - 9999999999 as string)
>>        ELSE b.col3 END;
>>
>> I think if we evaluate the join key exactly once for each row of a and b
>> in the whole pipeline, then even though the result isn't deterministic,
>> the computation is correct.
>>
>> Thanks
>> Chang
>>
>> On Mon, Jul 17, 2017 at 10:49 PM, Liang-Chi Hsieh <viirya@...> wrote:
>>
>>> IIUC, the evaluation order of rows in a Join can be different in
>>> different physical operators, e.g., sort-based and hash-based.
>>>
>>> But for non-deterministic expressions, different evaluation orders
>>> change the results.
>>>
>>> Chang Chen wrote
>>>> I see the issue. I will try https://github.com/apache/spark/pull/18652.
>>>> I think:
>>>>
>>>> 1. For the Join operator, the left and right plans can't be
>>>>    non-deterministic.
>>>> 2. If Filter can support non-deterministic expressions, why not the
>>>>    join condition? (See the sketch below.)
>>>> 3. We can't push down or project non-deterministic expressions, since
>>>>    that may change the semantics.
>>>>
>>>> Actually, the real problem is #2. If the join condition could be
>>>> non-deterministic, then we needn't insert a Project.
>>>>
>>>> Thanks
>>>> Chang
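(On point #2, a brief illustration of the asymmetry, again using tbl1/tbl2 from the example above: the first statement is accepted because Filter allows non-deterministic expressions, while the second fails analysis in Spark 2.2 with the exception quoted at the bottom of this thread:)

-- Accepted: non-deterministic expressions are allowed in Filter.
SELECT * FROM tbl1 WHERE rand(9) < 0.1;

-- Rejected by the analyzer: "nondeterministic expressions are only allowed
-- in Project, Filter, Aggregate or Window".
SELECT a.col1
FROM tbl1 a
LEFT OUTER JOIN tbl2 b
ON rand(9) < 0.1;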
>>>>
>>>> On Mon, Jul 17, 2017 at 3:59 PM, 蒋星博 <jiangxb1987@...> wrote:
>>>>
>>>>> FYI, there has been a related discussion here:
>>>>> https://github.com/apache/spark/pull/15417#discussion_r85295977
>>>>>
>>>>> 2017-07-17 15:44 GMT+08:00 Chang Chen <baibaichen@...>:
>>>>>
>>>>>> Hi All
>>>>>>
>>>>>> I don't understand the difference between the semantics. I found that
>>>>>> Spark does the same thing for non-deterministic GroupBy expressions,
>>>>>> and from a Map-Reduce point of view, Join is also a GroupBy in
>>>>>> essence.
>>>>>>
>>>>>> @Liang-Chi Hsieh: in which situations will the semantics change?
>>>>>>
>>>>>> Thanks
>>>>>> Chang
>>>>>>
>>>>>> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh <viirya@...> wrote:
>>>>>>
>>>>>>> Thinking about it more, I think it changes the semantics only under
>>>>>>> certain scenarios.
>>>>>>>
>>>>>>> For the example SQL query shown in the previous discussion, the
>>>>>>> semantics look the same.
>>>>>>>
>>>>>>> Xiao Li wrote
>>>>>>>> If the join condition is non-deterministic, pushing it down to the
>>>>>>>> underlying Project will change the semantics. Thus, we are unable
>>>>>>>> to do it in PullOutNondeterministic. Users can do it manually if
>>>>>>>> they do not care about the semantics difference.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Xiao
>>>>>>>>
>>>>>>>> 2017-07-16 20:07 GMT-07:00 Chang Chen <baibaichen@...>:
>>>>>>>>
>>>>>>>>> It is tedious, since we have lots of Hive SQL being migrated to
>>>>>>>>> Spark, and this workaround is equivalent to inserting a Project
>>>>>>>>> between the Join operator and its child.
>>>>>>>>>
>>>>>>>>> Why not do it in PullOutNondeterministic?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Chang
>>>>>>>>>
>>>>>>>>> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh <viirya@...>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> A possible workaround is to add the rand column into tbl1 with a
>>>>>>>>>> projection before the join:
>>>>>>>>>>
>>>>>>>>>> SELECT a.col1
>>>>>>>>>> FROM (
>>>>>>>>>>   SELECT col1,
>>>>>>>>>>     CASE
>>>>>>>>>>       WHEN col2 IS NULL
>>>>>>>>>>         THEN cast(rand(9)*1000 - 9999999999 as string)
>>>>>>>>>>       ELSE
>>>>>>>>>>         col2
>>>>>>>>>>     END AS col2
>>>>>>>>>>   FROM tbl1) a
>>>>>>>>>> LEFT OUTER JOIN tbl2 b
>>>>>>>>>> ON a.col2 = b.col3;
>>>>>>>>>>
>>>>>>>>>> Chang Chen wrote
>>>>>>>>>>> Hi Wenchen
>>>>>>>>>>>
>>>>>>>>>>> Yes, we also found that this error is caused by rand(). However,
>>>>>>>>>>> this is a classic way to solve data skew in Hive. Is there any
>>>>>>>>>>> equivalent way in Spark?
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Chang
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan <cloud0fan@...>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> It's not about CASE WHEN, but about rand(). Non-deterministic
>>>>>>>>>>>> expressions are not allowed in join conditions.
>>>>>>>>>>>>
>>>>>>>>>>>>> On 13 Jul 2017, at 6:43 PM, wangshuang <cn_wss@...> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm trying to execute Hive SQL on Spark SQL (also on the Spark
>>>>>>>>>>>>> Thrift server). To optimize for data skew, we use CASE WHEN to
>>>>>>>>>>>>> handle NULLs. A simple SQL example follows:
>>>>>>>>>>>>>
>>>>>>>>>>>>> SELECT a.col1
>>>>>>>>>>>>> FROM tbl1 a
>>>>>>>>>>>>> LEFT OUTER JOIN tbl2 b
>>>>>>>>>>>>> ON
>>>>>>>>>>>>>   CASE
>>>>>>>>>>>>>     WHEN a.col2 IS NULL
>>>>>>>>>>>>>       THEN cast(rand(9)*1000 - 9999999999 as string)
>>>>>>>>>>>>>     ELSE
>>>>>>>>>>>>>       a.col2
>>>>>>>>>>>>>   END = b.col3;
>>>>>>>>>>>>>
>>>>>>>>>>>>> But I get this error:
>>>>>>>>>>>>>
>>>>>>>>>>>>> == Physical Plan ==
>>>>>>>>>>>>> org.apache.spark.sql.AnalysisException: nondeterministic expressions are only allowed in
>>>>>>>>>>>>> Project, Filter, Aggregate or Window, found:
>>>>>>>>>>>>> (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS DOUBLE)) - CAST(9999999999L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt` END = c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND (c.`cur_flag` = 1))
>>>>>>>>>>>>> in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) && (cur_flag#77 = 1))
>>>>>>>>>>>>> ;;
>>>>>>>>>>>>> GlobalLimit 10
>>>>>>>>>>>>> +- LocalLimit 10
>>>>>>>>>>>>>    +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN (cast(19596 as string),cast(20134 as string),cast(10997 as string)) && nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END], [date_id#7]
>>>>>>>>>>>>>       +- Filter (date_id#7 = 2017-07-12)
>>>>>>>>>>>>>          +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) && (cur_flag#77 = 1))
>>>>>>>>>>>>>             :- SubqueryAlias a
>>>>>>>>>>>>>             :  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
>>>>>>>>>>>>>             :     +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7, chanl_id#8L, pltfm_id#9, city_id#10, sessn_id#11, gu_id#12, nav_refer_page_type_id#13, nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16, nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19, nav_page_value#20, nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25, nav_tcd#26, nav_tci#27, nav_tce#28, detl_refer_page_type_id#29, detl_refer_page_value#30, ... 33 more fields]
>>>>>>>>>>>>>             +- SubqueryAlias c
>>>>>>>>>>>>>                +- SubqueryAlias dim_site_categ_ext
>>>>>>>>>>>>>                   +- CatalogRelation `dw`.`dim_site_categ_ext`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [site_categ_skid#64L, site_categ_type#65, site_categ_code#66, site_categ_name#67, site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L, sort_seq#71L, site_categ_srch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L, updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L, site_categ_id#80, site_categ_parnt_id#81]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does Spark SQL not support the CASE WHEN syntax in a JOIN?
>>>>>>>>>>>>> Additionally, my Spark version is 2.2.0.
>>>>>>>>>>>>> Any help would be greatly appreciated.