Sorry, I didn't express myself clearly. I think the evaluation order doesn't matter in the context of the join implementation (sort-based or hash-based); it should only refer to the join key.

Thanks
Chang
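(For concreteness, a minimal sketch of the "evaluate the join key exactly once per row" idea at the SQL level. It reuses tbl1/tbl2 and rand(9) from the example quoted below; the join_key alias is invented for illustration. Each side materializes its non-deterministic key in a projection, so the join itself, sort-based or hash-based, only compares precomputed columns:)

SELECT a.col1
FROM (
  SELECT col1,
         CASE WHEN col2 IS NULL
              -- salt NULL keys; the large negative offset is meant to keep
              -- the salted values from colliding with real col2/col3 values
              THEN cast(rand(9) * 1000 - 9999999999 as string)
              ELSE col2
         END AS join_key
  FROM tbl1) a
LEFT OUTER JOIN (
  SELECT col3,
         CASE WHEN col3 IS NULL
              THEN cast(rand(9) * 1000 - 9999999999 as string)
              ELSE col3
         END AS join_key
  FROM tbl2) b
ON a.join_key = b.join_key;

(Under this rewrite the key is computed once in the Project, so however the physical join redistributes or reorders rows afterwards, each row keeps the single key value it was assigned.)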
On Tue, Jul 18, 2017 at 7:57 AM, Liang-Chi Hsieh <vii...@gmail.com> wrote:

> Evaluation order does matter. A non-deterministic expression can change its
> output due to internal state, which may depend on input order.
>
> MonotonicallyIncreasingID is an example of a stateful expression. Once you
> change the row order, the evaluation results are different.
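(A small illustration of that statefulness; the table name t is hypothetical, while monotonically_increasing_id() and rand(seed) are the built-ins discussed here:)

-- Both expressions are non-deterministic: the value a given row receives
-- depends on which partition the row lands in and on its position within
-- that partition. monotonically_increasing_id() packs the partition id into
-- the high bits of its result, and rand(9) seeds a separate per-partition
-- RNG with (seed + partition index), so repartitioning or reordering t can
-- assign different values to the same row.
SELECT *,
       monotonically_increasing_id() AS row_id,
       rand(9) AS salt
FROM t;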
>
> Chang Chen wrote
>> I see.
>>
>> Actually, it isn't about evaluation order, which the user can't specify;
>> it's about how many times we evaluate the non-deterministic expression
>> for the same row.
>>
>> For example, given the SQL:
>>
>> SELECT a.col1
>> FROM tbl1 a
>> LEFT OUTER JOIN tbl2 b
>> ON
>>   CASE WHEN a.col2 IS NULL THEN cast(rand(9)*1000 - 9999999999 as string)
>>        ELSE a.col2 END
>>   =
>>   CASE WHEN b.col3 IS NULL THEN cast(rand(9)*1000 - 9999999999 as string)
>>        ELSE b.col3 END;
>>
>> I think if we evaluate the join key exactly once for each row of a and b
>> in the whole pipeline, then even though the result isn't deterministic,
>> the computation is correct.
>>
>> Thanks
>> Chang
>>
>> On Mon, Jul 17, 2017 at 10:49 PM, Liang-Chi Hsieh <viirya@...> wrote:
>>
>>> IIUC, the evaluation order of rows in a Join can be different in
>>> different physical operators, e.g., sort-based and hash-based.
>>>
>>> But for non-deterministic expressions, different evaluation orders
>>> change the results.
>>>
>>> Chang Chen wrote
>>>> I see the issue. I will try https://github.com/apache/spark/pull/18652.
>>>> I think:
>>>>
>>>> 1. For the Join operator, the left and right plans can't be
>>>>    non-deterministic.
>>>> 2. If Filter can support non-deterministic expressions, why not the
>>>>    join condition? (See the sketch below.)
>>>> 3. We can't push down or project non-deterministic expressions, since
>>>>    that may change the semantics.
>>>>
>>>> Actually, the real problem is #2. If the join condition could be
>>>> non-deterministic, then we needn't insert a Project.
>>>>
>>>> Thanks
>>>> Chang
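(On point #2, a brief illustration of the asymmetry, again using tbl1/tbl2 from the example above: the first statement is accepted because Filter allows non-deterministic expressions, while the second fails analysis in Spark 2.2 with the exception quoted at the bottom of this thread:)

-- Accepted: non-deterministic expressions are allowed in Filter.
SELECT * FROM tbl1 WHERE rand(9) < 0.1;

-- Rejected by the analyzer: "nondeterministic expressions are only allowed
-- in Project, Filter, Aggregate or Window".
SELECT a.col1
FROM tbl1 a
LEFT OUTER JOIN tbl2 b
ON rand(9) < 0.1;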
>>>>
>>>> On Mon, Jul 17, 2017 at 3:59 PM, 蒋星博 <jiangxb1987@...> wrote:
>>>>
>>>>> FYI, there has been a related discussion here:
>>>>> https://github.com/apache/spark/pull/15417#discussion_r85295977
>>>>>
>>>>> 2017-07-17 15:44 GMT+08:00 Chang Chen <baibaichen@...>:
>>>>>
>>>>>> Hi All
>>>>>>
>>>>>> I don't understand the difference between the semantics. I found that
>>>>>> Spark does the same thing for non-deterministic GroupBy expressions,
>>>>>> and from a Map-Reduce point of view, Join is also a GroupBy in
>>>>>> essence.
>>>>>>
>>>>>> @Liang-Chi Hsieh: in which situations will the semantics change?
>>>>>>
>>>>>> Thanks
>>>>>> Chang
>>>>>>
>>>>>> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh <viirya@...> wrote:
>>>>>>
>>>>>>> Thinking about it more, I think it changes the semantics only under
>>>>>>> certain scenarios.
>>>>>>>
>>>>>>> For the example SQL query shown in the previous discussion, the
>>>>>>> semantics look the same.
>>>>>>>
>>>>>>> Xiao Li wrote
>>>>>>>> If the join condition is non-deterministic, pushing it down to the
>>>>>>>> underlying Project will change the semantics. Thus, we are unable
>>>>>>>> to do it in PullOutNondeterministic. Users can do it manually if
>>>>>>>> they do not care about the semantics difference.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Xiao
>>>>>>>>
>>>>>>>> 2017-07-16 20:07 GMT-07:00 Chang Chen <baibaichen@...>:
>>>>>>>>
>>>>>>>>> It is tedious, since we have lots of Hive SQL being migrated to
>>>>>>>>> Spark, and this workaround is equivalent to inserting a Project
>>>>>>>>> between the Join operator and its child.
>>>>>>>>>
>>>>>>>>> Why not do it in PullOutNondeterministic?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Chang
>>>>>>>>>
>>>>>>>>> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh <viirya@...>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> A possible workaround is to add the rand column into tbl1 with a
>>>>>>>>>> projection before the join:
>>>>>>>>>>
>>>>>>>>>> SELECT a.col1
>>>>>>>>>> FROM (
>>>>>>>>>>   SELECT col1,
>>>>>>>>>>     CASE
>>>>>>>>>>       WHEN col2 IS NULL
>>>>>>>>>>         THEN cast(rand(9)*1000 - 9999999999 as string)
>>>>>>>>>>       ELSE
>>>>>>>>>>         col2
>>>>>>>>>>     END AS col2
>>>>>>>>>>   FROM tbl1) a
>>>>>>>>>> LEFT OUTER JOIN tbl2 b
>>>>>>>>>> ON a.col2 = b.col3;
>>>>>>>>>>
>>>>>>>>>> Chang Chen wrote
>>>>>>>>>>> Hi Wenchen
>>>>>>>>>>>
>>>>>>>>>>> Yes, we also found that this error is caused by rand(). However,
>>>>>>>>>>> this is a classic way to solve data skew in Hive. Is there any
>>>>>>>>>>> equivalent way in Spark?
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Chang
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan <cloud0fan@...>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> It's not about CASE WHEN, but about rand(). Non-deterministic
>>>>>>>>>>>> expressions are not allowed in join conditions.
>>>>>>>>>>>>
>>>>>>>>>>>>> On 13 Jul 2017, at 6:43 PM, wangshuang <cn_wss@...> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm trying to execute Hive SQL on Spark SQL (also on the Spark
>>>>>>>>>>>>> Thrift server). To optimize for data skew, we use CASE WHEN to
>>>>>>>>>>>>> handle NULLs. A simple SQL example follows:
>>>>>>>>>>>>>
>>>>>>>>>>>>> SELECT a.col1
>>>>>>>>>>>>> FROM tbl1 a
>>>>>>>>>>>>> LEFT OUTER JOIN tbl2 b
>>>>>>>>>>>>> ON
>>>>>>>>>>>>>   CASE
>>>>>>>>>>>>>     WHEN a.col2 IS NULL
>>>>>>>>>>>>>       THEN cast(rand(9)*1000 - 9999999999 as string)
>>>>>>>>>>>>>     ELSE
>>>>>>>>>>>>>       a.col2
>>>>>>>>>>>>>   END = b.col3;
>>>>>>>>>>>>>
>>>>>>>>>>>>> But I get this error:
>>>>>>>>>>>>>
>>>>>>>>>>>>> == Physical Plan ==
>>>>>>>>>>>>> org.apache.spark.sql.AnalysisException: nondeterministic expressions are only allowed in
>>>>>>>>>>>>> Project, Filter, Aggregate or Window, found:
>>>>>>>>>>>>> (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS DOUBLE)) - CAST(9999999999L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt` END = c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND (c.`cur_flag` = 1))
>>>>>>>>>>>>> in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) && (cur_flag#77 = 1))
>>>>>>>>>>>>> ;;
>>>>>>>>>>>>> GlobalLimit 10
>>>>>>>>>>>>> +- LocalLimit 10
>>>>>>>>>>>>>    +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN (cast(19596 as string),cast(20134 as string),cast(10997 as string)) && nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END], [date_id#7]
>>>>>>>>>>>>>       +- Filter (date_id#7 = 2017-07-12)
>>>>>>>>>>>>>          +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) && (cur_flag#77 = 1))
>>>>>>>>>>>>>             :- SubqueryAlias a
>>>>>>>>>>>>>             :  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
>>>>>>>>>>>>>             :     +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7, chanl_id#8L, pltfm_id#9, city_id#10, sessn_id#11, gu_id#12, nav_refer_page_type_id#13, nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16, nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19, nav_page_value#20, nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25, nav_tcd#26, nav_tci#27, nav_tce#28, detl_refer_page_type_id#29, detl_refer_page_value#30, ... 33 more fields]
>>>>>>>>>>>>>             +- SubqueryAlias c
>>>>>>>>>>>>>                +- SubqueryAlias dim_site_categ_ext
>>>>>>>>>>>>>                   +- CatalogRelation `dw`.`dim_site_categ_ext`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [site_categ_skid#64L, site_categ_type#65, site_categ_code#66, site_categ_name#67, site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L, sort_seq#71L, site_categ_srch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L, updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L, site_categ_id#80, site_categ_parnt_id#81]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does Spark SQL not support the CASE WHEN syntax in a JOIN?
>>>>>>>>>>>>> Additionally, my Spark version is 2.2.0.
>>>>>>>>>>>>> Any help would be greatly appreciated.