Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Chang Chen
Sorry, I didn't express clearly.  I think the evaluation order doesn't
matter in the context of join implementation(sort or hash based). it should
only refer to join key.


Thanks
Chang

On Tue, Jul 18, 2017 at 7:57 AM, Liang-Chi Hsieh  wrote:

>
> Evaluation order does matter. A non-deterministic expression can change its
> output due to internal state which may depend on input order.
>
> MonotonicallyIncreasingID is an example for the stateful expression. Once
> you change the row order, the evaluation results are different.
>
>
>
> Chang Chen wrote
> > I see.
> >
> > Actually, it isn't about evaluation order which user can't specify. It's
> > about how many times we evaluate the non-deterministic expression for the
> > same row.
> >
> > For example, given the SQL:
> >
> > SELECT a.col1
> > FROM tbl1 a
> > LEFT OUTER JOIN tbl2 b
> > ON
> >  CASE WHEN a.col2 IS NULL TNEN cast(rand(9)*1000 - 99 as string)
> > ELSE a.col2 END
> > =
> >  CASE WHEN b.col3 IS NULL TNEN cast(rand(9)*1000 - 99 as string)
> > ELSE b.col3 END;
> >
> > I think if we exactly evaluate   join key one time for each row of a and
> b
> > in the whole pipeline, even if the result isn't deterministic, but the
> > computation is correct.
> >
> > Thanks
> > Chang
> >
> >
> > On Mon, Jul 17, 2017 at 10:49 PM, Liang-Chi Hsieh 
>
> > viirya@
>
> >  wrote:
> >
> >>
> >> IIUC, the evaluation order of rows in Join can be different in different
> >> physical operators, e.g., Sort-based and Hash-based.
> >>
> >> But for non-deterministic expressions, different evaluation orders
> change
> >> results.
> >>
> >>
> >>
> >> Chang Chen wrote
> >> > I see the issue. I will try https://github.com/apache/
> spark/pull/18652,
> >> I
> >> > think
> >> >
> >> > 1 For Join Operator, the left and right plan can't be
> >> non-deterministic.
> >> > 2 If  Filter can support non-deterministic, why not join condition?
> >> > 3 We can't push down or project non-deterministic expression, since it
> >> may
> >> > change semantics.
> >> >
> >> > Actually, the real problem is #2. If the join condition could be
> >> > non-deterministic, then we needn't insert project.
> >> >
> >> > Thanks
> >> > Chang
> >> >
> >> >
> >> >
> >> >
> >> > On Mon, Jul 17, 2017 at 3:59 PM, 蒋星博 
> >>
> >> > jiangxb1987@
> >>
> >> >  wrote:
> >> >
> >> >> FYI there have been a related discussion here:
> >> https://github.com/apache/
> >> >> spark/pull/15417#discussion_r85295977
> >> >>
> >> >> 2017-07-17 15:44 GMT+08:00 Chang Chen 
> >>
> >> > baibaichen@
> >>
> >> > :
> >> >>
> >> >>> Hi All
> >> >>>
> >> >>> I don't understand the difference between the semantics, I found
> >> Spark
> >> >>> does the same thing for GroupBy non-deterministic. From Map-Reduce
> >> point
> >> >>> of
> >> >>> view, Join is also GroupBy in essence .
> >> >>>
> >> >>> @Liang Chi Hsieh
> >> >>> https://plus.google.com/u/0/103179362592085650735?prsrc=4;
> >> >>>
> >> >>> in which situation,  semantics  will be changed?
> >> >>>
> >> >>> Thanks
> >> >>> Chang
> >> >>>
> >> >>> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh 
> >>
> >> > viirya@
> >>
> >> > 
> >> >>> wrote:
> >> >>>
> >> 
> >>  Thinking about it more, I think it changes the semantics only under
> >>  certain
> >>  scenarios.
> >> 
> >>  For the example SQL query shown in previous discussion, it looks
> the
> >>  same
> >>  semantics.
> >> 
> >> 
> >>  Xiao Li wrote
> >>  > If the join condition is non-deterministic, pushing it down to
> the
> >>  > underlying project will change the semantics. Thus, we are unable
> >> to
> >>  do it
> >>  > in PullOutNondeterministic. Users can do it manually if they do
> >> not
> >>  care
> >>  > the semantics difference.
> >>  >
> >>  > Thanks,
> >>  >
> >>  > Xiao
> >>  >
> >>  >
> >>  >
> >>  > 2017-07-16 20:07 GMT-07:00 Chang Chen 
> >> 
> >>  > baibaichen@
> >> 
> >>  > :
> >>  >
> >>  >> It is tedious since we have lots of Hive SQL being migrated to
> >>  Spark.
> >>  >> And
> >>  >> this workaround is equivalent  to insert a Project between Join
> >>  operator
> >>  >> and its child.
> >>  >>
> >>  >> Why not do it in PullOutNondeterministic?
> >>  >>
> >>  >> Thanks
> >>  >> Chang
> >>  >>
> >>  >>
> >>  >> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh 
> >> 
> >>  > viirya@
> >> 
> >>  >  wrote:
> >>  >>
> >>  >>>
> >>  >>> A possible workaround is to add the rand column into tbl1 with
> a
> >>  >>> projection
> >>  >>> before the join.
> >>  >>>
> >>  >>> SELECT a.col1
> >>  >>> FROM (
> >>  >>>   SELECT col1,
> >>  >>> CASE
> >>  >>>  WHEN col2 IS NULL
> >>  >>>THEN cast(rand(9)*1000 - 99 as string)
> >>  >>>  ELSE
> >>  >>>col2
> >>  >>> END AS col2
> >>  

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh

Evaluation order does matter. A non-deterministic expression can change its
output due to internal state which may depend on input order.

MonotonicallyIncreasingID is an example for the stateful expression. Once
you change the row order, the evaluation results are different.



Chang Chen wrote
> I see.
> 
> Actually, it isn't about evaluation order which user can't specify. It's
> about how many times we evaluate the non-deterministic expression for the
> same row.
> 
> For example, given the SQL:
> 
> SELECT a.col1
> FROM tbl1 a
> LEFT OUTER JOIN tbl2 b
> ON
>  CASE WHEN a.col2 IS NULL TNEN cast(rand(9)*1000 - 99 as string)
> ELSE a.col2 END
> =
>  CASE WHEN b.col3 IS NULL TNEN cast(rand(9)*1000 - 99 as string)
> ELSE b.col3 END;
> 
> I think if we exactly evaluate   join key one time for each row of a and b
> in the whole pipeline, even if the result isn't deterministic, but the
> computation is correct.
> 
> Thanks
> Chang
> 
> 
> On Mon, Jul 17, 2017 at 10:49 PM, Liang-Chi Hsieh 

> viirya@

>  wrote:
> 
>>
>> IIUC, the evaluation order of rows in Join can be different in different
>> physical operators, e.g., Sort-based and Hash-based.
>>
>> But for non-deterministic expressions, different evaluation orders change
>> results.
>>
>>
>>
>> Chang Chen wrote
>> > I see the issue. I will try https://github.com/apache/spark/pull/18652,
>> I
>> > think
>> >
>> > 1 For Join Operator, the left and right plan can't be
>> non-deterministic.
>> > 2 If  Filter can support non-deterministic, why not join condition?
>> > 3 We can't push down or project non-deterministic expression, since it
>> may
>> > change semantics.
>> >
>> > Actually, the real problem is #2. If the join condition could be
>> > non-deterministic, then we needn't insert project.
>> >
>> > Thanks
>> > Chang
>> >
>> >
>> >
>> >
>> > On Mon, Jul 17, 2017 at 3:59 PM, 蒋星博 
>>
>> > jiangxb1987@
>>
>> >  wrote:
>> >
>> >> FYI there have been a related discussion here:
>> https://github.com/apache/
>> >> spark/pull/15417#discussion_r85295977
>> >>
>> >> 2017-07-17 15:44 GMT+08:00 Chang Chen 
>>
>> > baibaichen@
>>
>> > :
>> >>
>> >>> Hi All
>> >>>
>> >>> I don't understand the difference between the semantics, I found
>> Spark
>> >>> does the same thing for GroupBy non-deterministic. From Map-Reduce
>> point
>> >>> of
>> >>> view, Join is also GroupBy in essence .
>> >>>
>> >>> @Liang Chi Hsieh
>> >>> https://plus.google.com/u/0/103179362592085650735?prsrc=4;
>> >>>
>> >>> in which situation,  semantics  will be changed?
>> >>>
>> >>> Thanks
>> >>> Chang
>> >>>
>> >>> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh 
>>
>> > viirya@
>>
>> > 
>> >>> wrote:
>> >>>
>> 
>>  Thinking about it more, I think it changes the semantics only under
>>  certain
>>  scenarios.
>> 
>>  For the example SQL query shown in previous discussion, it looks the
>>  same
>>  semantics.
>> 
>> 
>>  Xiao Li wrote
>>  > If the join condition is non-deterministic, pushing it down to the
>>  > underlying project will change the semantics. Thus, we are unable
>> to
>>  do it
>>  > in PullOutNondeterministic. Users can do it manually if they do
>> not
>>  care
>>  > the semantics difference.
>>  >
>>  > Thanks,
>>  >
>>  > Xiao
>>  >
>>  >
>>  >
>>  > 2017-07-16 20:07 GMT-07:00 Chang Chen 
>> 
>>  > baibaichen@
>> 
>>  > :
>>  >
>>  >> It is tedious since we have lots of Hive SQL being migrated to
>>  Spark.
>>  >> And
>>  >> this workaround is equivalent  to insert a Project between Join
>>  operator
>>  >> and its child.
>>  >>
>>  >> Why not do it in PullOutNondeterministic?
>>  >>
>>  >> Thanks
>>  >> Chang
>>  >>
>>  >>
>>  >> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh 
>> 
>>  > viirya@
>> 
>>  >  wrote:
>>  >>
>>  >>>
>>  >>> A possible workaround is to add the rand column into tbl1 with a
>>  >>> projection
>>  >>> before the join.
>>  >>>
>>  >>> SELECT a.col1
>>  >>> FROM (
>>  >>>   SELECT col1,
>>  >>> CASE
>>  >>>  WHEN col2 IS NULL
>>  >>>THEN cast(rand(9)*1000 - 99 as string)
>>  >>>  ELSE
>>  >>>col2
>>  >>> END AS col2
>>  >>> FROM tbl1) a
>>  >>> LEFT OUTER JOIN tbl2 b
>>  >>> ON a.col2 = b.col3;
>>  >>>
>>  >>>
>>  >>>
>>  >>> Chang Chen wrote
>>  >>> > Hi Wenchen
>>  >>> >
>>  >>> > Yes. We also find this error is caused by Rand. However, this
>> is
>>  >>> classic
>>  >>> > way to solve data skew in Hive.  Is there any equivalent way
>> in
>>  Spark?
>>  >>> >
>>  >>> > Thanks
>>  >>> > Chang
>>  >>> >
>>  >>> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan 
>>  >>>
>>  >>> > cloud0fan@
>>  >>>
>>  >>> >  wrote:
>>  >>> >
>>  >>> 

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Xiao Li
When users call rand(seed) with a specific seed number, users expect the
results should be deterministic no matter whether this is pushed down or
not. rand(seed) is stateful. Thus, the order of predicates in the same join
condition even matters. For example, in the same join condition, if the
first predicate is false and the second one is skipped. If we simply push
it down to the child, the number of rand(seed) calls is different.

Thanks,

Xiao


2017-07-17 9:28 GMT-07:00 Chang Chen :

> I see.
>
> Actually, it isn't about evaluation order which user can't specify. It's
> about how many times we evaluate the non-deterministic expression for the
> same row.
>
> For example, given the SQL:
>
> SELECT a.col1
> FROM tbl1 a
> LEFT OUTER JOIN tbl2 b
> ON
>  CASE WHEN a.col2 IS NULL TNEN cast(rand(9)*1000 - 99 as string)
> ELSE a.col2 END
> =
>  CASE WHEN b.col3 IS NULL TNEN cast(rand(9)*1000 - 99 as string)
> ELSE b.col3 END;
>
> I think if we exactly evaluate   join key one time for each row of a and
> b in the whole pipeline, even if the result isn't deterministic, but the
> computation is correct.
>
> Thanks
> Chang
>
>
> On Mon, Jul 17, 2017 at 10:49 PM, Liang-Chi Hsieh 
> wrote:
>
>>
>> IIUC, the evaluation order of rows in Join can be different in different
>> physical operators, e.g., Sort-based and Hash-based.
>>
>> But for non-deterministic expressions, different evaluation orders change
>> results.
>>
>>
>>
>> Chang Chen wrote
>> > I see the issue. I will try https://github.com/apache/spark/pull/18652,
>> I
>> > think
>> >
>> > 1 For Join Operator, the left and right plan can't be non-deterministic.
>> > 2 If  Filter can support non-deterministic, why not join condition?
>> > 3 We can't push down or project non-deterministic expression, since it
>> may
>> > change semantics.
>> >
>> > Actually, the real problem is #2. If the join condition could be
>> > non-deterministic, then we needn't insert project.
>> >
>> > Thanks
>> > Chang
>> >
>> >
>> >
>> >
>> > On Mon, Jul 17, 2017 at 3:59 PM, 蒋星博 
>>
>> > jiangxb1987@
>>
>> >  wrote:
>> >
>> >> FYI there have been a related discussion here:
>> https://github.com/apache/
>> >> spark/pull/15417#discussion_r85295977
>> >>
>> >> 2017-07-17 15:44 GMT+08:00 Chang Chen 
>>
>> > baibaichen@
>>
>> > :
>> >>
>> >>> Hi All
>> >>>
>> >>> I don't understand the difference between the semantics, I found Spark
>> >>> does the same thing for GroupBy non-deterministic. From Map-Reduce
>> point
>> >>> of
>> >>> view, Join is also GroupBy in essence .
>> >>>
>> >>> @Liang Chi Hsieh
>> >>> https://plus.google.com/u/0/103179362592085650735?prsrc=4;
>> >>>
>> >>> in which situation,  semantics  will be changed?
>> >>>
>> >>> Thanks
>> >>> Chang
>> >>>
>> >>> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh 
>>
>> > viirya@
>>
>> > 
>> >>> wrote:
>> >>>
>> 
>>  Thinking about it more, I think it changes the semantics only under
>>  certain
>>  scenarios.
>> 
>>  For the example SQL query shown in previous discussion, it looks the
>>  same
>>  semantics.
>> 
>> 
>>  Xiao Li wrote
>>  > If the join condition is non-deterministic, pushing it down to the
>>  > underlying project will change the semantics. Thus, we are unable
>> to
>>  do it
>>  > in PullOutNondeterministic. Users can do it manually if they do not
>>  care
>>  > the semantics difference.
>>  >
>>  > Thanks,
>>  >
>>  > Xiao
>>  >
>>  >
>>  >
>>  > 2017-07-16 20:07 GMT-07:00 Chang Chen 
>> 
>>  > baibaichen@
>> 
>>  > :
>>  >
>>  >> It is tedious since we have lots of Hive SQL being migrated to
>>  Spark.
>>  >> And
>>  >> this workaround is equivalent  to insert a Project between Join
>>  operator
>>  >> and its child.
>>  >>
>>  >> Why not do it in PullOutNondeterministic?
>>  >>
>>  >> Thanks
>>  >> Chang
>>  >>
>>  >>
>>  >> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh 
>> 
>>  > viirya@
>> 
>>  >  wrote:
>>  >>
>>  >>>
>>  >>> A possible workaround is to add the rand column into tbl1 with a
>>  >>> projection
>>  >>> before the join.
>>  >>>
>>  >>> SELECT a.col1
>>  >>> FROM (
>>  >>>   SELECT col1,
>>  >>> CASE
>>  >>>  WHEN col2 IS NULL
>>  >>>THEN cast(rand(9)*1000 - 99 as string)
>>  >>>  ELSE
>>  >>>col2
>>  >>> END AS col2
>>  >>> FROM tbl1) a
>>  >>> LEFT OUTER JOIN tbl2 b
>>  >>> ON a.col2 = b.col3;
>>  >>>
>>  >>>
>>  >>>
>>  >>> Chang Chen wrote
>>  >>> > Hi Wenchen
>>  >>> >
>>  >>> > Yes. We also find this error is caused by Rand. However, this
>> is
>>  >>> classic
>>  >>> > way to solve data skew in Hive.  Is there any equivalent way in
>>  Spark?
>>  >>> >
>>  >>> 

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Chang Chen
I see.

Actually, it isn't about evaluation order which user can't specify. It's
about how many times we evaluate the non-deterministic expression for the
same row.

For example, given the SQL:

SELECT a.col1
FROM tbl1 a
LEFT OUTER JOIN tbl2 b
ON
 CASE WHEN a.col2 IS NULL TNEN cast(rand(9)*1000 - 99 as string)
ELSE a.col2 END
=
 CASE WHEN b.col3 IS NULL TNEN cast(rand(9)*1000 - 99 as string)
ELSE b.col3 END;

I think if we exactly evaluate   join key one time for each row of a and b
in the whole pipeline, even if the result isn't deterministic, but the
computation is correct.

Thanks
Chang


On Mon, Jul 17, 2017 at 10:49 PM, Liang-Chi Hsieh  wrote:

>
> IIUC, the evaluation order of rows in Join can be different in different
> physical operators, e.g., Sort-based and Hash-based.
>
> But for non-deterministic expressions, different evaluation orders change
> results.
>
>
>
> Chang Chen wrote
> > I see the issue. I will try https://github.com/apache/spark/pull/18652,
> I
> > think
> >
> > 1 For Join Operator, the left and right plan can't be non-deterministic.
> > 2 If  Filter can support non-deterministic, why not join condition?
> > 3 We can't push down or project non-deterministic expression, since it
> may
> > change semantics.
> >
> > Actually, the real problem is #2. If the join condition could be
> > non-deterministic, then we needn't insert project.
> >
> > Thanks
> > Chang
> >
> >
> >
> >
> > On Mon, Jul 17, 2017 at 3:59 PM, 蒋星博 
>
> > jiangxb1987@
>
> >  wrote:
> >
> >> FYI there have been a related discussion here:
> https://github.com/apache/
> >> spark/pull/15417#discussion_r85295977
> >>
> >> 2017-07-17 15:44 GMT+08:00 Chang Chen 
>
> > baibaichen@
>
> > :
> >>
> >>> Hi All
> >>>
> >>> I don't understand the difference between the semantics, I found Spark
> >>> does the same thing for GroupBy non-deterministic. From Map-Reduce
> point
> >>> of
> >>> view, Join is also GroupBy in essence .
> >>>
> >>> @Liang Chi Hsieh
> >>> https://plus.google.com/u/0/103179362592085650735?prsrc=4;
> >>>
> >>> in which situation,  semantics  will be changed?
> >>>
> >>> Thanks
> >>> Chang
> >>>
> >>> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh 
>
> > viirya@
>
> > 
> >>> wrote:
> >>>
> 
>  Thinking about it more, I think it changes the semantics only under
>  certain
>  scenarios.
> 
>  For the example SQL query shown in previous discussion, it looks the
>  same
>  semantics.
> 
> 
>  Xiao Li wrote
>  > If the join condition is non-deterministic, pushing it down to the
>  > underlying project will change the semantics. Thus, we are unable to
>  do it
>  > in PullOutNondeterministic. Users can do it manually if they do not
>  care
>  > the semantics difference.
>  >
>  > Thanks,
>  >
>  > Xiao
>  >
>  >
>  >
>  > 2017-07-16 20:07 GMT-07:00 Chang Chen 
> 
>  > baibaichen@
> 
>  > :
>  >
>  >> It is tedious since we have lots of Hive SQL being migrated to
>  Spark.
>  >> And
>  >> this workaround is equivalent  to insert a Project between Join
>  operator
>  >> and its child.
>  >>
>  >> Why not do it in PullOutNondeterministic?
>  >>
>  >> Thanks
>  >> Chang
>  >>
>  >>
>  >> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh 
> 
>  > viirya@
> 
>  >  wrote:
>  >>
>  >>>
>  >>> A possible workaround is to add the rand column into tbl1 with a
>  >>> projection
>  >>> before the join.
>  >>>
>  >>> SELECT a.col1
>  >>> FROM (
>  >>>   SELECT col1,
>  >>> CASE
>  >>>  WHEN col2 IS NULL
>  >>>THEN cast(rand(9)*1000 - 99 as string)
>  >>>  ELSE
>  >>>col2
>  >>> END AS col2
>  >>> FROM tbl1) a
>  >>> LEFT OUTER JOIN tbl2 b
>  >>> ON a.col2 = b.col3;
>  >>>
>  >>>
>  >>>
>  >>> Chang Chen wrote
>  >>> > Hi Wenchen
>  >>> >
>  >>> > Yes. We also find this error is caused by Rand. However, this is
>  >>> classic
>  >>> > way to solve data skew in Hive.  Is there any equivalent way in
>  Spark?
>  >>> >
>  >>> > Thanks
>  >>> > Chang
>  >>> >
>  >>> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan 
>  >>>
>  >>> > cloud0fan@
>  >>>
>  >>> >  wrote:
>  >>> >
>  >>> >> It’s not about case when, but about rand(). Non-deterministic
>  >>> expressions
>  >>> >> are not allowed in join condition.
>  >>> >>
>  >>> >> > On 13 Jul 2017, at 6:43 PM, wangshuang 
>  >>>
>  >>> > cn_wss@
>  >>>
>  >>> >  wrote:
>  >>> >> >
>  >>> >> > I'm trying to execute hive sql on spark sql (Also on spark
>  >>> >> thriftserver), For
>  >>> >> > optimizing data skew, we use "case when" to handle null.
>  >>> >> > Simple sql as following:
>  >>> >> >

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh

IIUC, the evaluation order of rows in Join can be different in different
physical operators, e.g., Sort-based and Hash-based.

But for non-deterministic expressions, different evaluation orders change
results.



Chang Chen wrote
> I see the issue. I will try https://github.com/apache/spark/pull/18652, I
> think
> 
> 1 For Join Operator, the left and right plan can't be non-deterministic.
> 2 If  Filter can support non-deterministic, why not join condition?
> 3 We can't push down or project non-deterministic expression, since it may
> change semantics.
> 
> Actually, the real problem is #2. If the join condition could be
> non-deterministic, then we needn't insert project.
> 
> Thanks
> Chang
> 
> 
> 
> 
> On Mon, Jul 17, 2017 at 3:59 PM, 蒋星博 

> jiangxb1987@

>  wrote:
> 
>> FYI there have been a related discussion here: https://github.com/apache/
>> spark/pull/15417#discussion_r85295977
>>
>> 2017-07-17 15:44 GMT+08:00 Chang Chen 

> baibaichen@

> :
>>
>>> Hi All
>>>
>>> I don't understand the difference between the semantics, I found Spark
>>> does the same thing for GroupBy non-deterministic. From Map-Reduce point
>>> of
>>> view, Join is also GroupBy in essence .
>>>
>>> @Liang Chi Hsieh
>>> https://plus.google.com/u/0/103179362592085650735?prsrc=4;
>>>
>>> in which situation,  semantics  will be changed?
>>>
>>> Thanks
>>> Chang
>>>
>>> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh 

> viirya@

> 
>>> wrote:
>>>

 Thinking about it more, I think it changes the semantics only under
 certain
 scenarios.

 For the example SQL query shown in previous discussion, it looks the
 same
 semantics.


 Xiao Li wrote
 > If the join condition is non-deterministic, pushing it down to the
 > underlying project will change the semantics. Thus, we are unable to
 do it
 > in PullOutNondeterministic. Users can do it manually if they do not
 care
 > the semantics difference.
 >
 > Thanks,
 >
 > Xiao
 >
 >
 >
 > 2017-07-16 20:07 GMT-07:00 Chang Chen 

 > baibaichen@

 > :
 >
 >> It is tedious since we have lots of Hive SQL being migrated to
 Spark.
 >> And
 >> this workaround is equivalent  to insert a Project between Join
 operator
 >> and its child.
 >>
 >> Why not do it in PullOutNondeterministic?
 >>
 >> Thanks
 >> Chang
 >>
 >>
 >> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh 

 > viirya@

 >  wrote:
 >>
 >>>
 >>> A possible workaround is to add the rand column into tbl1 with a
 >>> projection
 >>> before the join.
 >>>
 >>> SELECT a.col1
 >>> FROM (
 >>>   SELECT col1,
 >>> CASE
 >>>  WHEN col2 IS NULL
 >>>THEN cast(rand(9)*1000 - 99 as string)
 >>>  ELSE
 >>>col2
 >>> END AS col2
 >>> FROM tbl1) a
 >>> LEFT OUTER JOIN tbl2 b
 >>> ON a.col2 = b.col3;
 >>>
 >>>
 >>>
 >>> Chang Chen wrote
 >>> > Hi Wenchen
 >>> >
 >>> > Yes. We also find this error is caused by Rand. However, this is
 >>> classic
 >>> > way to solve data skew in Hive.  Is there any equivalent way in
 Spark?
 >>> >
 >>> > Thanks
 >>> > Chang
 >>> >
 >>> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan 
 >>>
 >>> > cloud0fan@
 >>>
 >>> >  wrote:
 >>> >
 >>> >> It’s not about case when, but about rand(). Non-deterministic
 >>> expressions
 >>> >> are not allowed in join condition.
 >>> >>
 >>> >> > On 13 Jul 2017, at 6:43 PM, wangshuang 
 >>>
 >>> > cn_wss@
 >>>
 >>> >  wrote:
 >>> >> >
 >>> >> > I'm trying to execute hive sql on spark sql (Also on spark
 >>> >> thriftserver), For
 >>> >> > optimizing data skew, we use "case when" to handle null.
 >>> >> > Simple sql as following:
 >>> >> >
 >>> >> >
 >>> >> > SELECT a.col1
 >>> >> > FROM tbl1 a
 >>> >> > LEFT OUTER JOIN tbl2 b
 >>> >> > ON
 >>> >> > * CASE
 >>> >> >   WHEN a.col2 IS NULL
 >>> >> >   TNEN cast(rand(9)*1000 - 99 as
 >>> string)
 >>> >> >   ELSE
 >>> >> >   a.col2 END *
 >>> >> >   = b.col3;
 >>> >> >
 >>> >> >
 >>> >> > But I get the error:
 >>> >> >
 >>> >> > == Physical Plan ==
 >>> >> > *org.apache.spark.sql.AnalysisException: nondeterministic
 >>> expressions
 >>> >> are
 >>> >> > only allowed in
 >>> >> > Project, Filter, Aggregate or Window, found:*
 >>> >> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) *
 CAST(1000
 >>> AS
 >>> >> > DOUBLE)) - CAST(99L AS DOUBLE)) AS STRING) ELSE
 >>> a.`nav_tcdt`
 >>> >> END
 >>> >> =
 >>> >> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND
 >>> >> 

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Chang Chen
I see the issue. I will try https://github.com/apache/spark/pull/18652, I
think

1 For Join Operator, the left and right plan can't be non-deterministic.
2 If  Filter can support non-deterministic, why not join condition?
3 We can't push down or project non-deterministic expression, since it may
change semantics.

Actually, the real problem is #2. If the join condition could be
non-deterministic, then we needn't insert project.

Thanks
Chang




On Mon, Jul 17, 2017 at 3:59 PM, 蒋星博  wrote:

> FYI there have been a related discussion here: https://github.com/apache/
> spark/pull/15417#discussion_r85295977
>
> 2017-07-17 15:44 GMT+08:00 Chang Chen :
>
>> Hi All
>>
>> I don't understand the difference between the semantics, I found Spark
>> does the same thing for GroupBy non-deterministic. From Map-Reduce point of
>> view, Join is also GroupBy in essence .
>>
>> @Liang Chi Hsieh
>> 
>>
>> in which situation,  semantics  will be changed?
>>
>> Thanks
>> Chang
>>
>> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh 
>> wrote:
>>
>>>
>>> Thinking about it more, I think it changes the semantics only under
>>> certain
>>> scenarios.
>>>
>>> For the example SQL query shown in previous discussion, it looks the same
>>> semantics.
>>>
>>>
>>> Xiao Li wrote
>>> > If the join condition is non-deterministic, pushing it down to the
>>> > underlying project will change the semantics. Thus, we are unable to
>>> do it
>>> > in PullOutNondeterministic. Users can do it manually if they do not
>>> care
>>> > the semantics difference.
>>> >
>>> > Thanks,
>>> >
>>> > Xiao
>>> >
>>> >
>>> >
>>> > 2017-07-16 20:07 GMT-07:00 Chang Chen 
>>>
>>> > baibaichen@
>>>
>>> > :
>>> >
>>> >> It is tedious since we have lots of Hive SQL being migrated to Spark.
>>> >> And
>>> >> this workaround is equivalent  to insert a Project between Join
>>> operator
>>> >> and its child.
>>> >>
>>> >> Why not do it in PullOutNondeterministic?
>>> >>
>>> >> Thanks
>>> >> Chang
>>> >>
>>> >>
>>> >> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh 
>>>
>>> > viirya@
>>>
>>> >  wrote:
>>> >>
>>> >>>
>>> >>> A possible workaround is to add the rand column into tbl1 with a
>>> >>> projection
>>> >>> before the join.
>>> >>>
>>> >>> SELECT a.col1
>>> >>> FROM (
>>> >>>   SELECT col1,
>>> >>> CASE
>>> >>>  WHEN col2 IS NULL
>>> >>>THEN cast(rand(9)*1000 - 99 as string)
>>> >>>  ELSE
>>> >>>col2
>>> >>> END AS col2
>>> >>> FROM tbl1) a
>>> >>> LEFT OUTER JOIN tbl2 b
>>> >>> ON a.col2 = b.col3;
>>> >>>
>>> >>>
>>> >>>
>>> >>> Chang Chen wrote
>>> >>> > Hi Wenchen
>>> >>> >
>>> >>> > Yes. We also find this error is caused by Rand. However, this is
>>> >>> classic
>>> >>> > way to solve data skew in Hive.  Is there any equivalent way in
>>> Spark?
>>> >>> >
>>> >>> > Thanks
>>> >>> > Chang
>>> >>> >
>>> >>> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan 
>>> >>>
>>> >>> > cloud0fan@
>>> >>>
>>> >>> >  wrote:
>>> >>> >
>>> >>> >> It’s not about case when, but about rand(). Non-deterministic
>>> >>> expressions
>>> >>> >> are not allowed in join condition.
>>> >>> >>
>>> >>> >> > On 13 Jul 2017, at 6:43 PM, wangshuang 
>>> >>>
>>> >>> > cn_wss@
>>> >>>
>>> >>> >  wrote:
>>> >>> >> >
>>> >>> >> > I'm trying to execute hive sql on spark sql (Also on spark
>>> >>> >> thriftserver), For
>>> >>> >> > optimizing data skew, we use "case when" to handle null.
>>> >>> >> > Simple sql as following:
>>> >>> >> >
>>> >>> >> >
>>> >>> >> > SELECT a.col1
>>> >>> >> > FROM tbl1 a
>>> >>> >> > LEFT OUTER JOIN tbl2 b
>>> >>> >> > ON
>>> >>> >> > * CASE
>>> >>> >> >   WHEN a.col2 IS NULL
>>> >>> >> >   TNEN cast(rand(9)*1000 - 99 as
>>> >>> string)
>>> >>> >> >   ELSE
>>> >>> >> >   a.col2 END *
>>> >>> >> >   = b.col3;
>>> >>> >> >
>>> >>> >> >
>>> >>> >> > But I get the error:
>>> >>> >> >
>>> >>> >> > == Physical Plan ==
>>> >>> >> > *org.apache.spark.sql.AnalysisException: nondeterministic
>>> >>> expressions
>>> >>> >> are
>>> >>> >> > only allowed in
>>> >>> >> > Project, Filter, Aggregate or Window, found:*
>>> >>> >> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) *
>>> CAST(1000
>>> >>> AS
>>> >>> >> > DOUBLE)) - CAST(99L AS DOUBLE)) AS STRING) ELSE
>>> >>> a.`nav_tcdt`
>>> >>> >> END
>>> >>> >> =
>>> >>> >> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND
>>> >>> >> (c.`cur_flag`
>>> >>> >> =
>>> >>> >> > 1))
>>> >>> >> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25)
>>> THEN
>>> >>> >> > cast(((rand(9) * cast(1000 as double)) - cast(99 as
>>> >>> double))
>>> >>> as
>>> >>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) &&
>>> >>> (cast(nav_tcd#26
>>> >>> as
>>> >>> >> int)
>>> >>> >> > = 9)) && (cur_flag#77 = 1))
>>> >>> >> >   ;;
>>> >>> >> > 

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh

I created a draft pull request for explaining the cases:
https://github.com/apache/spark/pull/18652



Chang Chen wrote
> Hi All
> 
> I don't understand the difference between the semantics, I found Spark
> does
> the same thing for GroupBy non-deterministic. From Map-Reduce point of
> view, Join is also GroupBy in essence .
> 
> @Liang Chi Hsieh
> https://plus.google.com/u/0/103179362592085650735?prsrc=4;
> 
> in which situation,  semantics  will be changed?
> 
> Thanks
> Chang
> 
> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh 

> viirya@

>  wrote:
> 
>>
>> Thinking about it more, I think it changes the semantics only under
>> certain
>> scenarios.
>>
>> For the example SQL query shown in previous discussion, it looks the same
>> semantics.
>>
>>
>> Xiao Li wrote
>> > If the join condition is non-deterministic, pushing it down to the
>> > underlying project will change the semantics. Thus, we are unable to do
>> it
>> > in PullOutNondeterministic. Users can do it manually if they do not
>> care
>> > the semantics difference.
>> >
>> > Thanks,
>> >
>> > Xiao
>> >
>> >
>> >
>> > 2017-07-16 20:07 GMT-07:00 Chang Chen 
>>
>> > baibaichen@
>>
>> > :
>> >
>> >> It is tedious since we have lots of Hive SQL being migrated to Spark.
>> >> And
>> >> this workaround is equivalent  to insert a Project between Join
>> operator
>> >> and its child.
>> >>
>> >> Why not do it in PullOutNondeterministic?
>> >>
>> >> Thanks
>> >> Chang
>> >>
>> >>
>> >> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh 
>>
>> > viirya@
>>
>> >  wrote:
>> >>
>> >>>
>> >>> A possible workaround is to add the rand column into tbl1 with a
>> >>> projection
>> >>> before the join.
>> >>>
>> >>> SELECT a.col1
>> >>> FROM (
>> >>>   SELECT col1,
>> >>> CASE
>> >>>  WHEN col2 IS NULL
>> >>>THEN cast(rand(9)*1000 - 99 as string)
>> >>>  ELSE
>> >>>col2
>> >>> END AS col2
>> >>> FROM tbl1) a
>> >>> LEFT OUTER JOIN tbl2 b
>> >>> ON a.col2 = b.col3;
>> >>>
>> >>>
>> >>>
>> >>> Chang Chen wrote
>> >>> > Hi Wenchen
>> >>> >
>> >>> > Yes. We also find this error is caused by Rand. However, this is
>> >>> classic
>> >>> > way to solve data skew in Hive.  Is there any equivalent way in
>> Spark?
>> >>> >
>> >>> > Thanks
>> >>> > Chang
>> >>> >
>> >>> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan 
>> >>>
>> >>> > cloud0fan@
>> >>>
>> >>> >  wrote:
>> >>> >
>> >>> >> It’s not about case when, but about rand(). Non-deterministic
>> >>> expressions
>> >>> >> are not allowed in join condition.
>> >>> >>
>> >>> >> > On 13 Jul 2017, at 6:43 PM, wangshuang 
>> >>>
>> >>> > cn_wss@
>> >>>
>> >>> >  wrote:
>> >>> >> >
>> >>> >> > I'm trying to execute hive sql on spark sql (Also on spark
>> >>> >> thriftserver), For
>> >>> >> > optimizing data skew, we use "case when" to handle null.
>> >>> >> > Simple sql as following:
>> >>> >> >
>> >>> >> >
>> >>> >> > SELECT a.col1
>> >>> >> > FROM tbl1 a
>> >>> >> > LEFT OUTER JOIN tbl2 b
>> >>> >> > ON
>> >>> >> > * CASE
>> >>> >> >   WHEN a.col2 IS NULL
>> >>> >> >   TNEN cast(rand(9)*1000 - 99 as
>> >>> string)
>> >>> >> >   ELSE
>> >>> >> >   a.col2 END *
>> >>> >> >   = b.col3;
>> >>> >> >
>> >>> >> >
>> >>> >> > But I get the error:
>> >>> >> >
>> >>> >> > == Physical Plan ==
>> >>> >> > *org.apache.spark.sql.AnalysisException: nondeterministic
>> >>> expressions
>> >>> >> are
>> >>> >> > only allowed in
>> >>> >> > Project, Filter, Aggregate or Window, found:*
>> >>> >> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) *
>> CAST(1000
>> >>> AS
>> >>> >> > DOUBLE)) - CAST(99L AS DOUBLE)) AS STRING) ELSE
>> >>> a.`nav_tcdt`
>> >>> >> END
>> >>> >> =
>> >>> >> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND
>> >>> >> (c.`cur_flag`
>> >>> >> =
>> >>> >> > 1))
>> >>> >> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25)
>> THEN
>> >>> >> > cast(((rand(9) * cast(1000 as double)) - cast(99 as
>> >>> double))
>> >>> as
>> >>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) &&
>> >>> (cast(nav_tcd#26
>> >>> as
>> >>> >> int)
>> >>> >> > = 9)) && (cur_flag#77 = 1))
>> >>> >> >   ;;
>> >>> >> > GlobalLimit 10
>> >>> >> > +- LocalLimit 10
>> >>> >> >   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string)
>> IN
>> >>> >> > (cast(19596 as string),cast(20134 as string),cast(10997 as
>> string))
>> >>> &&
>> >>> >> > nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21
>> >>> END],
>> >>> >> > [date_id#7]
>> >>> >> >  +- Filter (date_id#7 = 2017-07-12)
>> >>> >> > +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>> >>> >> > cast(((rand(9) * cast(1000 as double)) - cast(99 as
>> >>> double))
>> >>> as
>> >>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) &&
>> >>> (cast(nav_tcd#26
>> >>> as
>> >>> >> int)
>> >>> >> > = 9)) && (cur_flag#77 = 1))
>> >>> >> >:- 

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread 蒋星博
FYI there have been a related discussion here:
https://github.com/apache/spark/pull/15417#discussion_r85295977

2017-07-17 15:44 GMT+08:00 Chang Chen :

> Hi All
>
> I don't understand the difference between the semantics, I found Spark
> does the same thing for GroupBy non-deterministic. From Map-Reduce point of
> view, Join is also GroupBy in essence .
>
> @Liang Chi Hsieh
> 
>
> in which situation,  semantics  will be changed?
>
> Thanks
> Chang
>
> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh  wrote:
>
>>
>> Thinking about it more, I think it changes the semantics only under
>> certain
>> scenarios.
>>
>> For the example SQL query shown in previous discussion, it looks the same
>> semantics.
>>
>>
>> Xiao Li wrote
>> > If the join condition is non-deterministic, pushing it down to the
>> > underlying project will change the semantics. Thus, we are unable to do
>> it
>> > in PullOutNondeterministic. Users can do it manually if they do not care
>> > the semantics difference.
>> >
>> > Thanks,
>> >
>> > Xiao
>> >
>> >
>> >
>> > 2017-07-16 20:07 GMT-07:00 Chang Chen 
>>
>> > baibaichen@
>>
>> > :
>> >
>> >> It is tedious since we have lots of Hive SQL being migrated to Spark.
>> >> And
>> >> this workaround is equivalent  to insert a Project between Join
>> operator
>> >> and its child.
>> >>
>> >> Why not do it in PullOutNondeterministic?
>> >>
>> >> Thanks
>> >> Chang
>> >>
>> >>
>> >> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh 
>>
>> > viirya@
>>
>> >  wrote:
>> >>
>> >>>
>> >>> A possible workaround is to add the rand column into tbl1 with a
>> >>> projection
>> >>> before the join.
>> >>>
>> >>> SELECT a.col1
>> >>> FROM (
>> >>>   SELECT col1,
>> >>> CASE
>> >>>  WHEN col2 IS NULL
>> >>>THEN cast(rand(9)*1000 - 99 as string)
>> >>>  ELSE
>> >>>col2
>> >>> END AS col2
>> >>> FROM tbl1) a
>> >>> LEFT OUTER JOIN tbl2 b
>> >>> ON a.col2 = b.col3;
>> >>>
>> >>>
>> >>>
>> >>> Chang Chen wrote
>> >>> > Hi Wenchen
>> >>> >
>> >>> > Yes. We also find this error is caused by Rand. However, this is
>> >>> classic
>> >>> > way to solve data skew in Hive.  Is there any equivalent way in
>> Spark?
>> >>> >
>> >>> > Thanks
>> >>> > Chang
>> >>> >
>> >>> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan 
>> >>>
>> >>> > cloud0fan@
>> >>>
>> >>> >  wrote:
>> >>> >
>> >>> >> It’s not about case when, but about rand(). Non-deterministic
>> >>> expressions
>> >>> >> are not allowed in join condition.
>> >>> >>
>> >>> >> > On 13 Jul 2017, at 6:43 PM, wangshuang 
>> >>>
>> >>> > cn_wss@
>> >>>
>> >>> >  wrote:
>> >>> >> >
>> >>> >> > I'm trying to execute hive sql on spark sql (Also on spark
>> >>> >> thriftserver), For
>> >>> >> > optimizing data skew, we use "case when" to handle null.
>> >>> >> > Simple sql as following:
>> >>> >> >
>> >>> >> >
>> >>> >> > SELECT a.col1
>> >>> >> > FROM tbl1 a
>> >>> >> > LEFT OUTER JOIN tbl2 b
>> >>> >> > ON
>> >>> >> > * CASE
>> >>> >> >   WHEN a.col2 IS NULL
>> >>> >> >   TNEN cast(rand(9)*1000 - 99 as
>> >>> string)
>> >>> >> >   ELSE
>> >>> >> >   a.col2 END *
>> >>> >> >   = b.col3;
>> >>> >> >
>> >>> >> >
>> >>> >> > But I get the error:
>> >>> >> >
>> >>> >> > == Physical Plan ==
>> >>> >> > *org.apache.spark.sql.AnalysisException: nondeterministic
>> >>> expressions
>> >>> >> are
>> >>> >> > only allowed in
>> >>> >> > Project, Filter, Aggregate or Window, found:*
>> >>> >> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) *
>> CAST(1000
>> >>> AS
>> >>> >> > DOUBLE)) - CAST(99L AS DOUBLE)) AS STRING) ELSE
>> >>> a.`nav_tcdt`
>> >>> >> END
>> >>> >> =
>> >>> >> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND
>> >>> >> (c.`cur_flag`
>> >>> >> =
>> >>> >> > 1))
>> >>> >> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>> >>> >> > cast(((rand(9) * cast(1000 as double)) - cast(99 as
>> >>> double))
>> >>> as
>> >>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) &&
>> >>> (cast(nav_tcd#26
>> >>> as
>> >>> >> int)
>> >>> >> > = 9)) && (cur_flag#77 = 1))
>> >>> >> >   ;;
>> >>> >> > GlobalLimit 10
>> >>> >> > +- LocalLimit 10
>> >>> >> >   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string)
>> IN
>> >>> >> > (cast(19596 as string),cast(20134 as string),cast(10997 as
>> string))
>> >>> &&
>> >>> >> > nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21
>> >>> END],
>> >>> >> > [date_id#7]
>> >>> >> >  +- Filter (date_id#7 = 2017-07-12)
>> >>> >> > +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>> >>> >> > cast(((rand(9) * cast(1000 as double)) - cast(99 as
>> >>> double))
>> >>> as
>> >>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) &&
>> >>> (cast(nav_tcd#26
>> >>> as
>> >>> >> int)
>> >>> >> > = 9)) && 

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Chang Chen
Hi All

I don't understand the difference between the semantics, I found Spark does
the same thing for GroupBy non-deterministic. From Map-Reduce point of
view, Join is also GroupBy in essence .

@Liang Chi Hsieh 

in which situation,  semantics  will be changed?

Thanks
Chang

On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh  wrote:

>
> Thinking about it more, I think it changes the semantics only under certain
> scenarios.
>
> For the example SQL query shown in previous discussion, it looks the same
> semantics.
>
>
> Xiao Li wrote
> > If the join condition is non-deterministic, pushing it down to the
> > underlying project will change the semantics. Thus, we are unable to do
> it
> > in PullOutNondeterministic. Users can do it manually if they do not care
> > the semantics difference.
> >
> > Thanks,
> >
> > Xiao
> >
> >
> >
> > 2017-07-16 20:07 GMT-07:00 Chang Chen 
>
> > baibaichen@
>
> > :
> >
> >> It is tedious since we have lots of Hive SQL being migrated to Spark.
> >> And
> >> this workaround is equivalent  to insert a Project between Join operator
> >> and its child.
> >>
> >> Why not do it in PullOutNondeterministic?
> >>
> >> Thanks
> >> Chang
> >>
> >>
> >> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh 
>
> > viirya@
>
> >  wrote:
> >>
> >>>
> >>> A possible workaround is to add the rand column into tbl1 with a
> >>> projection
> >>> before the join.
> >>>
> >>> SELECT a.col1
> >>> FROM (
> >>>   SELECT col1,
> >>> CASE
> >>>  WHEN col2 IS NULL
> >>>THEN cast(rand(9)*1000 - 99 as string)
> >>>  ELSE
> >>>col2
> >>> END AS col2
> >>> FROM tbl1) a
> >>> LEFT OUTER JOIN tbl2 b
> >>> ON a.col2 = b.col3;
> >>>
> >>>
> >>>
> >>> Chang Chen wrote
> >>> > Hi Wenchen
> >>> >
> >>> > Yes. We also find this error is caused by Rand. However, this is
> >>> classic
> >>> > way to solve data skew in Hive.  Is there any equivalent way in
> Spark?
> >>> >
> >>> > Thanks
> >>> > Chang
> >>> >
> >>> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan 
> >>>
> >>> > cloud0fan@
> >>>
> >>> >  wrote:
> >>> >
> >>> >> It’s not about case when, but about rand(). Non-deterministic
> >>> expressions
> >>> >> are not allowed in join condition.
> >>> >>
> >>> >> > On 13 Jul 2017, at 6:43 PM, wangshuang 
> >>>
> >>> > cn_wss@
> >>>
> >>> >  wrote:
> >>> >> >
> >>> >> > I'm trying to execute hive sql on spark sql (Also on spark
> >>> >> thriftserver), For
> >>> >> > optimizing data skew, we use "case when" to handle null.
> >>> >> > Simple sql as following:
> >>> >> >
> >>> >> >
> >>> >> > SELECT a.col1
> >>> >> > FROM tbl1 a
> >>> >> > LEFT OUTER JOIN tbl2 b
> >>> >> > ON
> >>> >> > * CASE
> >>> >> >   WHEN a.col2 IS NULL
> >>> >> >   TNEN cast(rand(9)*1000 - 99 as
> >>> string)
> >>> >> >   ELSE
> >>> >> >   a.col2 END *
> >>> >> >   = b.col3;
> >>> >> >
> >>> >> >
> >>> >> > But I get the error:
> >>> >> >
> >>> >> > == Physical Plan ==
> >>> >> > *org.apache.spark.sql.AnalysisException: nondeterministic
> >>> expressions
> >>> >> are
> >>> >> > only allowed in
> >>> >> > Project, Filter, Aggregate or Window, found:*
> >>> >> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) *
> CAST(1000
> >>> AS
> >>> >> > DOUBLE)) - CAST(99L AS DOUBLE)) AS STRING) ELSE
> >>> a.`nav_tcdt`
> >>> >> END
> >>> >> =
> >>> >> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND
> >>> >> (c.`cur_flag`
> >>> >> =
> >>> >> > 1))
> >>> >> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
> >>> >> > cast(((rand(9) * cast(1000 as double)) - cast(99 as
> >>> double))
> >>> as
> >>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) &&
> >>> (cast(nav_tcd#26
> >>> as
> >>> >> int)
> >>> >> > = 9)) && (cur_flag#77 = 1))
> >>> >> >   ;;
> >>> >> > GlobalLimit 10
> >>> >> > +- LocalLimit 10
> >>> >> >   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string)
> IN
> >>> >> > (cast(19596 as string),cast(20134 as string),cast(10997 as
> string))
> >>> &&
> >>> >> > nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21
> >>> END],
> >>> >> > [date_id#7]
> >>> >> >  +- Filter (date_id#7 = 2017-07-12)
> >>> >> > +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
> >>> >> > cast(((rand(9) * cast(1000 as double)) - cast(99 as
> >>> double))
> >>> as
> >>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) &&
> >>> (cast(nav_tcd#26
> >>> as
> >>> >> int)
> >>> >> > = 9)) && (cur_flag#77 = 1))
> >>> >> >:- SubqueryAlias a
> >>> >> >:  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
> >>> >> >: +- CatalogRelation
> >>> `tmp`.`tmp_lifan_trfc_tpa_hive`,
> >>> >> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7,
> >>> >> chanl_id#8L,
> >>> >> > pltfm_id#9, city_id#10, sessn_id#11, gu_id#12,
> 

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh

Thinking about it more, I think it changes the semantics only under certain
scenarios.

For the example SQL query shown in previous discussion, it looks the same
semantics.


Xiao Li wrote
> If the join condition is non-deterministic, pushing it down to the
> underlying project will change the semantics. Thus, we are unable to do it
> in PullOutNondeterministic. Users can do it manually if they do not care
> the semantics difference.
> 
> Thanks,
> 
> Xiao
> 
> 
> 
> 2017-07-16 20:07 GMT-07:00 Chang Chen 

> baibaichen@

> :
> 
>> It is tedious since we have lots of Hive SQL being migrated to Spark. 
>> And
>> this workaround is equivalent  to insert a Project between Join operator
>> and its child.
>>
>> Why not do it in PullOutNondeterministic?
>>
>> Thanks
>> Chang
>>
>>
>> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh 

> viirya@

>  wrote:
>>
>>>
>>> A possible workaround is to add the rand column into tbl1 with a
>>> projection
>>> before the join.
>>>
>>> SELECT a.col1
>>> FROM (
>>>   SELECT col1,
>>> CASE
>>>  WHEN col2 IS NULL
>>>THEN cast(rand(9)*1000 - 99 as string)
>>>  ELSE
>>>col2
>>> END AS col2
>>> FROM tbl1) a
>>> LEFT OUTER JOIN tbl2 b
>>> ON a.col2 = b.col3;
>>>
>>>
>>>
>>> Chang Chen wrote
>>> > Hi Wenchen
>>> >
>>> > Yes. We also find this error is caused by Rand. However, this is
>>> classic
>>> > way to solve data skew in Hive.  Is there any equivalent way in Spark?
>>> >
>>> > Thanks
>>> > Chang
>>> >
>>> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan 
>>>
>>> > cloud0fan@
>>>
>>> >  wrote:
>>> >
>>> >> It’s not about case when, but about rand(). Non-deterministic
>>> expressions
>>> >> are not allowed in join condition.
>>> >>
>>> >> > On 13 Jul 2017, at 6:43 PM, wangshuang 
>>>
>>> > cn_wss@
>>>
>>> >  wrote:
>>> >> >
>>> >> > I'm trying to execute hive sql on spark sql (Also on spark
>>> >> thriftserver), For
>>> >> > optimizing data skew, we use "case when" to handle null.
>>> >> > Simple sql as following:
>>> >> >
>>> >> >
>>> >> > SELECT a.col1
>>> >> > FROM tbl1 a
>>> >> > LEFT OUTER JOIN tbl2 b
>>> >> > ON
>>> >> > * CASE
>>> >> >   WHEN a.col2 IS NULL
>>> >> >   TNEN cast(rand(9)*1000 - 99 as
>>> string)
>>> >> >   ELSE
>>> >> >   a.col2 END *
>>> >> >   = b.col3;
>>> >> >
>>> >> >
>>> >> > But I get the error:
>>> >> >
>>> >> > == Physical Plan ==
>>> >> > *org.apache.spark.sql.AnalysisException: nondeterministic
>>> expressions
>>> >> are
>>> >> > only allowed in
>>> >> > Project, Filter, Aggregate or Window, found:*
>>> >> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000
>>> AS
>>> >> > DOUBLE)) - CAST(99L AS DOUBLE)) AS STRING) ELSE
>>> a.`nav_tcdt`
>>> >> END
>>> >> =
>>> >> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND
>>> >> (c.`cur_flag`
>>> >> =
>>> >> > 1))
>>> >> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>>> >> > cast(((rand(9) * cast(1000 as double)) - cast(99 as
>>> double))
>>> as
>>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) &&
>>> (cast(nav_tcd#26
>>> as
>>> >> int)
>>> >> > = 9)) && (cur_flag#77 = 1))
>>> >> >   ;;
>>> >> > GlobalLimit 10
>>> >> > +- LocalLimit 10
>>> >> >   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN
>>> >> > (cast(19596 as string),cast(20134 as string),cast(10997 as string))
>>> &&
>>> >> > nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21
>>> END],
>>> >> > [date_id#7]
>>> >> >  +- Filter (date_id#7 = 2017-07-12)
>>> >> > +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>>> >> > cast(((rand(9) * cast(1000 as double)) - cast(99 as
>>> double))
>>> as
>>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) &&
>>> (cast(nav_tcd#26
>>> as
>>> >> int)
>>> >> > = 9)) && (cur_flag#77 = 1))
>>> >> >:- SubqueryAlias a
>>> >> >:  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
>>> >> >: +- CatalogRelation
>>> `tmp`.`tmp_lifan_trfc_tpa_hive`,
>>> >> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7,
>>> >> chanl_id#8L,
>>> >> > pltfm_id#9, city_id#10, sessn_id#11, gu_id#12,
>>> >> nav_refer_page_type_id#13,
>>> >> > nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16,
>>> >> > nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19,
>>> >> nav_page_value#20,
>>> >> > nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25,
>>> >> nav_tcd#26,
>>> >> > nav_tci#27, nav_tce#28, detl_refer_page_type_id#29,
>>> >> > detl_refer_page_value#30, ... 33 more fields]
>>> >> >+- SubqueryAlias c
>>> >> >   +- SubqueryAlias dim_site_categ_ext
>>> >> >  +- CatalogRelation `dw`.`dim_site_categ_ext`,
>>> >> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
>>> >> [site_categ_skid#64L,
>>> >> > site_categ_type#65, site_categ_code#66, 

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-16 Thread Xiao Li
If the join condition is non-deterministic, pushing it down to the
underlying project will change the semantics. Thus, we are unable to do it
in PullOutNondeterministic. Users can do it manually if they do not care
the semantics difference.

Thanks,

Xiao



2017-07-16 20:07 GMT-07:00 Chang Chen :

> It is tedious since we have lots of Hive SQL being migrated to Spark.  And
> this workaround is equivalent  to insert a Project between Join operator
> and its child.
>
> Why not do it in PullOutNondeterministic?
>
> Thanks
> Chang
>
>
> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh  wrote:
>
>>
>> A possible workaround is to add the rand column into tbl1 with a
>> projection
>> before the join.
>>
>> SELECT a.col1
>> FROM (
>>   SELECT col1,
>> CASE
>>  WHEN col2 IS NULL
>>THEN cast(rand(9)*1000 - 99 as string)
>>  ELSE
>>col2
>> END AS col2
>> FROM tbl1) a
>> LEFT OUTER JOIN tbl2 b
>> ON a.col2 = b.col3;
>>
>>
>>
>> Chang Chen wrote
>> > Hi Wenchen
>> >
>> > Yes. We also find this error is caused by Rand. However, this is classic
>> > way to solve data skew in Hive.  Is there any equivalent way in Spark?
>> >
>> > Thanks
>> > Chang
>> >
>> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan 
>>
>> > cloud0fan@
>>
>> >  wrote:
>> >
>> >> It’s not about case when, but about rand(). Non-deterministic
>> expressions
>> >> are not allowed in join condition.
>> >>
>> >> > On 13 Jul 2017, at 6:43 PM, wangshuang 
>>
>> > cn_wss@
>>
>> >  wrote:
>> >> >
>> >> > I'm trying to execute hive sql on spark sql (Also on spark
>> >> thriftserver), For
>> >> > optimizing data skew, we use "case when" to handle null.
>> >> > Simple sql as following:
>> >> >
>> >> >
>> >> > SELECT a.col1
>> >> > FROM tbl1 a
>> >> > LEFT OUTER JOIN tbl2 b
>> >> > ON
>> >> > * CASE
>> >> >   WHEN a.col2 IS NULL
>> >> >   TNEN cast(rand(9)*1000 - 99 as string)
>> >> >   ELSE
>> >> >   a.col2 END *
>> >> >   = b.col3;
>> >> >
>> >> >
>> >> > But I get the error:
>> >> >
>> >> > == Physical Plan ==
>> >> > *org.apache.spark.sql.AnalysisException: nondeterministic
>> expressions
>> >> are
>> >> > only allowed in
>> >> > Project, Filter, Aggregate or Window, found:*
>> >> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000
>> AS
>> >> > DOUBLE)) - CAST(99L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt`
>> >> END
>> >> =
>> >> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND
>> >> (c.`cur_flag`
>> >> =
>> >> > 1))
>> >> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>> >> > cast(((rand(9) * cast(1000 as double)) - cast(99 as double))
>> as
>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26
>> as
>> >> int)
>> >> > = 9)) && (cur_flag#77 = 1))
>> >> >   ;;
>> >> > GlobalLimit 10
>> >> > +- LocalLimit 10
>> >> >   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN
>> >> > (cast(19596 as string),cast(20134 as string),cast(10997 as string))
>> &&
>> >> > nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END],
>> >> > [date_id#7]
>> >> >  +- Filter (date_id#7 = 2017-07-12)
>> >> > +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>> >> > cast(((rand(9) * cast(1000 as double)) - cast(99 as double))
>> as
>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26
>> as
>> >> int)
>> >> > = 9)) && (cur_flag#77 = 1))
>> >> >:- SubqueryAlias a
>> >> >:  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
>> >> >: +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`,
>> >> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7,
>> >> chanl_id#8L,
>> >> > pltfm_id#9, city_id#10, sessn_id#11, gu_id#12,
>> >> nav_refer_page_type_id#13,
>> >> > nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16,
>> >> > nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19,
>> >> nav_page_value#20,
>> >> > nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25,
>> >> nav_tcd#26,
>> >> > nav_tci#27, nav_tce#28, detl_refer_page_type_id#29,
>> >> > detl_refer_page_value#30, ... 33 more fields]
>> >> >+- SubqueryAlias c
>> >> >   +- SubqueryAlias dim_site_categ_ext
>> >> >  +- CatalogRelation `dw`.`dim_site_categ_ext`,
>> >> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
>> >> [site_categ_skid#64L,
>> >> > site_categ_type#65, site_categ_code#66, site_categ_name#67,
>> >> > site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L,
>> >> sort_seq#71L,
>> >> > site_categ_srch_name#72, vsbl_flg#73, delet_flag#74,
>> etl_batch_id#75L,
>> >> > updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L,
>> bkgrnd_categ_id#79L,
>> >> > site_categ_id#80, site_categ_parnt_id#81]
>> >> >
>> >> > Does spark sql not support syntax "case when" in JOIN?  

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-16 Thread Chang Chen
It is tedious since we have lots of Hive SQL being migrated to Spark.  And
this workaround is equivalent  to insert a Project between Join operator
and its child.

Why not do it in PullOutNondeterministic?

Thanks
Chang

On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh  wrote:

>
> A possible workaround is to add the rand column into tbl1 with a projection
> before the join.
>
> SELECT a.col1
> FROM (
>   SELECT col1,
> CASE
>  WHEN col2 IS NULL
>THEN cast(rand(9)*1000 - 99 as string)
>  ELSE
>col2
> END AS col2
> FROM tbl1) a
> LEFT OUTER JOIN tbl2 b
> ON a.col2 = b.col3;
>
>
>
> Chang Chen wrote
> > Hi Wenchen
> >
> > Yes. We also find this error is caused by Rand. However, this is classic
> > way to solve data skew in Hive.  Is there any equivalent way in Spark?
> >
> > Thanks
> > Chang
> >
> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan 
>
> > cloud0fan@
>
> >  wrote:
> >
> >> It’s not about case when, but about rand(). Non-deterministic
> expressions
> >> are not allowed in join condition.
> >>
> >> > On 13 Jul 2017, at 6:43 PM, wangshuang 
>
> > cn_wss@
>
> >  wrote:
> >> >
> >> > I'm trying to execute hive sql on spark sql (Also on spark
> >> thriftserver), For
> >> > optimizing data skew, we use "case when" to handle null.
> >> > Simple sql as following:
> >> >
> >> >
> >> > SELECT a.col1
> >> > FROM tbl1 a
> >> > LEFT OUTER JOIN tbl2 b
> >> > ON
> >> > * CASE
> >> >   WHEN a.col2 IS NULL
> >> >   TNEN cast(rand(9)*1000 - 99 as string)
> >> >   ELSE
> >> >   a.col2 END *
> >> >   = b.col3;
> >> >
> >> >
> >> > But I get the error:
> >> >
> >> > == Physical Plan ==
> >> > *org.apache.spark.sql.AnalysisException: nondeterministic expressions
> >> are
> >> > only allowed in
> >> > Project, Filter, Aggregate or Window, found:*
> >> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS
> >> > DOUBLE)) - CAST(99L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt`
> >> END
> >> =
> >> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND
> >> (c.`cur_flag`
> >> =
> >> > 1))
> >> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
> >> > cast(((rand(9) * cast(1000 as double)) - cast(99 as double))
> as
> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26
> as
> >> int)
> >> > = 9)) && (cur_flag#77 = 1))
> >> >   ;;
> >> > GlobalLimit 10
> >> > +- LocalLimit 10
> >> >   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN
> >> > (cast(19596 as string),cast(20134 as string),cast(10997 as string)) &&
> >> > nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END],
> >> > [date_id#7]
> >> >  +- Filter (date_id#7 = 2017-07-12)
> >> > +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
> >> > cast(((rand(9) * cast(1000 as double)) - cast(99 as double))
> as
> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26
> as
> >> int)
> >> > = 9)) && (cur_flag#77 = 1))
> >> >:- SubqueryAlias a
> >> >:  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
> >> >: +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`,
> >> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7,
> >> chanl_id#8L,
> >> > pltfm_id#9, city_id#10, sessn_id#11, gu_id#12,
> >> nav_refer_page_type_id#13,
> >> > nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16,
> >> > nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19,
> >> nav_page_value#20,
> >> > nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25,
> >> nav_tcd#26,
> >> > nav_tci#27, nav_tce#28, detl_refer_page_type_id#29,
> >> > detl_refer_page_value#30, ... 33 more fields]
> >> >+- SubqueryAlias c
> >> >   +- SubqueryAlias dim_site_categ_ext
> >> >  +- CatalogRelation `dw`.`dim_site_categ_ext`,
> >> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
> >> [site_categ_skid#64L,
> >> > site_categ_type#65, site_categ_code#66, site_categ_name#67,
> >> > site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L,
> >> sort_seq#71L,
> >> > site_categ_srch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L,
> >> > updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L,
> >> > site_categ_id#80, site_categ_parnt_id#81]
> >> >
> >> > Does spark sql not support syntax "case when" in JOIN?  Additional, my
> >> spark
> >> > version is 2.2.0.
> >> > Any help would be greatly appreciated.
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > View this message in context: http://apache-spark-developers
> >> -list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-
> >> be-supported-in-JOIN-tp21953.html
> >> > Sent from the Apache Spark Developers List mailing list archive at
> >> Nabble.com.
> >> >
> >> > -
> >> > To 

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-14 Thread Liang-Chi Hsieh

A possible workaround is to add the rand column into tbl1 with a projection
before the join.

SELECT a.col1
FROM (
  SELECT col1,
CASE
 WHEN col2 IS NULL
   THEN cast(rand(9)*1000 - 99 as string)
 ELSE
   col2
END AS col2
FROM tbl1) a
LEFT OUTER JOIN tbl2 b
ON a.col2 = b.col3;



Chang Chen wrote
> Hi Wenchen
> 
> Yes. We also find this error is caused by Rand. However, this is classic
> way to solve data skew in Hive.  Is there any equivalent way in Spark?
> 
> Thanks
> Chang
> 
> On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan 

> cloud0fan@

>  wrote:
> 
>> It’s not about case when, but about rand(). Non-deterministic expressions
>> are not allowed in join condition.
>>
>> > On 13 Jul 2017, at 6:43 PM, wangshuang 

> cn_wss@

>  wrote:
>> >
>> > I'm trying to execute hive sql on spark sql (Also on spark
>> thriftserver), For
>> > optimizing data skew, we use "case when" to handle null.
>> > Simple sql as following:
>> >
>> >
>> > SELECT a.col1
>> > FROM tbl1 a
>> > LEFT OUTER JOIN tbl2 b
>> > ON
>> > * CASE
>> >   WHEN a.col2 IS NULL
>> >   TNEN cast(rand(9)*1000 - 99 as string)
>> >   ELSE
>> >   a.col2 END *
>> >   = b.col3;
>> >
>> >
>> > But I get the error:
>> >
>> > == Physical Plan ==
>> > *org.apache.spark.sql.AnalysisException: nondeterministic expressions
>> are
>> > only allowed in
>> > Project, Filter, Aggregate or Window, found:*
>> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS
>> > DOUBLE)) - CAST(99L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt`
>> END
>> =
>> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND
>> (c.`cur_flag`
>> =
>> > 1))
>> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>> > cast(((rand(9) * cast(1000 as double)) - cast(99 as double)) as
>> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as
>> int)
>> > = 9)) && (cur_flag#77 = 1))
>> >   ;;
>> > GlobalLimit 10
>> > +- LocalLimit 10
>> >   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN
>> > (cast(19596 as string),cast(20134 as string),cast(10997 as string)) &&
>> > nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END],
>> > [date_id#7]
>> >  +- Filter (date_id#7 = 2017-07-12)
>> > +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>> > cast(((rand(9) * cast(1000 as double)) - cast(99 as double)) as
>> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as
>> int)
>> > = 9)) && (cur_flag#77 = 1))
>> >:- SubqueryAlias a
>> >:  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
>> >: +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`,
>> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7,
>> chanl_id#8L,
>> > pltfm_id#9, city_id#10, sessn_id#11, gu_id#12,
>> nav_refer_page_type_id#13,
>> > nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16,
>> > nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19,
>> nav_page_value#20,
>> > nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25,
>> nav_tcd#26,
>> > nav_tci#27, nav_tce#28, detl_refer_page_type_id#29,
>> > detl_refer_page_value#30, ... 33 more fields]
>> >+- SubqueryAlias c
>> >   +- SubqueryAlias dim_site_categ_ext
>> >  +- CatalogRelation `dw`.`dim_site_categ_ext`,
>> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
>> [site_categ_skid#64L,
>> > site_categ_type#65, site_categ_code#66, site_categ_name#67,
>> > site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L,
>> sort_seq#71L,
>> > site_categ_srch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L,
>> > updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L,
>> > site_categ_id#80, site_categ_parnt_id#81]
>> >
>> > Does spark sql not support syntax "case when" in JOIN?  Additional, my
>> spark
>> > version is 2.2.0.
>> > Any help would be greatly appreciated.
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context: http://apache-spark-developers
>> -list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-
>> be-supported-in-JOIN-tp21953.html
>> > Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>> >
>> > -
>> > To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>> >
>>
>>
>> -
>> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>>
>>





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-be-supported-in-JOIN-tp21953p21961.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re:[SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-13 Thread StanZhai
A workaround is diffcult.
You should consider merging this PR 
 into your Spark.







 
 "wangshuang [via Apache Spark Developers 
List]" wroted at 2017-07-13 18:43:

I'm trying to execute hive sql on spark sql (Also on spark thriftserver), For 
optimizing data skew, we use "case when" to handle null.
Simple sql as following:


SELECT a.col1
 FROM tbl1 a
 LEFT OUTER JOIN tbl2 b
 ON
 CASE
 WHEN a.col2 IS NULL
 TNEN cast(rand(9)*1000 - 99 as string)
 ELSE
 a.col2 END
 = b.col3;


But I get the error:

== Physical Plan ==
org.apache.spark.sql.AnalysisException: nondeterministic expressions are only 
allowed in
Project, Filter, Aggregate or Window, found:
 (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS 
DOUBLE)) - CAST(99L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt` END = 
c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND (c.`cur_flag` = 1))
in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN 
cast(((rand(9) * cast(1000 as double)) - cast(99 as double)) as string) 
ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) && 
(cur_flag#77 = 1))
   ;;
GlobalLimit 10
+- LocalLimit 10
   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN 
(cast(19596 as string),cast(20134 as string),cast(10997 as string)) && 
nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END], [date_id#7]
  +- Filter (date_id#7 = 2017-07-12)
 +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN 
cast(((rand(9) * cast(1000 as double)) - cast(99 as double)) as string) 
ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) && 
(cur_flag#77 = 1))
:- SubqueryAlias a
:  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
: +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7, chanl_id#8L, 
pltfm_id#9, city_id#10, sessn_id#11, gu_id#12, nav_refer_page_type_id#13, 
nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16, 
nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19, nav_page_value#20, 
nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25, nav_tcd#26, 
nav_tci#27, nav_tce#28, detl_refer_page_type_id#29, detl_refer_page_value#30, 
... 33 more fields]
+- SubqueryAlias c
   +- SubqueryAlias dim_site_categ_ext
  +- CatalogRelation `dw`.`dim_site_categ_ext`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [site_categ_skid#64L, 
site_categ_type#65, site_categ_code#66, site_categ_name#67, 
site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L, sort_seq#71L, 
site_categ_srch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L, 
updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L, 
site_categ_id#80, site_categ_parnt_id#81]

Does spark sql not support syntax "case when" in JOIN?  Additional, my spark 
version is 2.2.0.
Any help would be greatly appreciated.

If you reply to this email, your message will be added to the discussion below:
http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-be-supported-in-JOIN-tp21953.html
To start a new topic under Apache Spark Developers List, email 
ml+s1001551n1...@n3.nabble.com
 To unsubscribe from Apache Spark Developers List, click here.
 NAML







--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Re-SQL-Syntax-case-when-doesn-t-be-supported-in-JOIN-tp21960.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-13 Thread Herman van Hövell tot Westerflier
Just move the case expression into an underlying select clause.

On Thu, Jul 13, 2017 at 3:10 PM, Chang Chen  wrote:

> Hi Wenchen
>
> Yes. We also find this error is caused by Rand. However, this is classic
> way to solve data skew in Hive.  Is there any equivalent way in Spark?
>
> Thanks
> Chang
>
>
> On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan  wrote:
>
>> It’s not about case when, but about rand(). Non-deterministic expressions
>> are not allowed in join condition.
>>
>> > On 13 Jul 2017, at 6:43 PM, wangshuang  wrote:
>> >
>> > I'm trying to execute hive sql on spark sql (Also on spark
>> thriftserver), For
>> > optimizing data skew, we use "case when" to handle null.
>> > Simple sql as following:
>> >
>> >
>> > SELECT a.col1
>> > FROM tbl1 a
>> > LEFT OUTER JOIN tbl2 b
>> > ON
>> > * CASE
>> >   WHEN a.col2 IS NULL
>> >   TNEN cast(rand(9)*1000 - 99 as string)
>> >   ELSE
>> >   a.col2 END *
>> >   = b.col3;
>> >
>> >
>> > But I get the error:
>> >
>> > == Physical Plan ==
>> > *org.apache.spark.sql.AnalysisException: nondeterministic expressions
>> are
>> > only allowed in
>> > Project, Filter, Aggregate or Window, found:*
>> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS
>> > DOUBLE)) - CAST(99L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt`
>> END =
>> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND
>> (c.`cur_flag` =
>> > 1))
>> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>> > cast(((rand(9) * cast(1000 as double)) - cast(99 as double)) as
>> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as
>> int)
>> > = 9)) && (cur_flag#77 = 1))
>> >   ;;
>> > GlobalLimit 10
>> > +- LocalLimit 10
>> >   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN
>> > (cast(19596 as string),cast(20134 as string),cast(10997 as string)) &&
>> > nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END],
>> > [date_id#7]
>> >  +- Filter (date_id#7 = 2017-07-12)
>> > +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>> > cast(((rand(9) * cast(1000 as double)) - cast(99 as double)) as
>> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as
>> int)
>> > = 9)) && (cur_flag#77 = 1))
>> >:- SubqueryAlias a
>> >:  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
>> >: +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`,
>> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7,
>> chanl_id#8L,
>> > pltfm_id#9, city_id#10, sessn_id#11, gu_id#12,
>> nav_refer_page_type_id#13,
>> > nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16,
>> > nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19,
>> nav_page_value#20,
>> > nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25,
>> nav_tcd#26,
>> > nav_tci#27, nav_tce#28, detl_refer_page_type_id#29,
>> > detl_refer_page_value#30, ... 33 more fields]
>> >+- SubqueryAlias c
>> >   +- SubqueryAlias dim_site_categ_ext
>> >  +- CatalogRelation `dw`.`dim_site_categ_ext`,
>> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
>> [site_categ_skid#64L,
>> > site_categ_type#65, site_categ_code#66, site_categ_name#67,
>> > site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L,
>> sort_seq#71L,
>> > site_categ_srch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L,
>> > updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L,
>> > site_categ_id#80, site_categ_parnt_id#81]
>> >
>> > Does spark sql not support syntax "case when" in JOIN?  Additional, my
>> spark
>> > version is 2.2.0.
>> > Any help would be greatly appreciated.
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context: http://apache-spark-developers
>> -list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-be-
>> supported-in-JOIN-tp21953.html
>> > Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


-- 

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com

[image: http://databricks.com] 



[image: Announcing Databricks Serverless. The first serverless data science
and big data platform. Watch the demo from Spark Summit 2017.]



Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-13 Thread Chang Chen
Hi Wenchen

Yes. We also find this error is caused by Rand. However, this is classic
way to solve data skew in Hive.  Is there any equivalent way in Spark?

Thanks
Chang

On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan  wrote:

> It’s not about case when, but about rand(). Non-deterministic expressions
> are not allowed in join condition.
>
> > On 13 Jul 2017, at 6:43 PM, wangshuang  wrote:
> >
> > I'm trying to execute hive sql on spark sql (Also on spark
> thriftserver), For
> > optimizing data skew, we use "case when" to handle null.
> > Simple sql as following:
> >
> >
> > SELECT a.col1
> > FROM tbl1 a
> > LEFT OUTER JOIN tbl2 b
> > ON
> > * CASE
> >   WHEN a.col2 IS NULL
> >   TNEN cast(rand(9)*1000 - 99 as string)
> >   ELSE
> >   a.col2 END *
> >   = b.col3;
> >
> >
> > But I get the error:
> >
> > == Physical Plan ==
> > *org.apache.spark.sql.AnalysisException: nondeterministic expressions
> are
> > only allowed in
> > Project, Filter, Aggregate or Window, found:*
> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS
> > DOUBLE)) - CAST(99L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt` END
> =
> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND (c.`cur_flag`
> =
> > 1))
> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
> > cast(((rand(9) * cast(1000 as double)) - cast(99 as double)) as
> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as
> int)
> > = 9)) && (cur_flag#77 = 1))
> >   ;;
> > GlobalLimit 10
> > +- LocalLimit 10
> >   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN
> > (cast(19596 as string),cast(20134 as string),cast(10997 as string)) &&
> > nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END],
> > [date_id#7]
> >  +- Filter (date_id#7 = 2017-07-12)
> > +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
> > cast(((rand(9) * cast(1000 as double)) - cast(99 as double)) as
> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as
> int)
> > = 9)) && (cur_flag#77 = 1))
> >:- SubqueryAlias a
> >:  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
> >: +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`,
> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7,
> chanl_id#8L,
> > pltfm_id#9, city_id#10, sessn_id#11, gu_id#12, nav_refer_page_type_id#13,
> > nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16,
> > nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19,
> nav_page_value#20,
> > nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25,
> nav_tcd#26,
> > nav_tci#27, nav_tce#28, detl_refer_page_type_id#29,
> > detl_refer_page_value#30, ... 33 more fields]
> >+- SubqueryAlias c
> >   +- SubqueryAlias dim_site_categ_ext
> >  +- CatalogRelation `dw`.`dim_site_categ_ext`,
> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
> [site_categ_skid#64L,
> > site_categ_type#65, site_categ_code#66, site_categ_name#67,
> > site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L,
> sort_seq#71L,
> > site_categ_srch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L,
> > updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L,
> > site_categ_id#80, site_categ_parnt_id#81]
> >
> > Does spark sql not support syntax "case when" in JOIN?  Additional, my
> spark
> > version is 2.2.0.
> > Any help would be greatly appreciated.
> >
> >
> >
> >
> > --
> > View this message in context: http://apache-spark-developers
> -list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-
> be-supported-in-JOIN-tp21953.html
> > Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-13 Thread Wenchen Fan
It’s not about case when, but about rand(). Non-deterministic expressions are 
not allowed in join condition.

> On 13 Jul 2017, at 6:43 PM, wangshuang  wrote:
> 
> I'm trying to execute hive sql on spark sql (Also on spark thriftserver), For
> optimizing data skew, we use "case when" to handle null.
> Simple sql as following:
> 
> 
> SELECT a.col1 
> FROM tbl1 a 
> LEFT OUTER JOIN tbl2 b 
> ON 
> * CASE 
>   WHEN a.col2 IS NULL 
>   TNEN cast(rand(9)*1000 - 99 as string) 
>   ELSE 
>   a.col2 END *
>   = b.col3;
> 
> 
> But I get the error:
> 
> == Physical Plan ==
> *org.apache.spark.sql.AnalysisException: nondeterministic expressions are
> only allowed in
> Project, Filter, Aggregate or Window, found:*
> (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS
> DOUBLE)) - CAST(99L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt` END =
> c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND (c.`cur_flag` =
> 1))
> in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
> cast(((rand(9) * cast(1000 as double)) - cast(99 as double)) as
> string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int)
> = 9)) && (cur_flag#77 = 1))
>   ;;
> GlobalLimit 10
> +- LocalLimit 10
>   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN
> (cast(19596 as string),cast(20134 as string),cast(10997 as string)) &&
> nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END],
> [date_id#7]
>  +- Filter (date_id#7 = 2017-07-12)
> +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
> cast(((rand(9) * cast(1000 as double)) - cast(99 as double)) as
> string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int)
> = 9)) && (cur_flag#77 = 1))
>:- SubqueryAlias a
>:  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
>: +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7, chanl_id#8L,
> pltfm_id#9, city_id#10, sessn_id#11, gu_id#12, nav_refer_page_type_id#13,
> nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16,
> nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19, nav_page_value#20,
> nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25, nav_tcd#26,
> nav_tci#27, nav_tce#28, detl_refer_page_type_id#29,
> detl_refer_page_value#30, ... 33 more fields]
>+- SubqueryAlias c
>   +- SubqueryAlias dim_site_categ_ext
>  +- CatalogRelation `dw`.`dim_site_categ_ext`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [site_categ_skid#64L,
> site_categ_type#65, site_categ_code#66, site_categ_name#67,
> site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L, sort_seq#71L,
> site_categ_srch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L,
> updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L,
> site_categ_id#80, site_categ_parnt_id#81]
> 
> Does spark sql not support syntax "case when" in JOIN?  Additional, my spark
> version is 2.2.0.
> Any help would be greatly appreciated.
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-be-supported-in-JOIN-tp21953.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org