[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function

2017-03-09 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904397#comment-15904397
 ] 

Zhenhua Wang edited comment on SPARK-16283 at 3/10/17 4:09 AM:
---

[~erlu] I think it's been made clear from the above discussions, Spark's result 
doesn't have to be the same as Hive's result.


was (Author: zenwzh):
[~erlu] I think it's been made clear from above discussions, Spark' result 
doesn't have to be the same as Hive's result.

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sean Zhong
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function

2017-03-08 Thread chenerlu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901132#comment-15901132
 ] 

chenerlu edited comment on SPARK-16283 at 3/9/17 1:55 AM:
--

Hi, I am a little confused about percentile_approx. Is it different from Hive's 
now? Will we get a different result when the input is the same?

For example, I ran select percentile_approx(c4_double,array(0.1,0.2,0.3,0.4)) 
from test; and got a different result.

c4_double is shown below:
1.0001
2.0001
3.0001
4.0001
5.0001
6.0001
7.0001
8.0001
9.0001
NULL
-8.952
-96.0

Hive:
[-87.2952,-6.9615799,1.30009998,2.40010003]

spark 2.x:
[-8.952,1.0001,2.0001,3.0001]

So which result is right? Could you please reply when you are free?

[~rxin] [~lwlin]
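
For readers comparing the two result vectors above, here is a minimal sketch of two textbook percentile rules applied to the non-null values of c4_double. It is only an illustration, not the code of either engine: the function names nearest_rank and interpolated are hypothetical, and the rules were chosen because they happen to line up with the numbers reported in this comment.

```python
import math

# Non-null values of c4_double from the example above, sorted ascending.
values = sorted([1.0001, 2.0001, 3.0001, 4.0001, 5.0001, 6.0001,
                 7.0001, 8.0001, 9.0001, -8.952, -96.0])

def nearest_rank(sorted_vals, p):
    """Return the element whose 1-based rank is ceil(p * n), clamped to [1, n]."""
    n = len(sorted_vals)
    rank = min(max(math.ceil(p * n), 1), n)
    return sorted_vals[rank - 1]

def interpolated(sorted_vals, p):
    """Interpolate linearly at the fractional 1-based position p * n."""
    n = len(sorted_vals)
    pos = p * n
    lo = math.floor(pos)
    if lo < 1:
        return sorted_vals[0]
    if lo >= n:
        return sorted_vals[-1]
    frac = pos - lo
    return sorted_vals[lo - 1] + frac * (sorted_vals[lo] - sorted_vals[lo - 1])

ps = [0.1, 0.2, 0.3, 0.4]
print([nearest_rank(values, p) for p in ps])
print([round(interpolated(values, p), 4) for p in ps])
```

On this input the nearest-rank rule returns only values that occur in the column, as in the Spark 2.x output, while the interpolating rule returns values between neighbouring elements, close to the Hive output. Neither vector is "wrong"; they follow different definitions of an approximate percentile.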


was (Author: erlu):
Hi, I am little confused about percentile_approx, is it different from hive's 
now ? will we get different result when the input is same ?

for example, I run select percentile_approx(c4_double,array(0.1,0.2,0.3,0.4)) 
from test; and get different result.

c4_double is show below:
1.0001
2.0001
3.0001
4.0001
5.0001
6.0001
7.0001
8.0001
9.0001
NULL
-8.952
-96.0

Hive:
[-87.2952,-6.9615799,1.30009998,2.40010003]

spark 2.x:
[-8.952,1.0001,2.0001,3.0001]

so which result is right ? Could you pls reply me when you are free.

[~rxin] [~linwei]




[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function

2017-03-08 Thread chenerlu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901132#comment-15901132
 ] 

chenerlu edited comment on SPARK-16283 at 3/9/17 1:55 AM:
--

Hi, I am a little confused about percentile_approx. Is it different from Hive's 
now? Will we get a different result when the input is the same?

For example, I ran select percentile_approx(c4_double,array(0.1,0.2,0.3,0.4)) 
from test; and got a different result.

c4_double is shown below:
1.0001
2.0001
3.0001
4.0001
5.0001
6.0001
7.0001
8.0001
9.0001
NULL
-8.952
-96.0

Hive:
[-87.2952,-6.9615799,1.30009998,2.40010003]

spark 2.x:
[-8.952,1.0001,2.0001,3.0001]

So which result is right? Could you please reply when you are free?

[~rxin] [~linwei]


was (Author: erlu):
Hi, I am little confused about percentile_approx, is it different from hive's 
now ? will we get different result when the input is same ?

for example, I run select percentile_approx(c4_double,array(0.1,0.2,0.3,0.4)) 
from test; and get different result.

c4_double is show below:
1.0001
2.0001
3.0001
4.0001
5.0001
6.0001
7.0001
8.0001
9.0001
NULL
-8.952
-96.0

Hive:
[-87.2952,-6.9615799,1.30009998,2.40010003]

spark 2.x:
[-8.952,1.0001,2.0001,3.0001]

so which result is right ? Could you pls reply me when you are free.






[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function

2016-08-22 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431136#comment-15431136
 ] 

Sean Zhong edited comment on SPARK-16283 at 8/22/16 4:35 PM:
-

Created a sub-task SPARK-17188 to move QuantileSummaries to package 
org.apache.spark.sql.util of catalyst project


was (Author: clockfly):
Created a sub-task to move QuantileSummaries to package 
org.apache.spark.sql.util of catalyst project




[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function

2016-07-13 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374295#comment-15374295
 ] 

Liwei Lin edited comment on SPARK-16283 at 7/13/16 6:05 AM:


Hive's percentile_approx implementation computes approximate percentile values 
from a histogram (please refer to 
[Hive/GenericUDAFPercentileApprox.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java]
 and 
[Hive/NumericHistogram.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java]
 for details):
- Hive's percentile_approx's signature is: {{\_FUNC\_(expr, pc, \[nb\])}}
- parameter \[nb\] -- the number of histogram bins to use -- is optionally 
specified by users
- if the number of unique values in the actual dataset is less than or equal 
to this \[nb\], we can expect an exact result; otherwise there are no 
approximation guarantees
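
To make the role of \[nb\] concrete, here is a minimal sketch of a bounded histogram that merges its two closest bins whenever it exceeds its bin budget. This is an assumption-laden toy, not Hive's NumericHistogram: the function name build_histogram and the merge rule (count-weighted average of adjacent centers) are chosen for illustration only.

```python
def build_histogram(values, nb):
    """Keep at most nb [center, count] bins, merging the closest pair when over budget."""
    bins = []  # kept sorted by center
    for v in values:
        for b in bins:
            if b[0] == v:
                b[1] += 1  # value already has its own bin
                break
        else:
            bins.append([v, 1])
            bins.sort()
        if len(bins) > nb:
            # Find the adjacent pair of bins with the smallest gap between centers.
            i = min(range(len(bins) - 1),
                    key=lambda k: bins[k + 1][0] - bins[k][0])
            (c1, n1), (c2, n2) = bins[i], bins[i + 1]
            # Replace the pair with one bin at the count-weighted mean center.
            bins[i:i + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]]
    return bins

data = [1, 2, 2, 3, 3, 3]
exact = build_histogram(data, nb=3)   # 3 unique values fit in 3 bins
lossy = build_histogram(data, nb=2)   # one merge is forced
print(exact)
print(lossy)
```

With as many bins as unique values, every bin center is an actual data value, so percentiles read off the histogram can be exact. Once a merge happens, a center may no longer be a value from the dataset, which is one way to see why no error bound follows from \[nb\] alone.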


Our Dataset's approxQuantile() implementation is not really histogram-based 
(and thus differs from Hive's implementation):
- our Dataset's approxQuantile()'s signature is something like: 
{{\_FUNC\_(expr, pc, relativeError)}}
- parameter relativeError is specified by users and should be in \[0, 1\]; our 
approximation is deterministically bounded by this relativeError -- please refer 
to 
[Spark/DataFrameStatFunctions.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L39]
 for details
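
As a rough illustration of what "bounded by relativeError" means, the approxQuantile documentation states that for N elements, probability p and error err, the returned sample x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The sketch below only checks that inequality for a candidate value; the helper name within_relative_error is hypothetical and this is not Spark code.

```python
import math

def within_relative_error(sorted_values, x, p, relative_error):
    """Check floor((p - err) * N) <= rank(x) <= ceil((p + err) * N) for 1-based ranks."""
    n = len(sorted_values)
    lo = math.floor((p - relative_error) * n)
    hi = math.ceil((p + relative_error) * n)
    # A value may occur several times; any of its ranks may satisfy the bound.
    ranks = [i + 1 for i, v in enumerate(sorted_values) if v == x]
    return any(lo <= r <= hi for r in ranks)

data = list(range(1, 101))  # 100 sorted values, rank(v) == v
print(within_relative_error(data, 50, 0.5, 0.01))
print(within_relative_error(data, 60, 0.5, 0.01))
print(within_relative_error(data, 60, 0.5, 0.2))
```

The bound is stated on ranks rather than on values, so any element whose rank falls inside the window is an acceptable answer; a wider relativeError widens the window.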


Since there's no direct deterministic relationship between \[nb\] and 
relativeError, it seems hard to build Hive's percentile_approx on top of our 
Dataset's approxQuantile(). So should we: (a) port Hive's implementation into 
Spark, and provide {{\_FUNC\_(expr, pc, \[nb\])}} on top of it, or (b) provide 
{{\_FUNC\_(expr, pc, relativeError)}} directly on top of our Dataset's 
approxQuantile() implementation, even though this might be incompatible with 
Hive? [~rxin], [~thunterdb] could you share some thoughts? Thanks!


was (Author: proflin):
Hive's percentile_approx implementation computes approximate percentile values 
from a histogram (please refer to 
[Hive/GenericUDAFPercentileApprox.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java]
 and 
[Hive/NumericHistogram.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java]
 for details):
- Hive's percentile_approx's signature is: {{\_FUNC\_(expr, pc, \[nb\])}}
- parameter \[nb\] -- the number of histogram bins to use -- is optionally 
specified by users
- if the number of unique values in the actual dataset is less than or equals 
to this \[nb\], we can expect an exact result; otherwise there are no 
approximation guarantees


Our Dataset's approxQuantile() implementation is not really histogram-based 
(and thus differs from Hive's implementation):
- our Dataset's approxQuantile()'s signature is something like: 
{{\_FUNC\_(expr, pc, relativeError)}}
- parameter relativeError is specified by users and should be in \[0, 1\]; our 
approximation is deterministicly bounded by this relativeError -- please refer 
to 
[Spark/DataFrameStatFunctions.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L39]
 for details


Since there's no direct deterministic relationship between \[nb\] and 
relativeError, it seems hard to build Hive's percentile_approx on top of our 
Dataset's approxQuantile(). So, [~rxin], [~thunterdb], should we: (a) port 
Hive' implementation into Spark, and provide {{\_FUNC\_(expr, pc, \[nb\])}} on 
top of it, or (b) provide {{\_FUNC\_(expr, pc, relativeError)}} directly on top 
of our Dataset's approxQuantile() implementation, but this might be 
incompatible with Hive? Could you share some thoughts? Thanks !




[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function

2016-07-12 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374295#comment-15374295
 ] 

Liwei Lin edited comment on SPARK-16283 at 7/13/16 4:01 AM:


Hive's percentile_approx implementation computes approximate percentile values 
from a histogram (please refer to 
[Hive/GenericUDAFPercentileApprox.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java]
 and 
[Hive/NumericHistogram.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java]
 for details):
- Hive's percentile_approx's signature is: {{\_FUNC\_(expr, pc, \[nb\])}}
- parameter \[nb\] -- the number of histogram bins to use -- is optionally 
specified by users
- if the number of unique values in the actual dataset is less than or equal 
to this \[nb\], we can expect an exact result; otherwise there are no 
approximation guarantees


Our Dataset's approxQuantile() implementation is not really histogram-based 
(and thus differs from Hive's implementation):
- our Dataset's approxQuantile()'s signature is something like: 
{{\_FUNC\_(expr, pc, relativeError)}}
- parameter relativeError is specified by users and should be in \[0, 1\]; our 
approximation is deterministically bounded by this relativeError -- please refer 
to 
[Spark/DataFrameStatFunctions.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L39]
 for details


Since there's no direct deterministic relationship between \[nb\] and 
relativeError, it seems hard to build Hive's percentile_approx on top of our 
Dataset's approxQuantile(). So, [~rxin], [~thunterdb], should we: (a) port 
Hive's implementation into Spark, and provide {{\_FUNC\_(expr, pc, \[nb\])}} on 
top of it, or (b) provide {{\_FUNC\_(expr, pc, relativeError)}} directly on top 
of our Dataset's approxQuantile() implementation, even though this might be 
incompatible with Hive? Could you share some thoughts? Thanks!


was (Author: proflin):
Hive's percentile_approx implementation computes approximate percentile values 
from a histogram (please refer to 
[Hive/GenericUDAFPercentileApprox.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java]
 and 
[Hive/NumericHistogram.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java]
 for details):
- Hive's percentile_approx's signature is: {{\_FUNC\_(expr, pc, \[nb\])}}
- parameter \[nb\] -- the number of histogram bins to use -- is optionally 
specified by users
- if the number of unique values in the actual dataset is less than or equals 
to this \[nb\], we can expect an exact result; otherwise there are no 
approximation guarantees


Our Dataset's approxQuantile() implementation is not really histogram-based 
(and thus differs from Hive's implementation):
- our Dataset's approxQuantile()'s signature is something like: 
{{\_FUNC\_(expr, pc, relativeError)}}
- parameter relativeError is specified by users and should be in \[0, 1\]; our 
approximation is deterministicly bounded by this relativeError -- please refer 
to 
[Spark/DataFrameStatFunctions.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L39]
 for details


Since there's no direct deterministic relationship between \[nb\] and 
relativeError, it seems hard to build Hive's percentile_approx on top of our 
Dataset's approxQuantile(). So, [~rxin], [~thunterdb], should we: (a) port 
Hive' implementation into Spark, and provide {{\_FUNC\_(expr, pc, \[nb\])}} on 
top of it, or (b) provide {{\_FUNC\_(expr, pc, relativeError)}} directly on top 
of our Dataset's approxQuantile() implementation, but this might be 
incompatible with Hive? Thanks !
