[ https://issues.apache.org/jira/browse/SPARK-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-15660:
----------------------------------
Description:
In SPARK-11490, `variance/stdev` were redefined as the **sample**
`variance/stdev` instead of the population ones.
This issue addresses the one remaining legacy case, the RDD API, which still
returns the population values. This may be a breaking change, but we had better
be consistent in Spark 2.0 if possible.
{code}
scala> val ds = spark.createDataset(Seq(1.0, 2.0, 3.0))
ds: org.apache.spark.sql.Dataset[Double] = [value: double]
scala> ds.describe().show()
+-------+-----+
|summary|value|
+-------+-----+
|  count|    3|
|   mean|  2.0|
| stddev|  1.0|
|    min|  1.0|
|    max|  3.0|
+-------+-----+
scala> ds.rdd.stdev
res1: Double = 0.816496580927726
{code}
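The two numbers above differ only in the divisor: the population formula divides the sum of squared deviations by n, the sample formula by n - 1. A minimal sketch in plain Scala (no Spark needed) that reproduces both values for the same input; `StdevSketch` and its helper names are hypothetical and are not part of any Spark API:

```scala
// Plain Scala sketch; StdevSketch and its methods are illustrative names only.
object StdevSketch {
  // Sum of squared deviations from the mean.
  private def sumSquaredDev(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    xs.map(x => (x - mean) * (x - mean)).sum
  }

  // Population standard deviation: divide by n.
  def popStdev(xs: Seq[Double]): Double =
    math.sqrt(sumSquaredDev(xs) / xs.size)

  // Sample standard deviation: divide by n - 1 (needs at least two values).
  def sampleStdev(xs: Seq[Double]): Double =
    math.sqrt(sumSquaredDev(xs) / (xs.size - 1))

  def main(args: Array[String]): Unit = {
    val xs = Seq(1.0, 2.0, 3.0)
    println(s"population stdev = ${popStdev(xs)}")   // the value rdd.stdev reports
    println(s"sample stdev     = ${sampleStdev(xs)}") // the value describe() reports
  }
}
```

For Seq(1.0, 2.0, 3.0) the sum of squared deviations is 2.0, so the population value is sqrt(2/3), roughly 0.8165, and the sample value is sqrt(2/2) = 1.0, matching `ds.rdd.stdev` and `describe()` respectively.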
was:
In Spark-11490, `variance/stdev` are redefined as the **sample**
`variance/stdev` instead of population ones.
This issue addresses the only remaining legacy in RDD. This may cause breaking
changes, but we had better be consistent in Spark 2.0 if possible.
{code}
scala> val rdd = sc.parallelize(Seq(1.0, 2.0, 3.0))
rdd: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd.stdev
res0: Double = 0.816496580927726
scala> rdd.toDS().describe().show()
16/05/30 22:20:12 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/05/30 22:20:12 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
+-------+-----+
|summary|value|
+-------+-----+
|  count|    3|
|   mean|  2.0|
| stddev|  1.0|
|    min|  1.0|
|    max|  3.0|
+-------+-----+
{code}
> RDD and Dataset should show the consistent value for variance/stdev.
> --------------------------------------------------------------------
>
> Key: SPARK-15660
> URL: https://issues.apache.org/jira/browse/SPARK-15660
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Dongjoon Hyun
>
> In SPARK-11490, `variance/stdev` were redefined as the **sample**
> `variance/stdev` instead of the population ones.
> This issue addresses the one remaining legacy case, the RDD API, which still
> returns the population values. This may be a breaking change, but we had
> better be consistent in Spark 2.0 if possible.
> {code}
> scala> val ds = spark.createDataset(Seq(1.0, 2.0, 3.0))
> ds: org.apache.spark.sql.Dataset[Double] = [value: double]
> scala> ds.describe().show()
> +-------+-----+
> |summary|value|
> +-------+-----+
> |  count|    3|
> |   mean|  2.0|
> | stddev|  1.0|
> |    min|  1.0|
> |    max|  3.0|
> +-------+-----+
> scala> ds.rdd.stdev
> res1: Double = 0.816496580927726
> {code}