Neil Dewar created SPARK-16468:
----------------------------------

             Summary: Confusing results when describe() used on DataFrame with 
chr columns
                 Key: SPARK-16468
                 URL: https://issues.apache.org/jira/browse/SPARK-16468
             Project: Spark
          Issue Type: Bug
          Components: SparkR
    Affects Versions: 1.6.1
         Environment: Databricks.com
            Reporter: Neil Dewar
            Priority: Minor


The describe() function returns statistical summaries on numeric columns of a 
DataFrame.  If the DataFrame contains columns of type chr, only the count, min 
and max stats are returned.

When a DataFrame contains a mixture of numeric and chr columns, the results 
become jumbled together.

Example:
sdfR <- createDataFrame(sqlContext, ToothGrowth)
collect(describe(sdfR))

Results:
   summary                len supp               dose
1   count                 60   60                 60
2    mean 18.813333333333336  1.1666666666666667
3  stddev  7.649315171887615  0.6288721857330792
4     min                4.2   OJ                0.5
5     max               33.9   VC                2.0

There appear to be two problems here:
(1) The mean and stddev values are not rounded in the columns that have valid 
values.
(2) Nothing in the output indicates that the supp column has no values in the 
mean and stddev rows, so the dose values can be misread as belonging to supp.
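
To illustrate what a clearer result could look like, here is a minimal sketch in 
base R that post-processes a local copy of the collected output. It is not a fix 
in SparkR itself. The data frame res is a hand-built stand-in with the values 
copied from the results above, tidy_cell is a hypothetical helper, and it is an 
assumption that the statistics arrive as strings with the missing supp cells 
as NA or "".

```r
# Local stand-in for collect(describe(sdfR)); values copied from the report.
# Assumption: every statistic is a string, and the missing supp cells
# arrive as NA or "".
res <- data.frame(
  summary = c("count", "mean", "stddev", "min", "max"),
  len  = c("60", "18.813333333333336", "7.649315171887615", "4.2", "33.9"),
  supp = c("60", NA, NA, "OJ", "VC"),
  dose = c("60", "1.1666666666666667", "0.6288721857330792", "0.5", "2.0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: round numeric-looking cells, leave other text as-is,
# and mark missing cells with an explicit "NA".
tidy_cell <- function(x, digits = 2) {
  if (is.na(x) || x == "") return("NA")
  num <- suppressWarnings(as.numeric(x))
  if (is.na(num)) x else format(round(num, digits))
}

# Apply the helper to every column except the summary labels.
tidy <- res
tidy[-1] <- lapply(res[-1], function(col) {
  vapply(col, tidy_cell, character(1), USE.NAMES = FALSE)
})
print(tidy)
```

With an explicit marker in the supp column, the mean and stddev rows can no 
longer be read as if the dose values belonged to supp.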
