Neil Dewar created SPARK-16468:
----------------------------------

             Summary: Confusing results when describe() used on DataFrame with chr columns
                 Key: SPARK-16468
                 URL: https://issues.apache.org/jira/browse/SPARK-16468
             Project: Spark
          Issue Type: Bug
          Components: SparkR
    Affects Versions: 1.6.1
         Environment: Databricks.com
            Reporter: Neil Dewar
            Priority: Minor

The describe() function returns statistical summaries of the numeric columns of a DataFrame. If the DataFrame contains columns of type chr, only the count, min and max statistics are returned for those columns. When a DataFrame contains a mixture of numeric and chr columns, the results become jumbled together.

Example:

    sdfR <- createDataFrame(sqlContext, ToothGrowth)
    collect(describe(sdfR))

Results:

      summary                len supp               dose
    1   count                 60   60                 60
    2    mean 18.813333333333336      1.1666666666666667
    3  stddev  7.649315171887615      0.6288721857330792
    4     min                4.2   OJ                0.5
    5     max               33.9   VC                2.0

There appear to be two problems here:
(1) The mean and stddev values are not rounded in the columns that do have valid values.
(2) Nothing marks the mean and stddev cells of the supp column as missing, so it is hard to tell that this column has no values in those rows.
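
As a possible workaround on the R side, the collected result could be tidied after the fact. The following is only a sketch in base R, not a SparkR fix: `tidy_describe` and its `digits` argument are hypothetical names, and the stand-in data.frame assumes that collect() returns every column as character, with the missing supp statistics arriving as empty strings (they might arrive as NA instead, which the helper also tolerates).

```r
# Stand-in for collect(describe(sdfR)); values copied from the report above.
# Assumption: all columns are character, and the missing supp statistics
# are empty strings.
res <- data.frame(
  summary = c("count", "mean", "stddev", "min", "max"),
  len     = c("60", "18.813333333333336", "7.649315171887615", "4.2", "33.9"),
  supp    = c("60", "", "", "OJ", "VC"),
  dose    = c("60", "1.1666666666666667", "0.6288721857330792", "0.5", "2.0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of SparkR): round numeric cells and turn
# empty cells into explicit NA values.
tidy_describe <- function(df, digits = 2) {
  stat_cols <- setdiff(names(df), "summary")
  df[stat_cols] <- lapply(df[stat_cols], function(col) {
    col[!is.na(col) & col == ""] <- NA        # make missing cells explicit
    num <- suppressWarnings(as.numeric(col))  # NA where a cell is not numeric
    is_num <- !is.na(num)
    col[is_num] <- as.character(round(num[is_num], digits))
    col                                       # non-numeric cells (OJ, VC) are kept
  })
  df
}

tidy <- tidy_describe(res)
print(tidy)
```

With this stand-in data, the mean of len is shown as 18.81 and the mean and stddev cells of supp become NA, which addresses both points above on the client side only. Another option may be to pass just the numeric column names to describe(), for example describe(sdfR, "len", "dose"), so that chr columns never enter the summary.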