[ https://issues.apache.org/jira/browse/SPARK-20711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhengruifeng updated SPARK-20711:
---------------------------------
    Description: 
{code}
scala> val summarizer = new MultivariateOnlineSummarizer()
summarizer: org.apache.spark.mllib.stat.MultivariateOnlineSummarizer = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
scala> summarizer.add(Vectors.dense(Double.NaN, -10.0))
res20: summarizer.type = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
scala> summarizer.add(Vectors.dense(Double.NaN, 2.0))
res21: summarizer.type = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
scala> summarizer.min
res22: org.apache.spark.mllib.linalg.Vector = [1.7976931348623157E308,-10.0]
scala> summarizer.max
res23: org.apache.spark.mllib.linalg.Vector = [-1.7976931348623157E308,2.0]
{code}
For a feature only containing {{Double.NaN}}, the returned max is {{Double.MinValue}} and the min is {{Double.MaxValue}}.

{code}
import org.apache.spark.ml.stat._
val df = spark.createDataFrame(Seq(
  (1, 2.3, Vectors.dense(Double.NaN, 0.0, 0.0)),
  (2, 6.7, Vectors.dense(Double.NaN, 0.2, 0.0))
)).toDF("id", "num", "features")
df.select(Summarizer.metrics("mean").summary(col("features"))).head
res2: org.apache.spark.sql.Row = [[WrappedArray(NaN, 0.1, 0.0)]]
df.select(Summarizer.metrics("min").summary(col("features"))).head
res3: org.apache.spark.sql.Row = [[WrappedArray(1.7976931348623157E308, 0.0, 0.0)]]
{code}

  was:
{code}
scala> val summarizer = new MultivariateOnlineSummarizer()
summarizer: org.apache.spark.mllib.stat.MultivariateOnlineSummarizer = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
scala> summarizer.add(Vectors.dense(Double.NaN, -10.0))
res20: summarizer.type = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
scala> summarizer.add(Vectors.dense(Double.NaN, 2.0))
res21: summarizer.type = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
scala> summarizer.min
res22: org.apache.spark.mllib.linalg.Vector = [1.7976931348623157E308,-10.0]
scala> summarizer.max
res23: org.apache.spark.mllib.linalg.Vector = [-1.7976931348623157E308,2.0]
{code}
For a feature only containing {{Double.NaN}}, the returned max is {{Double.MinValue}} and the min is {{Double.MaxValue}}.


> MultivariateOnlineSummarizer incorrect min/max for NaN value
> ------------------------------------------------------------
>
>                 Key: SPARK-20711
>                 URL: https://issues.apache.org/jira/browse/SPARK-20711
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: zhengruifeng
>            Priority: Minor
>
> {code}
> scala> val summarizer = new MultivariateOnlineSummarizer()
> summarizer: org.apache.spark.mllib.stat.MultivariateOnlineSummarizer = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, -10.0))
> res20: summarizer.type = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, 2.0))
> res21: summarizer.type = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.min
> res22: org.apache.spark.mllib.linalg.Vector = [1.7976931348623157E308,-10.0]
> scala> summarizer.max
> res23: org.apache.spark.mllib.linalg.Vector = [-1.7976931348623157E308,2.0]
> {code}
> For a feature only containing {{Double.NaN}}, the returned max is {{Double.MinValue}} and the min is {{Double.MaxValue}}.
> {code}
> import org.apache.spark.ml.stat._
> val df = spark.createDataFrame(Seq(
>   (1, 2.3, Vectors.dense(Double.NaN, 0.0, 0.0)),
>   (2, 6.7, Vectors.dense(Double.NaN, 0.2, 0.0))
> )).toDF("id", "num", "features")
> df.select(Summarizer.metrics("mean").summary(col("features"))).head
> res2: org.apache.spark.sql.Row = [[WrappedArray(NaN, 0.1, 0.0)]]
> df.select(Summarizer.metrics("min").summary(col("features"))).head
> res3: org.apache.spark.sql.Row = [[WrappedArray(1.7976931348623157E308, 0.0, 0.0)]]
> {code}
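
The reported values are what one would expect if the running min and max start at sentinel values and are updated only through ordered comparisons. This is a minimal sketch of that mechanism, not the Spark source: the object NaNMinMaxSketch, the case class MinMax and the helper runningMinMax are hypothetical names introduced for illustration, and the assumption that the summarizer tracks its extremes this way is not confirmed by the report. In Scala, any `<` or `>` comparison involving Double.NaN evaluates to false, so a NaN-only feature never replaces the initial sentinels.

```scala
// Hypothetical illustration only; this is not the code used by
// MultivariateOnlineSummarizer or Summarizer.
object NaNMinMaxSketch {
  final case class MinMax(min: Double, max: Double)

  // Tracks a running min/max starting from sentinel values, updating
  // only when an ordered comparison succeeds.
  def runningMinMax(values: Seq[Double]): MinMax = {
    var currMin = Double.MaxValue
    var currMax = Double.MinValue
    for (v <- values) {
      // Both comparisons are false when v is NaN, so NaN is silently skipped.
      if (v < currMin) currMin = v
      if (v > currMax) currMax = v
    }
    MinMax(currMin, currMax)
  }

  def main(args: Array[String]): Unit = {
    // NaN-only feature: the sentinels come back unchanged.
    println(runningMinMax(Seq(Double.NaN, Double.NaN)))
    // prints MinMax(1.7976931348623157E308,-1.7976931348623157E308)

    // Ordinary feature: behaves as expected.
    println(runningMinMax(Seq(-10.0, 2.0)))
    // prints MinMax(-10.0,2.0)
  }
}
```

Under this assumption, the first column of the report (NaN, NaN) yields exactly the Double.MaxValue / Double.MinValue pair shown in res22 and res23, while the second column (-10.0, 2.0) is summarized correctly.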