[ https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553503#comment-15553503 ]
Sean Owen commented on SPARK-17219:
-----------------------------------

I could go this way too. I ended up sympathizing with trying to return a result rather than refusing. To the extent NaN is a corner case, the behavior in this corner case isn't going to impact mainstream usage. And you can ignore / reject NaN before calling this if you want. Stare decisis, I figure, unless anyone feels strongly about it. It's only in master at this point, and unreleased.

> QuantileDiscretizer does strange things with NaN values
> -------------------------------------------------------
>
>                 Key: SPARK-17219
>                 URL: https://issues.apache.org/jira/browse/SPARK-17219
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.2
>            Reporter: Barry Becker
>            Assignee: Vincent
>             Fix For: 2.1.0
>
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that
> contains NaNs, it will create (possibly more than one) NaN split(s) before
> the final PositiveInfinity value.
> I am using the attached titanic csv data and trying to bin the "age" column
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket
> all their own. My suggestion would be to include an initial NaN split value
> that is always there, just like the sentinel Infinities are. If that were the
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I
> do think it should always be there though in case future data has null even
> though the fit data did not. Thoughts?
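
The comment above notes that callers can ignore or reject NaN before calling the discretizer, and the report suggests giving NaN a bucket of its own. A minimal plain-Python sketch of both ideas follows. It is not Spark's implementation: `quantile_splits` and `bucket` are hypothetical helpers, the nearest-rank choice of cut points is an assumption made for illustration, and the ages are made-up sample values rather than the Titanic data.

```python
import math
from bisect import bisect_right


def quantile_splits(values, num_bins):
    """Hypothetical helper: equal-frequency splits computed over non-NaN values only."""
    # Drop NaN up front so it can never be picked as a split point.
    clean = sorted(v for v in values if not math.isnan(v))
    if not clean:
        return [-math.inf, math.inf]
    inner = []
    for i in range(1, num_bins):
        # Assumed nearest-rank style pick of the i-th cut point.
        idx = min(len(clean) - 1, (i * len(clean)) // num_bins)
        cut = clean[idx]
        # Keep the split list strictly increasing.
        if not inner or cut > inner[-1]:
            inner.append(cut)
    # Sentinel infinities bracket the data-derived splits.
    return [-math.inf] + inner + [math.inf]


def bucket(value, splits):
    """Hypothetical helper: bucket index for value; NaN goes to one extra, final bucket."""
    num_regular = len(splits) - 1
    if math.isnan(value):
        return num_regular
    # Buckets are [splits[i], splits[i+1]); the last regular bucket also holds +inf.
    return min(bisect_right(splits, value) - 1, num_regular - 1)


# Made-up sample ages with missing entries encoded as NaN.
ages = [22.0, math.nan, 38.0, 26.0, math.nan, 35.0, 54.0, 2.0]
splits = quantile_splits(ages, 3)
print(splits)
print([bucket(a, splits) for a in ages])
```

In this sketch the NaN bucket sits outside the split list, so it does not count toward the requested number of bins and it exists even when the fitting data contained no NaN. Whether that matches the behavior eventually chosen for QuantileDiscretizer is not something this thread settles.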