[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

Sean Owen (JIRA) Wed, 24 Aug 2016 13:12:32 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435588#comment-15435588
 ]


Sean Owen commented on SPARK-17219:
-----------------------------------

Yes, those seem like the 3 options. Hm, I'm reluctant to introduce another set 
of choices here and would favor just being opinionated on this point. 

How about establishing an additional bucket for NaN? The question, I suppose, 
is how that relates to the requested number of buckets. I think it would be in 
addition, so requesting 10 buckets means potentially 11 come back.
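
To make the "in addition" idea concrete, here is a minimal sketch in plain 
Python. It is not Spark code: the function name bucket_index and the choice to 
put the NaN bucket at the last index are assumptions made only for 
illustration. The splits are the non-NaN values from the issue description.

```python
import bisect
import math

def bucket_index(value, splits):
    """Map a value to a bucket index.

    `splits` is a sorted list of boundaries starting with -inf and ending
    with +inf, so it defines len(splits) - 1 regular buckets.
    Hypothetical convention: NaN gets one extra bucket after the regular ones.
    """
    num_regular = len(splits) - 1
    if math.isnan(value):
        return num_regular  # the additional bucket, not counted in the request
    # bisect_right returns the position of the first boundary greater than value,
    # so subtracting 1 gives the bucket whose lower boundary is <= value.
    idx = bisect.bisect_right(splits, value) - 1
    # Clamp so that +inf itself falls in the last regular bucket.
    return min(max(idx, 0), num_regular - 1)

splits = [-math.inf, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, math.inf]
regular_buckets = len(splits) - 1      # buckets derived from the splits
total_buckets = regular_buckets + 1    # plus the extra NaN bucket
```

With these splits there are 8 regular buckets, and a NaN input lands in a 
ninth one, so the caller can get one more bucket than the splits alone imply.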

> QuantileDiscretizer does strange things with NaN values
> -------------------------------------------------------
>
>                 Key: SPARK-17219
>                 URL: https://issues.apache.org/jira/browse/SPARK-17219
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.2
>            Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?
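
The NaN splits in the description are what one would expect if NaN sorts after 
every other number and the quantiles are then read off the sorted column. The 
sketch below illustrates that with a naive exact quantile in plain Python. It 
is an assumption-laden illustration, not the algorithm QuantileDiscretizer 
uses: the function name quantile_splits, the index formula, and the sample ages 
are all made up for the example.

```python
import math

def quantile_splits(values, num_bins):
    """Naive quantile splits with -inf/+inf sentinels (illustrative only)."""
    # Assumed ordering: NaN is placed after every other number.
    ordered = sorted(
        values,
        key=lambda v: (math.isnan(v), 0.0 if math.isnan(v) else v),
    )
    n = len(ordered)
    # Pick num_bins - 1 interior boundaries at evenly spaced ranks.
    inner = [ordered[min(n - 1, (i * n) // num_bins)] for i in range(1, num_bins)]
    return [-math.inf] + inner + [math.inf]

nan = float("nan")
ages = [22.0, 38.0, 26.0, 35.0, nan, nan, nan, 54.0]  # hypothetical sample

# With NaNs left in, the upper quantiles fall inside the NaN run.
with_nan = quantile_splits(ages, 4)

# Dropping NaNs before fitting keeps the splits numeric.
without_nan = quantile_splits([a for a in ages if not math.isnan(a)], 4)
```

In this sample, the last interior split of with_nan is NaN, sitting just 
before the Infinity sentinel, while without_nan contains only ordinary 
numbers. Handling NaN separately at transform time would then still be needed 
for rows that contain it.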


