Barry Becker created SPARK-17219:
------------------------------------

             Summary: QuantileDiscretizer does strange things with NaN values
                 Key: SPARK-17219
                 URL: https://issues.apache.org/jira/browse/SPARK-17219
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 1.6.2
            Reporter: Barry Becker


How is the QuantileDiscretizer supposed to handle null values?
Actual nulls are not allowed, so I replace them with Double.NaN.
However, when you try to run the QuantileDiscretizer on a column that contains 
NaNs, it will create (possibly more than one) NaN split(s) before the final 
PositiveInfinity value.
I am using the attached Titanic CSV data and trying to bin the "age" column 
using the QuantileDiscretizer with 10 bins specified. The age column has a lot 
of null values.
These are the splits that I get:
{code}
-Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
{code}
Is that expected? It seems to imply that NaN is larger than any positive number 
and less than infinity.
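
A possible explanation, sketched here in plain Python rather than Spark: if the splits are picked at evenly spaced positions in a sorted copy of the data, and the sort order places NaN after every real number (as Java's Double.compare does), then the upper quantiles land on NaN whenever enough of the column is NaN. The helper name naive_quantile_splits and the sample ages are hypothetical, and this is a simplified model, not the actual QuantileDiscretizer code.

```python
import math


def naive_quantile_splits(values, num_bins):
    """Hypothetical helper: choose num_bins - 1 interior splits from sorted data."""
    # Assumption: mimic a total ordering in which NaN sorts after every number.
    ordered = sorted(
        values,
        key=lambda v: (math.isnan(v), 0.0 if math.isnan(v) else v),
    )
    n = len(ordered)
    # Take the value found at each evenly spaced position in the sorted data.
    interior = [ordered[(i * n) // num_bins] for i in range(1, num_bins)]
    # Add the sentinel infinities at both ends.
    return [-math.inf] + interior + [math.inf]


# Eight known ages plus four missing ones encoded as NaN (made-up sample).
ages = [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0] + [math.nan] * 4
splits = naive_quantile_splits(ages, 4)
print(splits)  # [-inf, 26.0, 38.0, nan, inf]
```

In this model the last interior split is NaN because a third of the sample is NaN, which matches the shape of the splits shown above.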
I'm not sure of the best way to handle nulls, but I think they need a bucket 
all their own. My suggestion would be to include an initial NaN split value 
that is always there, just like the sentinel Infinities are. If that were the 
case, then the splits for the example above might look like this:
{code}
NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
{code}
This does not seem great either, because a bucket that is [NaN, -Inf] doesn't 
make much sense. I am not sure whether the NaN bucket should count toward 
numBins. I do think it should always be there, though, in case future data has 
nulls even though the fit data did not. Thoughts?
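
To make the idea of a separate NaN bucket concrete, here is a minimal sketch in plain Python of one way it could work: the splits contain only real values plus the sentinel infinities, and NaN is routed to one extra bucket after the regular ones instead of appearing as a split. The function name bucket_index and the choice to put the NaN bucket last are assumptions for illustration, not existing Spark behavior.

```python
import bisect
import math


def bucket_index(value, splits):
    """Hypothetical helper: map a value to a bucket, with NaN in an extra bucket."""
    num_regular = len(splits) - 1  # buckets are [splits[i], splits[i + 1])
    if math.isnan(value):
        # Assumption: NaN always gets its own bucket, placed after the regular ones.
        return num_regular
    if value == splits[-1]:
        # The last regular bucket is closed on the right.
        return num_regular - 1
    # Index of the last split that is <= value.
    return bisect.bisect_right(splits, value) - 1


# The non-NaN splits from the example above.
splits = [-math.inf, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, math.inf]

print(bucket_index(10.0, splits))      # 0
print(bucket_index(15.0, splits))      # 1
print(bucket_index(50.0, splits))      # 7
print(bucket_index(math.nan, splits))  # 8
```

With this layout the NaN bucket exists even when the fit data contains no NaN, and it does not consume one of the quantile-based bins.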


