zhengruifeng created SPARK-21879:
------------------------------------

             Summary: Should Scalers handle NaN values?
                 Key: SPARK-21879
                 URL: https://issues.apache.org/jira/browse/SPARK-21879
             Project: Spark
          Issue Type: Question
          Components: ML
    Affects Versions: 2.3.0
            Reporter: zhengruifeng


The way the {{ML}} scalers handle {{NaN}} is somewhat unexpected. The current 
implementations of {{MinMaxScaler}}/{{MaxAbsScaler}}/{{StandardScaler}} all 
accept a dataset containing {{NaN}} in both {{fit}} and {{transform}} without 
raising an error.
Note that the values in the second column of the following dataframe are all 
{{NaN}}, and the fitted {{originalMin}}/{{originalMax}} in {{MinMaxScalerModel}} 
and {{maxAbs}} in {{MaxAbsScalerModel}} are wrong for that column.
{code}
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg.{Vector, Vectors}

scala> val data = Array(
     |       Vectors.dense(1, Double.NaN, Double.NaN, 2.0),
     |       Vectors.dense(2, Double.NaN, 0.0, 3.0),
     |       Vectors.dense(3, Double.NaN, 0.0, 1.0),
     |       Vectors.dense(6, Double.NaN, 2.0, Double.NaN)).zipWithIndex
data: Array[(org.apache.spark.ml.linalg.Vector, Int)] = 
Array(([1.0,NaN,NaN,2.0],0), ([2.0,NaN,0.0,3.0],1), ([3.0,NaN,0.0,1.0],2), 
([6.0,NaN,2.0,NaN],3))

scala> val df = data.toSeq.toDF("features", "id")
df: org.apache.spark.sql.DataFrame = [features: vector, id: int]

scala> val scaler = new MinMaxScaler().setInputCol("features").setOutputCol("scaled")
scaler: org.apache.spark.ml.feature.MinMaxScaler = minMaxScal_7634802f5c81
        
scala> val model = scaler.fit(df)
model: org.apache.spark.ml.feature.MinMaxScalerModel = minMaxScal_7634802f5c81

scala> model.originalMax
res1: org.apache.spark.ml.linalg.Vector = [6.0,-1.7976931348623157E308,2.0,3.0]

scala> model.originalMin
res2: org.apache.spark.ml.linalg.Vector = [1.0,1.7976931348623157E308,0.0,1.0]

scala> model.transform(df).select("scaled").collect
res3: Array[org.apache.spark.sql.Row] = Array([[0.0,NaN,NaN,0.5]], 
[[0.2,NaN,0.0,1.0]], [[0.4,NaN,0.0,0.0]], [[1.0,NaN,1.0,NaN]])



scala> val scaler2 = new MaxAbsScaler().setInputCol("features").setOutputCol("scaled")
scaler2: org.apache.spark.ml.feature.MaxAbsScaler = maxAbsScal_5d34fa818229

scala> val model2 = scaler2.fit(df)
model2: org.apache.spark.ml.feature.MaxAbsScalerModel = maxAbsScal_5d34fa818229

scala> model2.maxAbs
res4: org.apache.spark.ml.linalg.Vector = [6.0,1.7976931348623157E308,2.0,3.0]

scala> model2.transform(df).select("scaled").collect
res5: Array[org.apache.spark.sql.Row] = 
Array([[0.16666666666666666,NaN,NaN,0.6666666666666666]], 
[[0.3333333333333333,NaN,0.0,1.0]], [[0.5,NaN,0.0,0.3333333333333333]], 
[[1.0,NaN,1.0,NaN]])


scala> val scaler3 = new StandardScaler().setInputCol("features").setOutputCol("scaled")
scaler3: org.apache.spark.ml.feature.StandardScaler = stdScal_d8509095e860

scala> val model3 = scaler3.fit(df)
model3: org.apache.spark.ml.feature.StandardScalerModel = stdScal_d8509095e860

scala> model3.std
res11: org.apache.spark.ml.linalg.Vector = [2.160246899469287,NaN,NaN,NaN]

scala> model3.mean
res12: org.apache.spark.ml.linalg.Vector = [3.0,NaN,NaN,NaN]

scala> model3.transform(df).select("scaled").collect
res14: Array[org.apache.spark.sql.Row] = 
Array([[0.4629100498862757,NaN,NaN,NaN]], [[0.9258200997725514,NaN,NaN,NaN]], 
[[1.3887301496588271,NaN,NaN,NaN]], [[2.7774602993176543,NaN,NaN,NaN]])

{code}
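
One plausible explanation for the {{1.7976931348623157E308}} and 
{{-1.7976931348623157E308}} values above, sketched in plain Python rather than 
Spark code: if a running min/max starts from sentinel values and is updated 
with strict comparisons, {{NaN}} never replaces the sentinel, because every 
comparison involving {{NaN}} is false. The function {{naive_min_max}} and its 
initial values are assumptions made for illustration, not the actual summarizer 
implementation.

```python
import sys


def naive_min_max(values):
    """Track min/max with sentinel initial values and strict comparisons."""
    cur_min = sys.float_info.max    # assumed start value, mirrors Double.MaxValue
    cur_max = -sys.float_info.max   # assumed start value, mirrors -Double.MaxValue
    for v in values:
        # Any comparison involving NaN is False, so NaN never updates a bound.
        if v < cur_min:
            cur_min = v
        if v > cur_max:
            cur_max = v
    return cur_min, cur_max


nan = float("nan")
all_nan = naive_min_max([nan, nan, nan, nan])    # like the second column
some_nan = naive_min_max([nan, 0.0, 0.0, 2.0])   # like the third column
print(all_nan)   # sentinels survive untouched
print(some_nan)  # NaN is silently skipped
```

Under these assumptions an all-{{NaN}} column keeps both sentinels, while a 
column with only some {{NaN}} values gets its bounds from the remaining values, 
which matches the pattern in {{originalMin}}/{{originalMax}} above.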

I then tested the scalers in scikit-learn, and they all throw exceptions in both 
{{fit}} and {{transform}} when the input contains {{NaN}}.

{code}
import numpy as np

from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, RobustScaler,
                                   StandardScaler)

data = np.array([[-1, 2], [-0.5, 6], [0, np.nan], [1, 1.8]])

data2 = np.array([[-1, 2], [-0.5, 6], [0, 2.0], [1, 1.8]])

for scaler in [StandardScaler(), MinMaxScaler(), MaxAbsScaler(), 
RobustScaler()]:
    try:
        scaler.fit(data)  # data contains NaN
    except ValueError:
        print('{0}.fit fails'.format(scaler))
    model = scaler.fit(data2)  # fit on NaN-free data, then transform NaN data
    try:
        model.transform(data)
    except ValueError:
        print('{0}.transform fails'.format(scaler))
        
StandardScaler(copy=True, with_mean=True, with_std=True).fit fails
StandardScaler(copy=True, with_mean=True, with_std=True).transform fails
MinMaxScaler(copy=True, feature_range=(0, 1)).fit fails
MinMaxScaler(copy=True, feature_range=(0, 1)).transform fails
MaxAbsScaler(copy=True).fit fails
MaxAbsScaler(copy=True).transform fails
RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True).fit fails
RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True).transform fails
{code}

I think the handling of {{NaN}} should be kept in line with the implementation 
in scikit-learn: the scaled data are likely to be fed into 
classification/regression/clustering or other algorithms, and it would be 
dangerous if users were unaware of the {{NaN}} in the scaled data.

There may be two choices if we decide to change the behavior:
1. add validation for the input data and throw exceptions as scikit-learn 
does, and suggest that {{Imputer}} be used to handle {{NaN}} before scaling;
2. make the scalers support {{handleInvalid}}; we may further need to make 
{{MultivariateOnlineSummarizer}} and {{Summarizer}} support {{handleInvalid}}.
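
To make the two options concrete, here is a minimal sketch in plain Python. 
The function {{fit_min_max}} and its {{handle_invalid}} values ({{"error"}} and 
{{"skip"}}) are hypothetical and only illustrate the intended semantics; they 
are not an existing Spark API.

```python
import math


def fit_min_max(column, handle_invalid="error"):
    """Return (min, max) of a column under a hypothetical handle_invalid policy."""
    if handle_invalid not in ("error", "skip"):
        raise ValueError("unknown handle_invalid: {0}".format(handle_invalid))
    valid = [v for v in column if not math.isnan(v)]
    # Option 1: reject NaN up front, as scikit-learn does in the run above.
    if handle_invalid == "error" and len(valid) != len(column):
        raise ValueError("input contains NaN; impute or drop it before scaling")
    # Option 2 ("skip"): ignore NaN, but refuse to fit on an all-NaN column
    # (an assumed design choice for this sketch).
    if not valid:
        raise ValueError("no valid values left to fit on")
    return min(valid), max(valid)


fourth_column = [2.0, 3.0, 1.0, float("nan")]
skipped = fit_min_max(fourth_column, handle_invalid="skip")
print(skipped)
```

With {{"skip"}} the statistics come from the non-{{NaN}} values only; with 
{{"error"}} the same column is rejected, so the user has to decide explicitly 
how to treat the missing values.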

[~josephkb] [~yanboliang] [~srowen] [~mlnick]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
