Yigal Weinberger created SPARK-23535:
----------------------------------------

             Summary: MinMaxScaler return 0.5 for an all zero column
                 Key: SPARK-23535
                 URL: https://issues.apache.org/jira/browse/SPARK-23535
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.0.0
            Reporter: Yigal Weinberger


When MinMaxScaler is applied to a column that contains only zeros, the output is 0.5 for the entire column.

This is inconsistent with the sklearn implementation.

Steps to reproduce:
{code:python}
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors

# Third feature is all zeros to reproduce the issue.
dataFrame = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.1, 0.0]),),
    (1, Vectors.dense([2.0, 1.1, 0.0]),),
    (2, Vectors.dense([3.0, 10.1, 0.0]),)
], ["id", "features"])

scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")

# Compute summary statistics and generate a MinMaxScalerModel
scalerModel = scaler.fit(dataFrame)

# Rescale each feature to the range [min, max].
scaledData = scalerModel.transform(dataFrame)
print("Features scaled to range: [%f, %f]" % (scaler.getMin(), scaler.getMax()))
scaledData.select("features", "scaledFeatures").show()
{code}
{noformat}
Features scaled to range: [0.000000, 1.000000]
+--------------+--------------+
|      features|scaledFeatures|
+--------------+--------------+
| [1.0,0.1,0.0]| [0.0,0.0,0.5]|
| [2.0,1.1,0.0]| [0.5,0.1,0.5]|
|[3.0,10.1,0.0]| [1.0,1.0,0.5]|
+--------------+--------------+
{noformat}

 

vs. the sklearn implementation:
{code:python}
import numpy as np
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler(copy=False)
test = np.array([[1.0, 0.1, 0.0], [2.0, 1.1, 0.0], [3.0, 10.1, 0.0]])
print(mms.fit_transform(test))
{code}
 

Output:
{noformat}
[[ 0.   0.   0. ]
 [ 0.5  0.1  0. ]
 [ 1.   1.   0. ]]
{noformat}

 

 


