Hi all,
I observed some unexpected behaviour while applying feature transformations
using MinMaxScaler, and I am wondering whether it is intended and makes sense,
especially because I explicitly defined min and max.
I am preprocessing the MNIST dataset and scaling the features to the range
[0, 1] using the following code:
from pyspark.ml.feature import MinMaxScaler

# Reset the dataset in case this cell was run before.
dataset = dataset.select("features", "label", "label_encoded")
# Apply MinMax normalization to the features.
scaler = MinMaxScaler(min=0.0, max=1.0, inputCol="features",
                      outputCol="features_normalized")
# Compute summary statistics and generate MinMaxScalerModel.
scaler_model = scaler.fit(dataset)
# Rescale each feature to range [min, max].
dataset = scaler_model.transform(dataset)
Complete code is here:
https://github.com/JoeriHermans/dist-keras/blob/development/examples/mnist.ipynb
(Normalization section)
The original MNIST images are shown in original.png, whereas the processed
images are shown in processed.png. Note the 0.5 artifacts. I checked the source
code of this particular estimator / transformer and found the following:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L191
According to the documentation:
* <p><blockquote>
* $$
* Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min
* $$
* </blockquote></p>
*
* For the case $E_{max} == E_{min}$, $Rescaled(e_i) = 0.5 * (max + min)$.
So basically, when the difference between E_{max} and E_{min} is 0, every value
of that feature is set to 0.5 * (max + min), which is 0.5 for my range of
[0, 1]. I am wondering if this is helpful in any situation? Why not assign 0?
Kind regards,
Joeri