I see. I think I read the documentation a little too quickly :) My apologies.
Kind regards,
Joeri

________________________________________
From: Sean Owen [so...@cloudera.com]
Sent: 21 November 2016 21:32
To: Joeri Hermans; dev@spark.apache.org
Subject: Re: MinMaxScaler behaviour

It's a degenerate case, of course. 0, 0.5, and 1 all make about as much sense. Is there a strong convention elsewhere to use 0?

Min/max scaling is the wrong thing to do for a data set like this anyway. What you probably intend is to scale each image so that its max intensity is 1 and its min intensity is 0, but that's different. Scaling each pixel across all images doesn't make as much sense.

On Mon, Nov 21, 2016 at 8:26 PM Joeri Hermans <joeri.raymond.e.herm...@cern.ch> wrote:

Hi all,

I observed some weird behaviour while applying feature transformations with MinMaxScaler. More specifically, I was wondering whether this behaviour is intended and makes sense, especially because I explicitly defined min and max. Basically, I am preprocessing the MNIST dataset and scaling the features to the range [0, 1] using the following code:

# Clear the dataset in case you ran this cell before.
dataset = dataset.select("features", "label", "label_encoded")

# Apply MinMax normalization to the features.
scaler = MinMaxScaler(min=0.0, max=1.0, inputCol="features", outputCol="features_normalized")

# Compute summary statistics and generate the MinMaxScalerModel.
scaler_model = scaler.fit(dataset)

# Rescale each feature to the range [min, max].
dataset = scaler_model.transform(dataset)

The complete code is here: https://github.com/JoeriHermans/dist-keras/blob/development/examples/mnist.ipynb (Normalization section)

The original MNIST images are shown in original.png, whereas the processed images are shown in processed.png. Note the 0.5 artifacts.

I checked the source code of this particular estimator/transformer and found the following:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L191

According to the documentation:

$$
Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min
$$

For the case $E_{max} == E_{min}$, $Rescaled(e_i) = 0.5 * (max + min)$.

So basically, when the difference between E_{max} and E_{min} is 0, we assign 0.5 as the raw value. I am wondering whether this is helpful in any situation. Why not assign 0?

Kind regards,
Joeri
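
To make the degenerate case concrete, here is a minimal PySpark sketch (assuming Spark 2.x; the toy data is illustrative, not from the thread) in which the second feature is constant, so E_{max} == E_{min} for it and every transformed value of that feature becomes 0.5 * (max + min) = 0.5:

from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Three rows; the second feature is constant at 3.0 across all rows.
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 3.0]),),
     (Vectors.dense([2.0, 3.0]),),
     (Vectors.dense([4.0, 3.0]),)],
    ["features"])

scaler = MinMaxScaler(min=0.0, max=1.0,
                      inputCol="features", outputCol="features_normalized")
model = scaler.fit(df)

# The first feature is rescaled into [0, 1]; the constant second feature
# maps to 0.5 in every row. This is exactly what produces the 0.5 artifacts
# in the processed MNIST images: background pixels that are constant across
# the whole data set.
model.transform(df).show(truncate=False)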
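
And a sketch of the per-image scaling Sean describes: rescale each image by its own min and max, rather than each pixel position across all images. The rescale_image helper and the column names here are assumptions for illustration, not code from the thread:

import numpy as np
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

def rescale_image(v):
    # Map this image's own min intensity to 0 and its own max to 1.
    arr = v.toArray()
    lo, hi = float(arr.min()), float(arr.max())
    if hi == lo:
        # Degenerate image with a single intensity everywhere: map to 0.0.
        return Vectors.dense(np.zeros(len(arr)))
    return Vectors.dense((arr - lo) / (hi - lo))

rescale_udf = udf(rescale_image, VectorUDT())
dataset = dataset.withColumn("features_normalized", rescale_udf("features"))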