I see. I think I read the documentation a little too quickly :) My apologies.
Kind regards,
Joeri

________________________________________
From: Sean Owen [so...@cloudera.com]
Sent: 21 November 2016 21:32
To: Joeri Hermans; dev@spark.apache.org
Subject: Re: MinMaxScaler behaviour

It's a degenerate case, of course. 0, 0.5, and 1 all make about as much sense. Is there a strong convention elsewhere to use 0?

Min/max scaling is the wrong thing to do for a data set like this anyway. What you probably intend is to scale each image so that its max intensity is 1 and its min intensity is 0, but that's different. Scaling each pixel across all images doesn't make as much sense.

On Mon, Nov 21, 2016 at 8:26 PM Joeri Hermans <joeri.raymond.e.herm...@cern.ch> wrote:

Hi all,

I observed some weird behaviour while applying feature transformations with MinMaxScaler. More specifically, I was wondering whether this behaviour is intended and makes sense, especially because I explicitly defined min and max. Basically, I am preprocessing the MNIST dataset and scaling the features to the range [0, 1] using the following code:

# Clear the dataset in case you ran this cell before.
dataset = dataset.select("features", "label", "label_encoded")

# Apply MinMax normalization to the features.
scaler = MinMaxScaler(min=0.0, max=1.0, inputCol="features", outputCol="features_normalized")

# Compute summary statistics and generate the MinMaxScalerModel.
scaler_model = scaler.fit(dataset)

# Rescale each feature to the range [min, max].
dataset = scaler_model.transform(dataset)

The complete code is here: https://github.com/JoeriHermans/dist-keras/blob/development/examples/mnist.ipynb (Normalization section)

The original MNIST images are shown in original.png, whereas the processed images are shown in processed.png. Note the 0.5 artifacts.

I checked the source code of this particular estimator/transformer and found the following:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L191

According to the documentation:

$$
Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min
$$

For the case $E_{max} == E_{min}$, $Rescaled(e_i) = 0.5 * (max + min)$.

So basically, when the difference between E_{max} and E_{min} is 0, we assign 0.5 as the raw value. I am wondering whether this is helpful in any situation. Why not assign 0?

Kind regards,
Joeri
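
To make the degenerate case concrete, here is a minimal PySpark sketch (assuming Spark 2.x; the toy data is illustrative, not from the thread) in which the second feature is constant, so E_{max} == E_{min} for it and every transformed value of that feature becomes 0.5 * (max + min) = 0.5:

from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Three rows; the second feature is constant at 3.0 across all rows.
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 3.0]),),
     (Vectors.dense([2.0, 3.0]),),
     (Vectors.dense([4.0, 3.0]),)],
    ["features"])

scaler = MinMaxScaler(min=0.0, max=1.0,
                      inputCol="features", outputCol="features_normalized")
model = scaler.fit(df)

# The first feature is rescaled into [0, 1]; the constant second feature
# maps to 0.5 in every row. This is exactly what produces the 0.5 artifacts
# in the processed MNIST images: background pixels that are constant across
# the whole data set.
model.transform(df).show(truncate=False)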
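
And a sketch of the per-image scaling Sean describes: rescale each image by its own min and max, rather than each pixel position across all images. The rescale_image helper and the column names here are assumptions for illustration, not code from the thread:

import numpy as np
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

def rescale_image(v):
    # Map this image's own min intensity to 0 and its own max to 1.
    arr = v.toArray()
    lo, hi = float(arr.min()), float(arr.max())
    if hi == lo:
        # Degenerate image with a single intensity everywhere: map to 0.0.
        return Vectors.dense(np.zeros(len(arr)))
    return Vectors.dense((arr - lo) / (hi - lo))

rescale_udf = udf(rescale_image, VectorUDT())
dataset = dataset.withColumn("features_normalized", rescale_udf("features"))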