Tobi Bosede created SPARK-17001:
-----------------------------------

             Summary: Enable StandardScaler to standardize sparse vectors when 
withMean=True
                 Key: SPARK-17001
                 URL: https://issues.apache.org/jira/browse/SPARK-17001
             Project: Spark
          Issue Type: Improvement
    Affects Versions: 2.0.0, 1.6.1, 1.6.0, 1.5.1, 1.5.0, 1.4.1, 1.4.0
            Reporter: Tobi Bosede


StandardScaler does not allow the mean to be subtracted from sparse vectors. To 
keep the vector sparse, it only divides by the standard deviation.  
withMean=True should be the default behavior. It should apply an *offset if the 
vector is sparse and do ordinary subtraction if the vector is dense. That way 
the default behavior of StandardScaler is always what is generally understood 
to be standardization, and people no longer think they are standardizing when 
they are not. So that the data still fits in memory, we want to avoid simply 
converting the sparse vector to a dense one.
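
To make the problem concrete, here is a minimal sketch in plain Python (not 
Spark code). The toy column and the choice of sample standard deviation are 
assumptions made only for illustration. It shows that dividing by the standard 
deviation leaves zeros as zeros, while subtracting the mean turns every zero 
into a nonzero value.

```python
from statistics import mean, stdev

# Hypothetical toy data: one feature observed across five rows, mostly zeros.
column = [0.0, 3.0, 0.0, 0.0, 7.0]

mu = mean(column)      # 2.0 for this toy column
sigma = stdev(column)  # sample standard deviation (an assumption for this sketch)

# Scaling only: zero entries stay zero, so sparsity is preserved.
scaled_only = [x / sigma for x in column]

# Full standardization: every zero becomes -mu / sigma, so sparsity is lost.
standardized = [(x - mu) / sigma for x in column]


def count_nonzero(values):
    """Number of entries a sparse format would have to store explicitly."""
    return sum(1 for v in values if v != 0.0)


print(count_nonzero(scaled_only), count_nonzero(standardized))
```

In this toy case the scaled-only column still has two nonzero entries, while 
the standardized column has five.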
*What is meant by "offset":
Imagine a sparse vector 1:3 3:7 (index:value pairs, zero-based), which 
conceptually represents 0 3 0 7. Imagine it also stores an offset that applies 
to all elements. If the offset is -2, the vector now represents -2 1 -2 5, yet 
this takes just one extra value to store. The offset only helps with storage of 
a shifted sparse vector; iterating over it still typically requires visiting 
all elements, not just the stored ones. 
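
The offset idea above can be sketched as follows. This is a hypothetical 
illustration in plain Python, not an existing Spark class; the name 
OffsetSparseVector and its methods are made up for this example.

```python
class OffsetSparseVector:
    """Hypothetical sparse vector with one offset applied to every element."""

    def __init__(self, size, entries, offset=0.0):
        self.size = size
        self.entries = dict(entries)  # index -> stored value, before the offset
        self.offset = offset

    def __getitem__(self, i):
        if not 0 <= i < self.size:
            raise IndexError(i)
        # Missing indices are implicit zeros; the offset shifts every element.
        return self.entries.get(i, 0.0) + self.offset

    def to_dense(self):
        # Materializing the shifted vector touches all `size` elements.
        return [self[i] for i in range(self.size)]

    def stored_values(self):
        # Explicit entries plus the single extra offset value.
        return len(self.entries) + 1


# The example from the text: 1:3 3:7 with an offset of -2.
v = OffsetSparseVector(4, {1: 3.0, 3: 7.0}, offset=-2.0)
print(v.to_dense())       # [-2.0, 1.0, -2.0, 5.0]
print(v.stored_values())  # 3
```

Note that this sketch uses a single scalar offset for the whole vector, exactly 
as in the description; how that would map onto per-feature means is left open 
here.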




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
