[
https://issues.apache.org/jira/browse/SPARK-17001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15422511#comment-15422511
]
Sean Owen commented on SPARK-17001:
-----------------------------------
Yeah, that sounds like how it works now. StandardScaler will actually throw an
error if asked to center sparse data. The problem is that data is sometimes
represented as sparse because that is smaller than the dense
representation, not necessarily because the dense representation is too large
to work with. In particular, VectorAssembler will output small sparse vectors
if there are enough 0s, and that means it can't be used with StandardScaler
with centering, even if densifying the vectors would be perfectly fine.
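To make the "sparse because it's smaller" point concrete, here is a minimal
sketch in plain Python. It does not use Spark, and the byte costs and the
pick_representation helper are assumptions made up for illustration; the exact
rule VectorAssembler applies is not stated in this thread.

```python
def dense_bytes(size):
    # Assumed cost: one 8-byte double per entry.
    return 8 * size


def sparse_bytes(nnz):
    # Assumed cost: a 4-byte index plus an 8-byte double per nonzero entry.
    return 12 * nnz


def pick_representation(values):
    """Return whichever encoding is smaller under the assumed costs.

    Hypothetical helper: ("sparse", size, indices, nonzeros) or ("dense", values).
    """
    nonzero = [(i, v) for i, v in enumerate(values) if v != 0.0]
    if sparse_bytes(len(nonzero)) < dense_bytes(len(values)):
        indices = [i for i, _ in nonzero]
        nonzeros = [v for _, v in nonzero]
        return ("sparse", len(values), indices, nonzeros)
    return ("dense", list(values))


# Two rows of the same small width end up with different encodings,
# purely because of how many zeros each one happens to contain.
small_row = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0]
mixed_row = [1.0, 2.0, 0.0, 4.0, 5.0, 6.0]
print(pick_representation(small_row)[0])
print(pick_representation(mixed_row)[0])
```

Both rows are only six entries wide, so the dense form of either would be
trivially cheap; the sparse encoding here says nothing about the data being
too large to densify.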
My attitude is that the user should be able to opt in to this behavior if
desired. Yes, it would potentially cause a job to fail if you centered massive
sparse vectors, but that at least will be a fairly clear error. It seems better
to potentially allow that than to make StandardScaler unable to do centering in
the common case.
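As a rough sketch of the opt-in behavior proposed above, again in plain Python
rather than Spark code: the center function and its with_mean flag are
hypothetical names chosen to mirror withMean, and this is not how
StandardScaler is implemented.

```python
def center(size, indices, values, mean, with_mean=False):
    """Optionally subtract a per-feature mean from a sparse vector.

    Hypothetical helper for illustration only. The sparse input is given as
    (size, indices, values). Centering is opt-in; when requested, the result
    is necessarily dense, because every implicit zero becomes 0 - mean[i].
    """
    if not with_mean:
        # No centering requested: the sparse structure is preserved.
        return ("sparse", size, list(indices), list(values))
    if len(mean) != size:
        raise ValueError("mean must have one entry per feature")
    # Start from 0 - mean for every position, then add the stored nonzeros.
    dense = [0.0 - m for m in mean]
    for i, v in zip(indices, values):
        dense[i] += v
    return ("dense", dense)


mean = [0.5, 0.5, 1.0, 0.0]
kept_sparse = center(4, [2], [3.0], mean)
centered = center(4, [2], [3.0], mean, with_mean=True)
print(kept_sparse)
print(centered)
```

The memory cost of the dense result grows with the vector size rather than
with the number of nonzeros, which is why centering very wide sparse vectors
could fail, and why the sketch only densifies when the caller asks for it.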
> Enable standardScaler to standardize sparse vectors when withMean=True
> ----------------------------------------------------------------------
>
> Key: SPARK-17001
> URL: https://issues.apache.org/jira/browse/SPARK-17001
> Project: Spark
> Issue Type: Improvement
> Affects Versions: 2.0.0
> Reporter: Tobi Bosede
> Priority: Minor
>
> When withMean = true, StandardScaler will not handle sparse vectors, and
> instead throw an exception. This is presumably because subtracting the mean
> makes a sparse vector dense, and this can be undesirable.
> However, VectorAssembler generates vectors that may be a mix of sparse and
> dense, even when vectors are smallish, depending on their values. It's common
> to feed this into StandardScaler, but it would fail sometimes depending on
> the input if withMean = true. This is kind of surprising.
> StandardScaler should go ahead and operate on sparse vectors and subtract the
> mean, if explicitly asked to do so with withMean, on the theory that the user
> knows what he/she is doing, and there is otherwise no way to make this work.