Github user yinxusen commented on the pull request:
https://github.com/apache/spark/pull/268#issuecomment-39110100
@mengxr I am not quite sure what you mean by a sparse vector here. In your
example, do you mean the column is `Vector(1.0, 0.0, 2.0, 0.0, 3.0, 0.0, 0.0)`
or
`RDD(
Vector(1.0),
Vector(0.0),
Vector(2.0),
Vector(0.0),
Vector(3.0),
Vector(0.0),
Vector(0.0)
)`?
If it is case 1, then it is easy to rewrite it in O(nnz). Otherwise it
will be difficult, because we cannot tell whether a column is sparse
before we count its nnz. If case 1 is what you mean, then I think I should
treat sparse vectors differently from dense ones, with code like the following:
`rdd.first() match {
case _: DenseVector[Double] => xxx
case _: SparseVector[Double] => xxx
}`.
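
To make case 1 concrete, here is a minimal, dependency-free sketch of what I have in mind. The names `Vec`, `Dense`, `Sparse`, and `ColumnStats` are hypothetical stand-ins (not the Breeze or MLlib types), and a plain `Seq` stands in for the RDD. The point is only that matching on the row type lets the sparse branch touch the stored entries alone, so the work per row is O(nnz) rather than O(n).

```scala
// Hypothetical stand-ins for dense and sparse vector types, so the sketch
// runs without Spark or Breeze on the classpath.
sealed trait Vec { def size: Int }
final case class Dense(values: Array[Double]) extends Vec {
  def size: Int = values.length
}
// `indices` holds the positions of the stored entries, `values` their values.
final case class Sparse(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

object ColumnStats {
  // Adds one row into the running per-column sums and non-zero counts.
  def addRow(row: Vec, sums: Array[Double], nnz: Array[Long]): Unit = row match {
    case Dense(values) =>
      // Dense row: every position has to be visited, O(n).
      var i = 0
      while (i < values.length) {
        if (values(i) != 0.0) { sums(i) += values(i); nnz(i) += 1 }
        i += 1
      }
    case Sparse(_, indices, values) =>
      // Sparse row: only the stored entries are visited, O(nnz).
      var k = 0
      while (k < indices.length) {
        if (values(k) != 0.0) { sums(indices(k)) += values(k); nnz(indices(k)) += 1 }
        k += 1
      }
  }

  // Aggregates all rows; assumes every row has the same size as the first.
  def columnSumsAndNnz(rows: Seq[Vec]): (Array[Double], Array[Long]) = {
    val n = rows.headOption.map(_.size).getOrElse(0)
    val sums = new Array[Double](n)
    val nnz = new Array[Long](n)
    rows.foreach(addRow(_, sums, nnz))
    (sums, nnz)
  }
}
```

Under these assumptions, the vector from the example above would be stored as `Sparse(7, Array(0, 2, 4), Array(1.0, 2.0, 3.0))`, and `addRow` would visit only those three entries. In case 2, each row holds a single value, so there is nothing per row to skip and this dispatch does not help.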