Github user itg-abby commented on the issue:
https://github.com/apache/spark/pull/15496
It is possible that I am missing something, or that I have unintentionally
obscured the purpose of this pull request. I will try summarizing my
understanding and intent to see if it sheds any light:
DenseVector allows calls to numpy directly (e.g. DenseVector.mean()) and
always stores the array values in the object attribute DenseVector.array. This
allows a lot of neat numpy functions to be run on the array values without
any trouble.
SparseVector works differently: it never stores the full set of values as a
full array. Instead, it uses a 'trick' that only searches the non-zero
index/value pairs when a specific entry is asked for (this can be found in the
__getitem__ method of SparseVector). This prevents numpy functions from being
usable on the SparseVector, since there is no actual array to operate on
directly. However, a conversion function is provided: toArray().
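The lookup 'trick' described above can be sketched in plain Python. Note that
ToySparseVector is a hypothetical stand-in written for illustration, not the
PySpark class; it only assumes a sorted list of non-zero indices and a
parallel list of values.

```python
from bisect import bisect_left


class ToySparseVector:
    """Hypothetical stand-in for a sparse vector; not the PySpark class."""

    def __init__(self, size, indices, values):
        self.size = size
        self.indices = list(indices)  # sorted positions of the stored entries
        self.values = list(values)    # values held at those positions

    def __getitem__(self, i):
        if not -self.size <= i < self.size:
            raise IndexError("index out of bounds")
        if i < 0:
            i += self.size
        # Binary search over the stored indices only; no dense array is built.
        pos = bisect_left(self.indices, i)
        if pos < len(self.indices) and self.indices[pos] == i:
            return self.values[pos]
        return 0.0  # anything not stored is an implicit zero

    def toArray(self):
        # Explicit conversion: materialize every entry, zeros included.
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense


sv = ToySparseVector(5, [1, 3], [2.0, 4.0])
print(sv[1], sv[2])   # a stored entry, then an implicit zero
print(sv.toArray())
```

Only the two stored pairs are ever searched on a lookup, which is why there is
no array for numpy to operate on until toArray() is called.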
The proposed solution can, in effect, be thought of as a purely syntactic
shortening from SparseVector.toArray().mean() to simply SparseVector.mean().
Thus, it should not introduce any additional complexity compared to how things
are now. The current behavior of this object is confusing in that the intuitive
call SparseVector.mean() just throws an "AttributeError:
'SparseVector' object has no attribute 'mean'".
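One way such a shortening could work is to forward unknown attribute lookups
to the dense array. The sketch below is an assumption about the general
approach, not the code in this pull request; ToySparseVector is again a
hypothetical stand-in, and the forwarding rule in __getattr__ is illustrative.

```python
import numpy as np


class ToySparseVector:
    """Hypothetical stand-in; not the PySpark class."""

    def __init__(self, size, indices, values):
        self.size = size
        self.indices = np.asarray(indices, dtype=np.int32)
        self.values = np.asarray(values, dtype=np.float64)

    def toArray(self):
        dense = np.zeros(self.size, dtype=np.float64)
        dense[self.indices] = self.values
        return dense

    def __getattr__(self, name):
        # Python calls __getattr__ only after normal attribute lookup fails.
        # Refuse dunder names and our own fields to avoid infinite recursion.
        if name.startswith("__") or name in ("size", "indices", "values"):
            raise AttributeError(name)
        # Forward everything else to a freshly built dense numpy array.
        return getattr(self.toArray(), name)


sv = ToySparseVector(4, [1, 3], [2.0, 6.0])
print(sv.toArray().mean())  # the long form
print(sv.mean())            # the shortened form, same computation
```

Every forwarded call still pays for a full toArray() conversion, so this
changes the spelling of the call rather than its cost.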
As mildly hinted at on JIRA, there are even better implementations which
could follow this one. For example, instead of calling numpy directly, the
same functions could be provided manually with reduced complexity, much along
the lines of how __getitem__ was written for SparseVector rather than relying
on the typical array slicing that DenseVector has.
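As a rough illustration of what "reduced complexity" could mean, the sketch
below computes mean() and max() from the stored values alone, without building
a dense array. ToySparseVector and both method bodies are hypothetical and
written for this example only.

```python
class ToySparseVector:
    """Hypothetical stand-in; not the PySpark class."""

    def __init__(self, size, indices, values):
        self.size = size
        self.indices = list(indices)
        self.values = list(values)

    def mean(self):
        # Implicit zeros add nothing to the sum, so only stored values are
        # visited; the divisor is still the full vector length.
        return sum(self.values) / self.size

    def max(self):
        # Implicit zeros do matter here: if any entry is not stored,
        # 0.0 is a candidate for the maximum.
        candidates = list(self.values)
        if len(self.values) < self.size:
            candidates.append(0.0)
        return max(candidates)  # builtin max, not this method

    def toArray(self):
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense


sv = ToySparseVector(1000, [3, 500], [10.0, 30.0])
dense = sv.toArray()
print(sv.mean(), sum(dense) / len(dense))  # same result, 2 values vs 1000 visited
```

The work done by mean() scales with the number of stored entries rather than
with the vector length, and max() shows that each function needs its own
handling of the implicit zeros.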