[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...

itg-abby Tue, 18 Oct 2016 13:27:07 -0700

Github user itg-abby commented on the issue:

    https://github.com/apache/spark/pull/15496
  
    It is possible that I am missing something, or that I have unintentionally
obscured the intent of this pull request. I will try summarizing my
understanding and purpose to see if that sheds any light:
    
    DenseVector allows numpy methods to be called on it directly (e.g.
DenseVector.mean()) and always stores its values in the object attribute
DenseVector.array. This allows a lot of neat numpy functions to be run on the
array values without any trouble.
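
    To illustrate the forwarding described above, here is a minimal sketch. It
uses a toy class (ToyDenseVector is a hypothetical name, not the pyspark
class) and assumes the forwarding is done through __getattr__; the actual
pyspark source may differ in detail.

```python
import numpy as np


class ToyDenseVector:
    """Toy stand-in for DenseVector (hypothetical, not the pyspark class)."""

    def __init__(self, values):
        # The values always live in a real numpy array.
        self.array = np.asarray(values, dtype=np.float64)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so unknown names
        # such as "mean" or "sum" are forwarded to the underlying array.
        return getattr(self.array, name)


dense = ToyDenseVector([1.0, 0.0, 3.0])
dense_mean = dense.mean()  # forwarded to numpy.ndarray.mean
```

    Because every unknown attribute falls through to the array, any ndarray
method becomes available on the wrapper without being listed explicitly.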
    
    SparseVector works differently: it never stores the full set of values as a
full array. Instead, it uses a 'trick' which only searches the non-zero
index/value pairs when a specific entry is asked for (this can be found in the
__getitem__ method of SparseVector). This prevents numpy functions from being
usable on the SparseVector, since there is no actual array to operate on
directly. However, a conversion function is provided, toArray().
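
    The lookup 'trick' and the conversion can be sketched roughly as follows.
This is a toy model using only the standard library: ToySparseVector and
to_array are hypothetical names, and bisect stands in for whatever search the
real __getitem__ performs.

```python
from bisect import bisect_left


class ToySparseVector:
    """Toy sparse vector: stores only the non-zero index/value pairs."""

    def __init__(self, size, indices, values):
        self.size = size
        self.indices = list(indices)  # sorted positions of non-zero entries
        self.values = list(values)    # values at those positions

    def __getitem__(self, index):
        if not 0 <= index < self.size:
            raise IndexError("index out of bounds")
        # Binary search among the stored indices only.
        pos = bisect_left(self.indices, index)
        if pos < len(self.indices) and self.indices[pos] == index:
            return self.values[pos]
        return 0.0  # anything not stored is an implicit zero

    def to_array(self):
        # Materialize the full vector, zeros included.
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense


sparse = ToySparseVector(5, [1, 3], [2.0, 4.0])
```

    A single lookup only touches the stored pairs, while to_array() has to
allocate and fill all `size` slots.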
    
    The proposed solution can, in effect, be thought of as a purely syntactic
shortening from SparseVector.toArray().mean() to simply SparseVector.mean().
Thus, it should not introduce any additional complexity compared to how things
are now. The current behavior of this object is confusing in that the intuitive
call SparseVector.mean() just raises "AttributeError: 'SparseVector' object
has no attribute 'mean'".
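
    A rough sketch of the before/after behavior, using toy classes rather than
the pyspark ones. PlainSparse and ForwardingSparse are hypothetical names, and
forwarding through __getattr__ is an assumption about how such a shortening
could be written, not a copy of the patch.

```python
import numpy as np


class PlainSparse:
    """Toy sparse vector without attribute forwarding (hypothetical)."""

    def __init__(self, size, indices, values):
        self.size = size
        self.indices = np.asarray(indices, dtype=np.int64)
        self.values = np.asarray(values, dtype=np.float64)

    def toArray(self):
        # Build the full dense array, zeros included.
        dense = np.zeros(self.size, dtype=np.float64)
        dense[self.indices] = self.values
        return dense


class ForwardingSparse(PlainSparse):
    """Same data, but unknown attributes are looked up on toArray()."""

    def __getattr__(self, name):
        # Do not forward dunder names or our own fields; this avoids
        # infinite recursion if the instance is not fully initialized.
        if name.startswith("__") or name in ("size", "indices", "values"):
            raise AttributeError(name)
        return getattr(self.toArray(), name)


plain = PlainSparse(4, [1, 3], [2.0, 6.0])
try:
    plain.mean()
    plain_has_mean = True
except AttributeError:
    plain_has_mean = False  # mirrors the current error

forwarding = ForwardingSparse(4, [1, 3], [2.0, 6.0])
short_form = forwarding.mean()           # the shortened spelling
long_form = forwarding.toArray().mean()  # the explicit spelling
```

    In this sketch both spellings do the same work: each forwarded call still
builds the dense array first, so the cost matches the explicit toArray() form.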
    
    As mildly hinted at on JIRA, there are even better implementations which
could follow this one. For example, the direct calls to numpy could be replaced
by hand-written versions of the same functions with reduced complexity, much
along the lines of how __getitem__ was written for SparseVector instead of the
typical array slicing that DenseVector has.
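
    As an illustration of what such reduced-complexity versions might look
like, the sketch below computes a mean and a max from the stored values alone,
without building the dense array. The function names sparse_mean and
sparse_max are hypothetical and not part of pyspark.

```python
def sparse_mean(size, values):
    """Mean over all `size` entries, touching only the stored values."""
    if size == 0:
        raise ValueError("mean of an empty vector is undefined")
    # Implicit zeros add nothing to the sum but still count in the length.
    return sum(values) / size


def sparse_max(size, values):
    """Max over all `size` entries, accounting for implicit zeros."""
    if size == 0:
        raise ValueError("max of an empty vector is undefined")
    if not values:
        return 0.0  # every entry is an implicit zero
    stored_max = max(values)
    if len(values) < size:
        # At least one implicit zero exists and may be the largest entry.
        return max(stored_max, 0.0)
    return stored_max
```

    Both functions run in time proportional to the number of stored values
rather than the full vector length. Note that functions such as max need care
with the implicit zeros, which is part of why this is a larger change than the
simple forwarding.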



