Hi scikit-learn experts,

I am using the sparse matrices generated by 
sklearn.feature_extraction.FeatureHasher

and want to compute cosine distances between feature vectors.  What is the 
best way to do this?
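
For comparison, here is what I understand scikit-learn's pairwise utilities to do on sparse input; a minimal sketch with toy CSR row vectors (the values are illustrative, not actual FeatureHasher output):

```python
import scipy.sparse
from sklearn.metrics.pairwise import cosine_similarity

# Two toy 1 x 3 sparse row vectors standing in for hashed feature vectors
a = scipy.sparse.csr_matrix([[1.0, 0.0, 2.0]])
b = scipy.sparse.csr_matrix([[1.0, 1.0, 0.0]])

# cosine_similarity accepts scipy sparse matrices directly
sim = cosine_similarity(a, b)[0, 0]
dist = 1.0 - sim  # cosine distance
```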

When I was hacking on this a month ago, I found that dot products on CSR 
sparse matrices were slow, so I wrote the version below that uses 
element-wise multiplication instead.

Is this a good way to compute cosine distances between feature hashed 
counters?


import math
import scipy.sparse
from sklearn.feature_extraction import FeatureHasher


class FeatureHashingCounter(object):
    _default_num_features = 2**31 - 1
    _hasher = FeatureHasher(_default_num_features, input_type='dict',
                            non_negative=False)

    def __init__(self, data):
        # hash the input dict of counts into a 1 x num_features sparse row
        self._matrix = self._hasher.transform([data])


def smart_dot(fhc1, fhc2):
    ## use element-wise multiplication instead of a sparse dot product
    return fhc1._matrix.multiply(fhc2._matrix).sum()


def cosine(fhc1, fhc2):
    dot = smart_dot(fhc1, fhc2)

    norm1 = math.sqrt(smart_dot(fhc1, fhc1))
    norm2 = math.sqrt(smart_dot(fhc2, fhc2))

    result = float(dot) / norm1 / norm2

    # clamp tiny negative values from floating-point error
    return max(result, 0)
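
As a sanity check, the element-wise `multiply(...).sum()` trick should agree with the ordinary sparse dot product; a small sketch on toy values (not hashed features):

```python
import math
import scipy.sparse

a = scipy.sparse.csr_matrix([[2.0, 0.0, 1.0]])
b = scipy.sparse.csr_matrix([[1.0, 3.0, 0.0]])

dot = a.multiply(b).sum()              # element-wise product, then sum
same = a.dot(b.T).toarray()[0, 0]      # ordinary sparse dot product

# cosine similarity from the same primitive
cos_sim = dot / (math.sqrt(a.multiply(a).sum()) *
                 math.sqrt(b.multiply(b).sum()))
```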



Also, I'm curious about truncation: is there a clean way to delete features 
that have low counts?  My current implementation involves sorting on 
count, truncating, and then sorting again on the sparse matrix indices to 
make a valid sparse matrix.  Is there a better way?
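
One approach I've seen that avoids the re-sorting step: zero out the low-count entries in the CSR `data` array and then call `eliminate_zeros()`, which drops them while keeping the index arrays valid. A sketch with a toy matrix and an assumed threshold:

```python
import scipy.sparse

m = scipy.sparse.csr_matrix([[5.0, 1.0, 0.0, 3.0, 1.0]])
threshold = 2.0  # hypothetical cutoff for "low count"

# Zero the low-count entries in place, then drop the explicit zeros;
# eliminate_zeros() leaves indices sorted, so no manual re-sorting is needed
m.data[m.data < threshold] = 0
m.eliminate_zeros()
```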


Thanks for any advice!

John

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
