Hi scikit-learn experts,
I am using the sparse matrices generated by
sklearn.feature_extraction.FeatureHasher
and want to compute cosine distances between feature vectors. What is the
best way to do this?
When I was hacking on this a month ago, I found that the logic in the CSR
sparse matrix was causing slow dot products, so I wrote the version below
that uses element-wise multiplication.
Is this a good way to compute cosine distances between feature hashed
counters?
import math

import scipy.sparse
from sklearn.feature_extraction import FeatureHasher

class FeatureHashingCounter(object):
    _default_num_features = 2**31 - 1
    _hasher = FeatureHasher(_default_num_features, input_type='dict',
                            non_negative=False)

    def __init__(self, data):
        # hash the input counter (a dict) into a 1 x num_features CSR row
        self._matrix = self._hasher.transform([data])

def smart_dot(fhc1, fhc2):
    # element-wise multiplication plus sum, instead of the slow CSR dot
    return fhc1._matrix.multiply(fhc2._matrix).sum()

def cosine(fhc1, fhc2):
    dot = smart_dot(fhc1, fhc2)
    norm1 = math.sqrt(smart_dot(fhc1, fhc1))
    norm2 = math.sqrt(smart_dot(fhc2, fhc2))
    result = float(dot) / norm1 / norm2
    # signed hashing can produce small negative similarities; clamp to 0
    return max(result, 0)
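For concreteness, here is a self-contained sketch of the element-wise cosine with two hand-built CSR rows standing in for the hashed counters (the toy vectors are just for illustration):

```python
import math
import scipy.sparse

def smart_dot(m1, m2):
    # element-wise multiply, then sum all entries
    return m1.multiply(m2).sum()

def cosine(m1, m2):
    dot = smart_dot(m1, m2)
    norm1 = math.sqrt(smart_dot(m1, m1))
    norm2 = math.sqrt(smart_dot(m2, m2))
    # clamp tiny negative values from signed hashing to 0
    return max(float(dot) / norm1 / norm2, 0)

# two sparse 1-row "feature vectors"
v1 = scipy.sparse.csr_matrix([[1.0, 2.0, 0.0, 3.0]])
v2 = scipy.sparse.csr_matrix([[1.0, 0.0, 4.0, 3.0]])
sim = cosine(v1, v2)  # dot = 10, norms = sqrt(14) and sqrt(26)
```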
Also, I'm curious about truncation: is there a clean way to delete features
that have low counts? My current implementation involves sorting on
count, truncating, and then re-sorting on the sparse matrix indices to
produce a valid sparse matrix again. Is there a better way?
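For reference, roughly what I'm doing now, sketched for a single 1-row CSR matrix (the name keep_top_k is just illustrative):

```python
import numpy as np
import scipy.sparse

def keep_top_k(m, k):
    # m: a 1-row CSR matrix; keep only the k largest-count entries
    if m.nnz <= k:
        return m.copy()
    # sort on count: positions of the k largest values in the data array
    keep = np.argsort(m.data)[-k:]
    # re-sort on column index so the result is a valid CSR matrix
    keep = np.sort(keep)
    data = m.data[keep]
    indices = m.indices[keep]
    indptr = np.array([0, len(data)])
    return scipy.sparse.csr_matrix((data, indices, indptr), shape=m.shape)

m = scipy.sparse.csr_matrix([[5.0, 1.0, 3.0, 2.0, 4.0]])
t = keep_top_k(m, 3)  # keeps counts 5, 3, 4 at columns 0, 2, 4
```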
Thanks for any advice!
John
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general