[
https://issues.apache.org/jira/browse/MAHOUT-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dan Filimon updated MAHOUT-1117:
--------------------------------
Description:
No *Vector classes (DenseVector, WeightedVector, etc.) implement hashCode().
While working on improving clustering in Mahout, Ted Dunning wrote prototype
code for Streaming KMeans and Ball KMeans, which I am working on with him. The
two need to be used together in the MapReduce version.
In Ball KMeans, we initialize the clusters using a probabilistic approach
similar to k-means++. This requires a Multinomial<WeightedVector> distribution
over the points we want to cluster, from which the centroids are picked.
Internally, the Multinomial<T> uses a HashMap to keep track of the values it
can sample from.
Since Vectors don't override Object's hashCode(), hashing falls back to object
identity, so the same value can appear multiple times in the map (as long as
the references differ).
This is less of an issue for us because of how we add the vectors to the
multinomial (we can guarantee that the references will be unique), and once
MAHOUT-1116 is resolved the hashing will work well enough for our needs.
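To illustrate the duplicate-entry behaviour described above, here is a minimal
sketch. PlainVector and HashableVector are hypothetical stand-ins invented for
this example, not Mahout classes; the first relies on Object's identity-based
equals()/hashCode(), while the second overrides both based on its contents.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class IdentityHashDemo {
    // Hypothetical stand-in for a vector that does not override
    // equals()/hashCode(), so HashMap treats each instance as a distinct key.
    static final class PlainVector {
        final double[] values;
        PlainVector(double... values) { this.values = values.clone(); }
    }

    // Hypothetical stand-in with value-based equals() and hashCode().
    static final class HashableVector {
        final double[] values;
        HashableVector(double... values) { this.values = values.clone(); }

        @Override
        public boolean equals(Object o) {
            return o instanceof HashableVector
                && Arrays.equals(values, ((HashableVector) o).values);
        }

        @Override
        public int hashCode() { return Arrays.hashCode(values); }
    }

    // Inserts two equal-valued but distinct PlainVector references.
    static int countPlainEntries() {
        Map<PlainVector, Double> weights = new HashMap<>();
        weights.put(new PlainVector(1.0, 2.0), 0.5);
        weights.put(new PlainVector(1.0, 2.0), 0.5);
        return weights.size();
    }

    // Inserts two equal-valued HashableVector instances; the second put
    // overwrites the first because the keys compare equal.
    static int countHashableEntries() {
        Map<HashableVector, Double> weights = new HashMap<>();
        weights.put(new HashableVector(1.0, 2.0), 0.5);
        weights.put(new HashableVector(1.0, 2.0), 0.5);
        return weights.size();
    }

    public static void main(String[] args) {
        System.out.println("plain entries:    " + countPlainEntries());
        System.out.println("hashable entries: " + countHashableEntries());
    }
}
```

The identity-keyed map ends up with two entries for the same value, while the
value-keyed map ends up with one.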
It still seems that it would be useful to have hashable vectors.
What do you think? And what would a hash function look like?
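As one possible starting point for that question, the sketch below hashes a
vector from its size and its non-zero (index, value) pairs. It is only an
illustration under stated assumptions: the class and method names are made up
for this example, the mixing constants 31 and 17 are arbitrary, and this is not
presented as Mahout's implementation. Summing the per-entry terms makes the
result independent of iteration order, so a dense and a sparse representation
of the same values would hash alike.

```java
public class VectorHashSketch {
    // Mixes one (index, value) pair into an int. The constants are arbitrary
    // choices for this sketch.
    private static int entryHash(int index, double value) {
        return (31 * index + 17) * Double.hashCode(value);
    }

    // Hash of a dense vector: start from the size and add a term for every
    // non-zero entry. Zero entries (including -0.0) are skipped.
    static int hashDense(double[] values) {
        int result = values.length;
        for (int i = 0; i < values.length; i++) {
            if (values[i] != 0.0) {
                result += entryHash(i, values[i]);
            }
        }
        return result;
    }

    // Hash of a sparse vector given as parallel index/value arrays in any
    // order. Integer addition is commutative, so entry order does not matter.
    static int hashSparse(int size, int[] indices, double[] values) {
        int result = size;
        for (int i = 0; i < indices.length; i++) {
            if (values[i] != 0.0) {
                result += entryHash(indices[i], values[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[] dense = {0.0, 3.0, 0.0, 5.0};
        int sparseHash = hashSparse(4, new int[] {3, 1}, new double[] {5.0, 3.0});
        System.out.println(hashDense(dense) == sparseHash);
    }
}
```

Any such hashCode() would need a matching value-based equals(), and mutating a
vector while it is used as a map key would still break lookups.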
Affects Version/s: 1.0
> Vectors are not hashable
> ------------------------
>
> Key: MAHOUT-1117
> URL: https://issues.apache.org/jira/browse/MAHOUT-1117
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 1.0
> Reporter: Dan Filimon
> Priority: Minor