[
https://issues.apache.org/jira/browse/MAHOUT-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adam Bozanich updated MAHOUT-1157:
----------------------------------
Description:
AbstractCluster.formatVector's use of the size field of the given vector causes
problems when the vector is sparse.
I clustered a handful of vectors which had been initialized with a cardinality
of Integer.MAX_VALUE. Running seqdump on the resulting clusteredPoints took
over four minutes. This is because formatVector() was iterating over the
entire integer space for every vector.
was:
AbstractCluster.formatVector's use of the size field of the given vector causes
problems when the vector is sparse.
When reading WeightedVectorWriteables from the clusteredPoints directory that
was created by running kmeans with the -cl flag, the embedded
RandomAccessSparseVector is being instantiated with
I clustered a handful of vectors which had been initialized with a cardinality
of Integer.MAX_VALUE. Running seqdump on the resulting clusteredPoints took
over four minutes.
> AbstractCluster.formatVector iteration bug.
> -------------------------------------------
>
> Key: MAHOUT-1157
> URL: https://issues.apache.org/jira/browse/MAHOUT-1157
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.7
> Reporter: Adam Bozanich
> Attachments: mahout.patch
>
>
> AbstractCluster.formatVector's use of the size field of the given vector
> causes problems when the vector is sparse.
> I clustered a handful of vectors which had been initialized with a
> cardinality of Integer.MAX_VALUE. Running seqdump on the resulting
> clusteredPoints took over four minutes. This is because formatVector() was
> iterating over the entire integer space for every vector.
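
The slowdown described above can be illustrated with a minimal, self-contained sketch. It is not Mahout's code and not the attached mahout.patch; SparseFormatSketch, SparseVec, formatBySize and formatNonZeros are hypothetical names used only for illustration. The sketch assumes a sparse vector that stores its non-zero entries in a sorted map and compares a formatter that walks every index up to the cardinality with one that visits only the stored entries.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names come from Mahout.
class SparseFormatSketch {

    /** Stand-in for a sparse vector: a declared cardinality plus stored non-zero entries. */
    static final class SparseVec {
        final int cardinality;
        final TreeMap<Integer, Double> entries = new TreeMap<>();

        SparseVec(int cardinality) {
            this.cardinality = cardinality;
        }

        void set(int index, double value) {
            if (value != 0.0) {
                entries.put(index, value);
            } else {
                entries.remove(index);
            }
        }

        double get(int index) {
            return entries.getOrDefault(index, 0.0);
        }
    }

    /** Size-driven formatting: cost grows with the cardinality, not with the stored entries. */
    static String formatBySize(SparseVec v) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < v.cardinality; i++) {
            double x = v.get(i);
            if (x != 0.0) {
                append(sb, i, x);
            }
        }
        return sb.append(']').toString();
    }

    /** Entry-driven formatting: cost grows only with the number of stored non-zero entries. */
    static String formatNonZeros(SparseVec v) {
        StringBuilder sb = new StringBuilder("[");
        for (Map.Entry<Integer, Double> e : v.entries.entrySet()) {
            append(sb, e.getKey(), e.getValue());
        }
        return sb.append(']').toString();
    }

    static void append(StringBuilder sb, int index, double value) {
        if (sb.length() > 1) {
            sb.append(", ");
        }
        sb.append(index).append(':').append(value);
    }

    public static void main(String[] args) {
        SparseVec v = new SparseVec(Integer.MAX_VALUE);
        v.set(3, 1.5);
        v.set(1_000_000, 2.0);
        // Only two entries are visited, regardless of the declared cardinality.
        System.out.println(formatNonZeros(v)); // prints [3:1.5, 1000000:2.0]
    }
}
```

Calling formatBySize on the same vector would produce the same string, but only after stepping through about two billion indices, which matches the behaviour reported for seqdump. Whether the actual fix takes this form depends on the attached patch, which is not reproduced here.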