[
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718966#action_12718966
]
Sean Owen commented on MAHOUT-121:
----------------------------------
Here's a start at that last suggestion. Anyone care to see it finished? I can
hack it out.
public final class FastIntDouble {
private static final double DEFAULT_VALUE = 0.0;
private int[] indices;
private double[] values;
private int size;
public FastIntDouble(int capacity) {
indices = new int[capacity];
values = new double[capacity];
size = 0;
}
private void growTo(int newCapacity) {
if (newCapacity > indices.length) {
int[] newIndices = new int[newCapacity];
System.arraycopy(indices, 0, newIndices, 0, size);
indices = newIndices;
double[] newValues = new double[newCapacity];
System.arraycopy(values, 0, newValues, 0, size);
values = newValues;
}
}
private int find(int index) {
int low = 0;
int high = size - 1;
while (low <= high) {
int mid = (low + high) >>> 1;
int midVal = indices[mid];
if (midVal < index) {
low = mid + 1;
} else if (midVal > index) {
high = mid - 1;
} else {
return mid;
}
}
return -(low + 1);
}
public double get(int index) {
int offset = find(index);
return offset >= 0 ? values[offset] : DEFAULT_VALUE;
}
public void set(int index, double value) {
int offset = find(index);
if (offset >= 0) {
if (value == DEFAULT_VALUE) {
System.arraycopy(indices, offset + 1, indices, offset, size - offset);
System.arraycopy(values, offset + 1, values, offset, size - offset);
size--;
} else {
values[offset] = value;
}
} else {
if (value != DEFAULT_VALUE) {
if (size >= indices.length) {
growTo(size << 1);
}
int at = -offset - 1;
if (size > at) {
System.arraycopy(indices, at, indices, at + 1, size - at);
System.arraycopy(values, at, values, at + 1, size - at);
}
indices[at] = index;
values[at] = value;
size++;
}
}
}
}
> Speed up distance calculations for sparse vectors
> -------------------------------------------------
>
> Key: MAHOUT-121
> URL: https://issues.apache.org/jira/browse/MAHOUT-121
> Project: Mahout
> Issue Type: Improvement
> Components: Matrix
> Reporter: Shashikant Kore
> Attachments: mahout-121.patch
>
>
> From my mail to the Mahout mailing list.
> I am working on clustering a dataset which has thousands of sparse vectors.
> The complete dataset has few tens of thousands of feature items but each
> vector has only couple of hundred feature items. For this, there is an
> optimization in distance calculation, a link to which I found the archives of
> Mahout mailing list.
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
> I tried out this optimization. The test setup had 2000 document vectors
> with few hundred items. I ran canopy generation with Euclidean distance and
> t1, t2 values as 250 and 200.
>
> Current Canopy Generation: 28 min 15 sec.
> Canopy Generation with distance optimization: 1 min 38 sec.
> I know by experience that using Integer, Double objects instead of primitives
> is computationally expensive. I changed the sparse vector implementation to
> used primitive collections by Trove [
> http://trove4j.sourceforge.net/ ].
> Distance optimization with Trove: 59 sec
> Current canopy generation with Trove: 21 min 55 sec
> To sum, these two optimizations reduced cluster generation time by a 97%.
> Currently, I have made the changes for Euclidean Distance, Canopy and KMeans.
>
> Licensing of Trove seems to be an issue which needs to be addressed.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.