[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718966#action_12718966
 ] 

Sean Owen commented on MAHOUT-121:
----------------------------------

Here's a start at that last suggestion. Anyone care to see it finished? I can 
hack it out.

public final class FastIntDouble {

  private static final double DEFAULT_VALUE = 0.0;

  private int[] indices;
  private double[] values;
  private int size;

  public FastIntDouble(int capacity) {
    indices = new int[capacity];
    values = new double[capacity];
    size = 0;
  }

  private void growTo(int newCapacity) {
    if (newCapacity > indices.length) {
      int[] newIndices = new int[newCapacity];
      System.arraycopy(indices, 0, newIndices, 0, size);
      indices = newIndices;
      double[] newValues = new double[newCapacity];
      System.arraycopy(values, 0, newValues, 0, size);
      values = newValues;
    }
  }

  private int find(int index) {
    int low = 0;
    int high = size - 1;
    while (low <= high) {
      int mid = (low + high) >>> 1;
      int midVal = indices[mid];
      if (midVal < index) {
        low = mid + 1;
      } else if (midVal > index) {
        high = mid - 1;
      } else {
        return mid;
      }
    }
    return -(low + 1);
  }

  public double get(int index) {
    int offset = find(index);
    return offset >= 0 ? values[offset] : DEFAULT_VALUE;
  }

  public void set(int index, double value) {
    int offset = find(index);
    if (offset >= 0) {
      if (value == DEFAULT_VALUE) {
        System.arraycopy(indices, offset + 1, indices, offset, size - offset);
        System.arraycopy(values, offset + 1, values, offset, size - offset);
        size--;
      } else {
        values[offset] = value;
      }
    } else {
      if (value != DEFAULT_VALUE) {
        if (size >= indices.length) {
          growTo(size << 1);
        }
        int at = -offset - 1;
        if (size > at) {
          System.arraycopy(indices, at, indices, at + 1, size - at);
          System.arraycopy(values, at, values, at + 1, size - at);
        }
        indices[at] = index;
        values[at] = value;
        size++;
      }
    }
  }

}

> Speed up distance calculations for sparse vectors
> -------------------------------------------------
>
>                 Key: MAHOUT-121
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-121
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>            Reporter: Shashikant Kore
>         Attachments: mahout-121.patch
>
>
> From my mail to the Mahout mailing list.
> I am working on clustering a dataset which has thousands of sparse vectors. 
> The complete dataset has few tens of thousands of feature items but each 
> vector has only couple of hundred feature items. For this, there is an 
> optimization in distance calculation, a link to which I found the archives of 
> Mahout mailing list.
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
> I tried out this optimization.  The test setup had 2000 document  vectors 
> with few hundred items.  I ran canopy generation with Euclidean distance and 
> t1, t2 values as 250 and 200.
>  
> Current Canopy Generation: 28 min 15 sec.
> Canopy Generation with distance optimization: 1 min 38 sec.
> I know by experience that using Integer, Double objects instead of primitives 
> is computationally expensive. I changed the sparse vector  implementation to 
> used primitive collections by Trove [
> http://trove4j.sourceforge.net/ ].
> Distance optimization with Trove: 59 sec
> Current canopy generation with Trove: 21 min 55 sec
> To sum, these two optimizations reduced cluster generation time by a 97%.
> Currently, I have made the changes for Euclidean Distance, Canopy and KMeans. 
>  
> Licensing of Trove seems to be an issue which needs to be addressed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to