[04/15] mahout git commit: NO-JIRA Trevors updates

rawkintrevo Sat, 08 Sep 2018 16:36:02 -0700

http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/list/package-info.java
----------------------------------------------------------------------
diff --git a/core/src/main/java/org/apache/mahout/math/list/package-info.java 
b/core/src/main/java/org/apache/mahout/math/list/package-info.java
new file mode 100644
index 0000000..43b5c4b
--- /dev/null
+++ b/core/src/main/java/org/apache/mahout/math/list/package-info.java
@@ -0,0 +1,144 @@
+/**
+ * <HTML>
+ * <BODY>
+ * Resizable lists holding objects or primitive data types such as 
<tt>int</tt>,
+ * <tt>double</tt>, etc. For non-resizable lists (1-dimensional matrices) see
+ * package <code>org.apache.mahout.math.matrix</code>.<p></p>
+ * <h1><a name="Overview"></a>Getting Started</h1>
+ * <h2>1. Overview</h2>
+ * <p>The list package offers flexible object oriented abstractions modelling 
dynamically
+ * resizing lists holding objects or primitive data types such as <tt>int</tt>,
+ * <tt>double</tt>, etc. It is designed to be scalable in terms of performance
+ * and memory requirements.</p>
+ * <p>Features include: </p>
+ * <p></p>
+ * <ul>
+ * <li>Lists operating on objects as well as all primitive data types such as 
<tt>int</tt>,
+ * <tt>double</tt>, etc.
+ * </li>
+ * <li>Compact representations</li>
+ * <li>A number of general purpose list operations including: adding, 
inserting,
+ * removing, iterating, searching, sorting, extracting ranges and copying. All
+ * operations are designed to perform well on mass data.
+ * </li>
+ * <li>Support for quick access to list elements. This is achieved by 
bounds-checking
+ * and non-bounds-checking accessor methods as well as zero-copy 
transformations
+ * to primitive arrays such as <tt>int[]</tt>, <tt>double[]</tt>, etc.
+ * </li>
+ * <li>Allows to use high level algorithms on primitive data types without any
+ * space and time overhead. Operations on primitive arrays, Colt lists and JAL
+ * algorithms can freely be mixed at zero copy overhead.
+ * </li>
+ * </ul>
+ * <p>File-based I/O can be achieved through the standard Java built-in 
serialization
+ * mechanism. All classes implement the {@link java.io.Serializable} interface.
+ * However, the toolkit is entirely decoupled from advanced I/O. It provides 
data
+ * structures and algorithms only.
+ * <p> This toolkit borrows concepts and terminology from the Javasoft <a
+ * 
href="http://www.javasoft.com/products/jdk/1.2/docs/guide/collections/index.html";>
+ * Collections framework</a> written by Josh Bloch and introduced in JDK 1.2.
+ * <h2>2. Introduction</h2>
+ * <p>Lists are fundamental to virtually any application. Large scale 
resizable lists
+ * are, for example, used in scientific computations, simulations database 
management
+ * systems, to name just a few.</p>
+ * <h2></h2>
+ * <p>A list is a container holding elements that can be accessed via 
zero-based
+ * indexes. Lists may be implemented in different ways (most commonly with 
arrays).
+ * A resizable list automatically grows as elements are added. The lists of 
this
+ * package do not automatically shrink. Shrinking needs to be triggered by 
explicitly
+ * calling <tt>trimToSize()</tt> methods.</p>
+ * <p><i>Growing policy</i>: A list implemented with arrays initially has a 
certain
+ * <tt>initialCapacity</tt> - per default 10 elements, but customizable upon 
instance
+ * construction. As elements are added, this capacity may nomore be sufficient.
+ * When a list is automatically grown, its capacity is expanded to 
<tt>1.5*currentCapacity</tt>.
+ * Thus, excessive resizing (involving copying) is avoided.</p>
+ * <h4>Copying</h4>
+ * <p>
+ * <p>Any list can be copied. A copy is <i>equal</i> to the original but 
entirely
+ * independent of the original. So changes in the copy are not reflected in the
+ * original, and vice-versa.
+ * <h2>3. Organization of this package</h2>
+ * <p>Class naming follows the schema 
<tt>&lt;ElementType&gt;&lt;ImplementationTechnique&gt;List</tt>.
+ * For example, we have a {@link org.apache.mahout.math.list.DoubleArrayList}, 
which is a list
+ * holding <tt>double</tt> elements implemented with <tt>double</tt>[] arrays.
+ * </p>
+ * <p>The classes for lists of a given value type are derived from a common 
abstract
+ * base class tagged <tt>Abstract&lt;ElementType&gt;</tt><tt>List</tt>. For 
example,
+ * all lists operating on <tt>double</tt> elements are derived from
+ * {@link org.apache.mahout.math.list.AbstractDoubleList},
+ * which in turn is derived from an abstract base class tying together all 
lists
+ * regardless of value type, {@link org.apache.mahout.math.list.AbstractList}. 
The abstract
+ * base classes provide skeleton implementations for all but few methods. 
Experimental
+ * data layouts (such as compressed, sparse, linked, etc.) can easily be 
implemented
+ * and inherit a rich set of functionality. Have a look at the javadoc <a 
href="package-tree.html">tree
+ * view</a> to get the broad picture.</p>
+ * <h2>4. Example usage</h2>
+ * <p>The following snippet fills a list, randomizes it, extracts the first 
half
+ * of the elements, sums them up and prints the result. It is implemented 
entirely
+ * with accessor methods.</p>
+ * <table>
+ * <td class="PRE">
+ * <pre>
+ * int s = 1000000;<br>AbstractDoubleList list = new DoubleArrayList();
+ * for (int i=0; i&lt;s; i++) { list.add((double)i); }
+ * list.shuffle();
+ * AbstractDoubleList part = list.partFromTo(0,list.size()/2 - 1);
+ * double sum = 0.0;
+ * for (int i=0; i&lt;part.size(); i++) { sum += part.get(i); }
+ * log.info(sum);
+ * </pre>
+ * </td>
+ * </table>
+ * <p> For efficiency, all classes provide back doors to enable 
getting/setting the
+ * backing array directly. In this way, the high level operations of these 
classes
+ * can be used where appropriate, and one can switch to <tt>[]</tt>-array index
+ * notations where necessary. The key methods for this are <tt>public 
&lt;ElementType&gt;[]
+ * elements()</tt> and <tt>public void elements(&lt;ElementType&gt;[])</tt>. 
The
+ * former trustingly returns the array it internally keeps to store the 
elements.
+ * Holding this array in hand, we can use the <tt>[]</tt>-array operator to
+ * perform iteration over large lists without needing to copy the array or 
paying
+ * the performance penalty introduced by accessor methods. Alternatively any 
JAL
+ * algorithm (or other algorithm) can operate on the returned primitive array.
+ * The latter method forces a list to internally hold a user provided array. 
Using
+ * this approach one can avoid needing to copy the elements into the list.
+ * <p>As a consequence, operations on primitive arrays, Colt lists and JAL 
algorithms
+ * can freely be mixed at zero-copy overhead.
+ * <p> Note that such special treatment certainly breaks encapsulation. This 
functionality
+ * is provided for performance reasons only and should only be used when 
absolutely
+ * necessary. Here is the above example in mixed notation:
+ * <table>
+ * <td class="PRE">
+ * <pre>
+ * int s = 1000000;<br>DoubleArrayList list = new DoubleArrayList(s); // 
list.size()==0, capacity==s
+ * list.setSize(s); // list.size()==s<br>double[] values = list.elements();
+ * // zero copy, values.length==s<br>for (int i=0; i&lt;s; i++) { 
values[i]=(double)i; }
+ * list.shuffle();
+ * double sum = 0.0;
+ * int limit = values.length/2;
+ * for (int i=0; i&lt;limit; i++) { sum += values[i]; }
+ * log.info(sum);
+ * </pre>
+ * </td>
+ * </table>
+ * <p> Or even more compact using lists as algorithm objects:
+ * <table>
+ * <td class="PRE">
+ * <pre>
+ * int s = 1000000;<br>double[] values = new double[s];
+ * for (int i=0; i&lt;s; i++) { values[i]=(double)i; }
+ * new DoubleArrayList(values).shuffle(); // zero-copy, shuffle via back door
+ * double sum = 0.0;
+ * int limit = values.length/2;
+ * for (int i=0; i&lt;limit; i++) { sum += values[i]; }
+ * log.info(sum);
+ * </pre>
+ * </td>
+ * </table>
+ * <p>
+ * <h2>5. Notes </h2>
+ * <p>The quicksorts and mergesorts are the JDK 1.2 V1.26 algorithms, modified 
as
+ * necessary to operate on the given data types.
+ * </BODY>
+ * </HTML>
+ */
+package org.apache.mahout.math.list;


http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/map/HashFunctions.java
----------------------------------------------------------------------
diff --git a/core/src/main/java/org/apache/mahout/math/map/HashFunctions.java 
b/core/src/main/java/org/apache/mahout/math/map/HashFunctions.java
new file mode 100644
index 0000000..b749307
--- /dev/null
+++ b/core/src/main/java/org/apache/mahout/math/map/HashFunctions.java
@@ -0,0 +1,115 @@
+/*
+Copyright 1999 CERN - European Organization for Nuclear Research.
+Permission to use, copy, modify, distribute and sell this software and its 
documentation for any purpose 
+is hereby granted without fee, provided that the above copyright notice appear 
in all copies and 
+that both that copyright notice and this permission notice appear in 
supporting documentation. 
+CERN makes no representations about the suitability of this software for any 
purpose. 
+It is provided "as is" without expressed or implied warranty.
+*/
+package org.apache.mahout.math.map;
+
+
+/**
+ * Provides various hash functions.
+ */
+public final class HashFunctions {
+
+  /**
+   * Utility class pattern: all static members, no inheritance.
+   */
+  private HashFunctions() {
+  }
+
+  /**
+   * Returns a hashcode for the specified value.
+   *
+   * @return a hash code value for the specified value.
+   */
+  public static int hash(char value) {
+    return value;
+  }
+
+  /**
+   * Returns a hashcode for the specified value.
+   *
+   * @return a hash code value for the specified value.
+   */
+  public static int hash(double value) {
+    long bits = Double.doubleToLongBits(value);
+    return (int) (bits ^ (bits >>> 32));
+
+    //return (int) Double.doubleToLongBits(value*663608941.737);
+    // this avoids excessive hashCollisions in the case values are of the form 
(1.0, 2.0, 3.0, ...)
+  }
+
+  /**
+   * Returns a hashcode for the specified value.
+   *
+   * @return a hash code value for the specified value.
+   */
+  public static int hash(float value) {
+    return Float.floatToIntBits(value * 663608941.737f);
+    // this avoids excessive hashCollisions in the case values are of the form 
(1.0, 2.0, 3.0, ...)
+  }
+
+  /**
+   * Returns a hashcode for the specified value.
+   * The hashcode computation is similar to the last step
+   * of MurMurHash3.
+   *
+   * @return a hash code value for the specified value.
+   */
+  public static int hash(int value) {
+    int h = value;
+    h ^= h >>> 16;
+    h *= 0x85ebca6b;
+    h ^= h >>> 13;
+    h *= 0xc2b2ae35;
+    h ^= h >>> 16;
+    return h;
+  }
+
+  /**
+   * Returns a hashcode for the specified value.
+   *
+   * @return a hash code value for the specified value.
+   */
+  public static int hash(long value) {
+    return (int) (value ^ (value >> 32));
+    /*
+    value &= 0x7FFFFFFFFFFFFFFFL; // make it >=0 
(0x7FFFFFFFFFFFFFFFL==Long.MAX_VALUE)
+    int hashCode = 0;
+    do hashCode = 31*hashCode + (int) (value%10);
+    while ((value /= 10) > 0);
+
+    return 28629151*hashCode; // spread even further; h*31^5
+    */
+  }
+
+  /**
+   * Returns a hashcode for the specified object.
+   *
+   * @return a hash code value for the specified object.
+   */
+  public static int hash(Object object) {
+    return object == null ? 0 : object.hashCode();
+  }
+
+  /**
+   * Returns a hashcode for the specified value.
+   *
+   * @return a hash code value for the specified value.
+   */
+  public static int hash(short value) {
+    return value;
+  }
+
+  /**
+   * Returns a hashcode for the specified value.
+   *
+   * @return a hash code value for the specified value.
+   */
+  public static int hash(boolean value) {
+    return value ? 1231 : 1237;
+  }
+}

http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/map/OpenHashMap.java
----------------------------------------------------------------------
diff --git a/core/src/main/java/org/apache/mahout/math/map/OpenHashMap.java 
b/core/src/main/java/org/apache/mahout/math/map/OpenHashMap.java
new file mode 100644
index 0000000..0efca4b
--- /dev/null
+++ b/core/src/main/java/org/apache/mahout/math/map/OpenHashMap.java
@@ -0,0 +1,652 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*
+Copyright Ã¯Â¿Â½ 1999 CERN - European Organization for Nuclear Research.
+Permission to use, copy, modify, distribute and sell this software and its 
documentation for any purpose 
+is hereby granted without fee, provided that the above copyright notice appear 
in all copies and 
+that both that copyright notice and this permission notice appear in 
supporting documentation. 
+CERN makes no representations about the suitability of this software for any 
purpose. 
+It is provided "as is" without expressed or implied warranty.
+*/
+package org.apache.mahout.math.map;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import org.apache.mahout.math.function.ObjectObjectProcedure;
+import org.apache.mahout.math.function.ObjectProcedure;
+import org.apache.mahout.math.set.AbstractSet;
+import org.apache.mahout.math.set.OpenHashSet;
+
+/**
+  * Open hash map. This implements Map, but it does not respect several 
aspects of the Map contract
+  * that impose the very sorts of performance penalities that this class 
exists to avoid.
+  * {@link #entrySet}, {@link #values}, and {@link #keySet()} do 
<strong>not</strong> return
+  * collections that share storage with the main map, and changes to those 
returned objects
+  * are <strong>not</strong> reflected in the container.
+ **/
+public class OpenHashMap<K,V> extends AbstractSet implements Map<K,V> {
+  protected static final byte FREE = 0;
+  protected static final byte FULL = 1;
+  protected static final byte REMOVED = 2;
+  protected static final Object NO_KEY_VALUE = null;
+
+  /** The hash table keys. */
+  protected Object[] table;
+
+  /** The hash table values. */
+  protected Object[] values;
+
+  /** The state of each hash table entry (FREE, FULL, REMOVED). */
+  protected byte[] state;
+
+  /** The number of table entries in state==FREE. */
+  protected int freeEntries;
+
+
+  /** Constructs an empty map with default capacity and default load factors. 
*/
+  public OpenHashMap() {
+    this(DEFAULT_CAPACITY);
+  }
+
+  /**
+   * Constructs an empty map with the specified initial capacity and default 
load factors.
+   *
+   * @param initialCapacity the initial capacity of the map.
+   * @throws IllegalArgumentException if the initial capacity is less than 
zero.
+   */
+  public OpenHashMap(int initialCapacity) {
+    this(initialCapacity, DEFAULT_MIN_LOAD_FACTOR, DEFAULT_MAX_LOAD_FACTOR);
+  }
+
+  /**
+   * Constructs an empty map with the specified initial capacity and the 
specified minimum and maximum load factor.
+   *
+   * @param initialCapacity the initial capacity.
+   * @param minLoadFactor   the minimum load factor.
+   * @param maxLoadFactor   the maximum load factor.
+   * @throws IllegalArgumentException if <tt>initialCapacity < 0 || 
(minLoadFactor < 0.0 || minLoadFactor >= 1.0) ||
+   *                                  (maxLoadFactor <= 0.0 || maxLoadFactor 
>= 1.0) || (minLoadFactor >=
+   *                                  maxLoadFactor)</tt>.
+   */
+  public OpenHashMap(int initialCapacity, double minLoadFactor, double 
maxLoadFactor) {
+    setUp(initialCapacity, minLoadFactor, maxLoadFactor);
+  }
+
+  /** Removes all (key,value) associations from the receiver. Implicitly calls 
<tt>trimToSize()</tt>. */
+  @Override
+  public void clear() {
+    Arrays.fill(this.state, FREE);
+    distinct = 0;
+    freeEntries = table.length; // delta
+    trimToSize();
+  }
+
+  /**
+   * Returns a deep copy of the receiver.
+   *
+   * @return a deep copy of the receiver.
+   */
+  @Override
+  @SuppressWarnings("unchecked")
+  public Object clone() {
+    OpenHashMap<K,V> copy = (OpenHashMap<K,V>) super.clone();
+    copy.table = copy.table.clone();
+    copy.values = copy.values.clone();
+    copy.state = copy.state.clone();
+    return copy;
+  }
+
+  /**
+   * Returns <tt>true</tt> if the receiver contains the specified key.
+   *
+   * @return <tt>true</tt> if the receiver contains the specified key.
+   */
+  @SuppressWarnings("unchecked")
+  @Override
+  public boolean containsKey(Object key) {
+    return indexOfKey((K)key) >= 0;
+  }
+
+  /**
+   * Returns <tt>true</tt> if the receiver contains the specified value.
+   *
+   * @return <tt>true</tt> if the receiver contains the specified value.
+   */
+  @SuppressWarnings("unchecked")
+  @Override
+  public boolean containsValue(Object value) {
+    return indexOfValue((V)value) >= 0;
+  }
+
+  /**
+   * Ensures that the receiver can hold at least the specified number of 
associations without needing to allocate new
+   * internal memory. If necessary, allocates new internal memory and 
increases the capacity of the receiver. <p> This
+   * method never need be called; it is for performance tuning only. Calling 
this method before <tt>put()</tt>ing a
+   * large number of associations boosts performance, because the receiver 
will grow only once instead of potentially
+   * many times and hash collisions get less probable.
+   *
+   * @param minCapacity the desired minimum capacity.
+   */
+  @Override
+  public void ensureCapacity(int minCapacity) {
+    if (table.length < minCapacity) {
+      int newCapacity = nextPrime(minCapacity);
+      rehash(newCapacity);
+    }
+  }
+
+  /**
+   * Applies a procedure to each key of the receiver, if any. Note: Iterates 
over the keys in no particular order.
+   * Subclasses can define a particular order, for example, "sorted by key". 
All methods which <i>can</i> be expressed
+   * in terms of this method (most methods can) <i>must guarantee</i> to use 
the <i>same</i> order defined by this
+   * method, even if it is no particular order. This is necessary so that, for 
example, methods <tt>keys</tt> and
+   * <tt>values</tt> will yield association pairs, not two uncorrelated lists.
+   *
+   * @param procedure the procedure to be applied. Stops iteration if the 
procedure returns <tt>false</tt>, otherwise
+   *                  continues.
+   * @return <tt>false</tt> if the procedure stopped before all keys where 
iterated over, <tt>true</tt> otherwise.
+   */
+  @SuppressWarnings("unchecked")
+  public boolean forEachKey(ObjectProcedure<K> procedure) {
+    for (int i = table.length; i-- > 0;) {
+      if (state[i] == FULL && !procedure.apply((K)table[i])) {
+        return false;
+      }
+    }
+    return true;
+  }
+
+  /**
+   * Applies a procedure to each (key,value) pair of the receiver, if any. 
Iteration order is guaranteed to be
+   * <i>identical</i> to the order used by method {@link 
#forEachKey(ObjectProcedure)}.
+   *
+   * @param procedure the procedure to be applied. Stops iteration if the 
procedure returns <tt>false</tt>, otherwise
+   *                  continues.
+   * @return <tt>false</tt> if the procedure stopped before all keys where 
iterated over, <tt>true</tt> otherwise.
+   */
+  @SuppressWarnings("unchecked")
+  public boolean forEachPair(ObjectObjectProcedure<K,V> procedure) {
+    for (int i = table.length; i-- > 0;) {
+      if (state[i] == FULL && !procedure.apply((K)table[i], (V)values[i])) {
+        return false;
+      }
+    }
+    return true;
+  }
+
+  /**
+   * Returns the value associated with the specified key. It is often a good 
idea to first check with {@link
+   * #containsKey(Object)} whether the given key has a value associated or 
not, i.e. whether there exists an association
+   * for the given key or not.
+   *
+   * @param key the key to be searched for.
+   * @return the value associated with the specified key; <tt>0</tt> if no 
such key is present.
+   */
+  @SuppressWarnings("unchecked")
+  @Override
+  public V get(Object key) {
+    int i = indexOfKey((K)key);
+    if (i < 0) {
+      return null;
+    } //not contained
+    return (V)values[i];
+  }
+
+  /**
+   * @param key the key to be added to the receiver.
+   * @return the index where the key would need to be inserted, if it is not 
already contained. Returns -index-1 if the
+   *         key is already contained at slot index. Therefore, if the 
returned index < 0, then it is already contained
+   *         at slot -index-1. If the returned index >= 0, then it is NOT 
already contained and should be inserted at
+   *         slot index.
+   */
+  protected int indexOfInsertion(K key) {
+    Object[] tab = table;
+    byte[] stat = state;
+    int length = tab.length;
+
+    int hash = key.hashCode() & 0x7FFFFFFF;
+    int i = hash % length;
+    int decrement = hash % (length - 2); // double hashing, see 
http://www.eece.unm.edu/faculty/heileman/hash/node4.html
+    //int decrement = (hash / length) % length;
+    if (decrement == 0) {
+      decrement = 1;
+    }
+
+    // stop if we find a removed or free slot, or if we find the key itself
+    // do NOT skip over removed slots (yes, open addressing is like that...)
+    while (stat[i] == FULL && !equalsMindTheNull(key, tab[i])) {
+      i -= decrement;
+      //hashCollisions++;
+      if (i < 0) {
+        i += length;
+      }
+    }
+
+    if (stat[i] == REMOVED) {
+      // stop if we find a free slot, or if we find the key itself.
+      // do skip over removed slots (yes, open addressing is like that...)
+      // assertion: there is at least one FREE slot.
+      int j = i;
+      while (stat[i] != FREE && (stat[i] == REMOVED || tab[i] != key)) {
+        i -= decrement;
+        //hashCollisions++;
+        if (i < 0) {
+          i += length;
+        }
+      }
+      if (stat[i] == FREE) {
+        i = j;
+      }
+    }
+
+
+    if (stat[i] == FULL) {
+      // key already contained at slot i.
+      // return a negative number identifying the slot.
+      return -i - 1;
+    }
+    // not already contained, should be inserted at slot i.
+    // return a number >= 0 identifying the slot.
+    return i;
+  }
+
+  /**
+   * @param key the key to be searched in the receiver.
+   * @return the index where the key is contained in the receiver, returns -1 
if the key was not found.
+   */
+  protected int indexOfKey(K key) {
+    Object[] tab = table;
+    byte[] stat = state;
+    int length = tab.length;
+
+    int hash = key.hashCode() & 0x7FFFFFFF;
+    int i = hash % length;
+    int decrement = hash % (length - 2); // double hashing, see 
http://www.eece.unm.edu/faculty/heileman/hash/node4.html
+    //int decrement = (hash / length) % length;
+    if (decrement == 0) {
+      decrement = 1;
+    }
+
+    // stop if we find a free slot, or if we find the key itself.
+    // do skip over removed slots (yes, open addressing is like that...)
+    while (stat[i] != FREE && (stat[i] == REMOVED || !equalsMindTheNull(key, 
tab[i]))) {
+      i -= decrement;
+      //hashCollisions++;
+      if (i < 0) {
+        i += length;
+      }
+    }
+
+    if (stat[i] == FREE) {
+      return -1;
+    } // not found
+    return i; //found, return index where key is contained
+  }
+
+  /**
+   * @param value the value to be searched in the receiver.
+   * @return the index where the value is contained in the receiver, returns 
-1 if the value was not found.
+   */
+  protected int indexOfValue(V value) {
+    Object[] val = values;
+    byte[] stat = state;
+
+    for (int i = stat.length; --i >= 0;) {
+      if (stat[i] == FULL && equalsMindTheNull(val[i], value)) {
+        return i;
+      }
+    }
+
+    return -1; // not found
+  }
+
+  /**
+   * Fills all keys contained in the receiver into the specified list. Fills 
the list, starting at index 0. After this
+   * call returns the specified list has a new size that equals 
<tt>this.size()</tt>. 
+   * This method can be used
+   * to iterate over the keys of the receiver.
+   *
+   * @param list the list to be filled, can have any size.
+   */
+  @SuppressWarnings("unchecked")
+  public void keys(List<K> list) {
+    list.clear();
+  
+
+    Object [] tab = table;
+    byte[] stat = state;
+
+    for (int i = tab.length; i-- > 0;) {
+      if (stat[i] == FULL) {
+        list.add((K)tab[i]);
+      }
+    }
+  }
+
+  /**
+   * Associates the given key with the given value. Replaces any old 
<tt>(key,someOtherValue)</tt> association, if
+   * existing.
+   *
+   * @param key   the key the value shall be associated with.
+   * @param value the value to be associated.
+   * @return <tt>true</tt> if the receiver did not already contain such a key; 
<tt>false</tt> if the receiver did
+   *         already contain such a key - the new value has now replaced the 
formerly associated value.
+   */
+  @SuppressWarnings("unchecked")
+  @Override
+  public V put(K key, V value) {
+    int i = indexOfInsertion(key);
+    if (i < 0) { //already contained
+      i = -i - 1;
+      V previous = (V) this.values[i];
+      this.values[i] = value;
+      return previous;
+    }
+
+    if (this.distinct > this.highWaterMark) {
+      int newCapacity = chooseGrowCapacity(this.distinct + 1, 
this.minLoadFactor, this.maxLoadFactor);
+      rehash(newCapacity);
+      return put(key, value);
+    }
+
+    this.table[i] = key;
+    this.values[i] = value;
+    if (this.state[i] == FREE) {
+      this.freeEntries--;
+    }
+    this.state[i] = FULL;
+    this.distinct++;
+
+    if (this.freeEntries < 1) { //delta
+      int newCapacity = chooseGrowCapacity(this.distinct + 1, 
this.minLoadFactor, this.maxLoadFactor);
+      rehash(newCapacity);
+    }
+
+    return null;
+  }
+
+  /**
+   * Rehashes the contents of the receiver into a new table with a smaller or 
larger capacity. This method is called
+   * automatically when the number of keys in the receiver exceeds the high 
water mark or falls below the low water
+   * mark.
+   */
+  @SuppressWarnings("unchecked")
+  protected void rehash(int newCapacity) {
+    int oldCapacity = table.length;
+    //if (oldCapacity == newCapacity) return;
+
+    Object[] oldTable = table;
+    Object[] oldValues = values;
+    byte[] oldState = state;
+
+    Object[] newTable = new Object[newCapacity];
+    Object[] newValues = new Object[newCapacity];
+    byte[] newState = new byte[newCapacity];
+
+    this.lowWaterMark = chooseLowWaterMark(newCapacity, this.minLoadFactor);
+    this.highWaterMark = chooseHighWaterMark(newCapacity, this.maxLoadFactor);
+
+    this.table = newTable;
+    this.values = newValues;
+    this.state = newState;
+    this.freeEntries = newCapacity - this.distinct; // delta
+
+    for (int i = oldCapacity; i-- > 0;) {
+      if (oldState[i] == FULL) {
+        Object element = oldTable[i];
+        int index = indexOfInsertion((K)element);
+        newTable[index] = element;
+        newValues[index] = oldValues[i];
+        newState[index] = FULL;
+      }
+    }
+  }
+
+  /**
+   * Removes the given key with its associated element from the receiver, if 
present.
+   *
+   * @param key the key to be removed from the receiver.
+   * @return <tt>true</tt> if the receiver contained the specified key, 
<tt>false</tt> otherwise.
+   */
+  @SuppressWarnings("unchecked")
+  @Override
+  public V remove(Object key) {
+    int i = indexOfKey((K)key);
+    if (i < 0) {
+      return null;
+    }
+    // key not contained
+    V removed = (V) values[i];
+
+    this.state[i] = REMOVED;
+    //this.values[i]=0; // delta
+    this.distinct--;
+
+    if (this.distinct < this.lowWaterMark) {
+      int newCapacity = chooseShrinkCapacity(this.distinct, 
this.minLoadFactor, this.maxLoadFactor);
+      rehash(newCapacity);
+    }
+
+    return removed;
+  }
+
+  /**
+   * Initializes the receiver.
+   *
+   * @param initialCapacity the initial capacity of the receiver.
+   * @param minLoadFactor   the minLoadFactor of the receiver.
+   * @param maxLoadFactor   the maxLoadFactor of the receiver.
+   * @throws IllegalArgumentException if <tt>initialCapacity < 0 || 
(minLoadFactor < 0.0 || minLoadFactor >= 1.0) ||
+   *                                  (maxLoadFactor <= 0.0 || maxLoadFactor 
>= 1.0) || (minLoadFactor >=
+   *                                  maxLoadFactor)</tt>.
+   */
+  @Override
+  protected void setUp(int initialCapacity, double minLoadFactor, double 
maxLoadFactor) {
+    int capacity = initialCapacity;
+    super.setUp(capacity, minLoadFactor, maxLoadFactor);
+    capacity = nextPrime(capacity);
+    if (capacity == 0) {
+      capacity = 1;
+    } // open addressing needs at least one FREE slot at any time.
+
+    this.table = new Object[capacity];
+    this.values = new Object[capacity];
+    this.state = new byte[capacity];
+
+    // memory will be exhausted long before this pathological case happens, 
anyway.
+    this.minLoadFactor = minLoadFactor;
+    if (capacity == PrimeFinder.LARGEST_PRIME) {
+      this.maxLoadFactor = 1.0;
+    } else {
+      this.maxLoadFactor = maxLoadFactor;
+    }
+
+    this.distinct = 0;
+    this.freeEntries = capacity; // delta
+
+    // lowWaterMark will be established upon first expansion.
+    // establishing it now (upon instance construction) would immediately make 
the table shrink upon first put(...).
+    // After all the idea of an "initialCapacity" implies violating 
lowWaterMarks when an object is young.
+    // See ensureCapacity(...)
+    this.lowWaterMark = 0;
+    this.highWaterMark = chooseHighWaterMark(capacity, this.maxLoadFactor);
+  }
+
+  /**
+   * Trims the capacity of the receiver to be the receiver's current size. 
Releases any superfluous internal memory. An
+   * application can use this operation to minimize the storage of the 
receiver.
+   */
+  @Override
+  public void trimToSize() {
+    // * 1.2 because open addressing's performance exponentially degrades 
beyond that point
+    // so that even rehashing the table can take very long
+    int newCapacity = nextPrime((int) (1 + 1.2 * size()));
+    if (table.length > newCapacity) {
+      rehash(newCapacity);
+    }
+  }
+
+  /**
+   * Access for unit tests.
+   * @param capacity
+   * @param minLoadFactor
+   * @param maxLoadFactor
+   */
+  void getInternalFactors(int[] capacity, 
+      double[] minLoadFactor, 
+      double[] maxLoadFactor) {
+    capacity[0] = table.length;
+    minLoadFactor[0] = this.minLoadFactor;
+    maxLoadFactor[0] = this.maxLoadFactor;
+  }
+
+  private class MapEntry implements Map.Entry<K,V> {
+    private final K key;
+    private final V value;
+    
+    MapEntry(K key, V value) {
+      this.key = key;
+      this.value = value;
+    }
+
+    @Override
+    public K getKey() {
+      return key;
+    }
+
+    @Override
+    public V getValue() {
+      return value;
+    }
+
+    @Override
+    public V setValue(V value) {
+      throw new UnsupportedOperationException("Map.Entry.setValue not 
supported for OpenHashMap");
+    }
+    
+  }
+
+  /**
+   * Allocate a set to contain Map.Entry objects for the pairs and return it.
+   */
+  @Override
+  public Set<java.util.Map.Entry<K,V>> entrySet() {
+    final Set<Entry<K, V>> entries = new OpenHashSet<>();
+    forEachPair(new ObjectObjectProcedure<K,V>() {
+      @Override
+      public boolean apply(K key, V value) {
+        entries.add(new MapEntry(key, value));
+        return true;
+      }
+    });
+    return entries;
+  }
+
+  /**
+   * Allocate a set to contain keys and return it.
+   * This violates the 'backing' provisions of the map interface.
+   */
+  @Override
+  public Set<K> keySet() {
+    final Set<K> keys = new OpenHashSet<>();
+    forEachKey(new ObjectProcedure<K>() {
+      @Override
+      public boolean apply(K element) {
+        keys.add(element);
+        return true;
+      }
+    });
+    return keys;
+  }
+
+  @Override
+  public void putAll(Map<? extends K,? extends V> m) {
+    for (Map.Entry<? extends K, ? extends V> e : m.entrySet()) {
+      put(e.getKey(), e.getValue());
+    }
+  }
+
+  /**
+   * Allocate a list to contain the values and return it.
+   * This violates the 'backing' provision of the Map interface.
+   */
+  @Override
+  public Collection<V> values() {
+    final List<V> valueList = new ArrayList<>();
+    forEachPair(new ObjectObjectProcedure<K,V>() {
+      @Override
+      public boolean apply(K key, V value) {
+        valueList.add(value);
+        return true;
+      }
+    });
+    return valueList;
+  }
+
+  @SuppressWarnings("unchecked")
+  @Override
+  public boolean equals(Object obj) {
+    if (!(obj instanceof OpenHashMap)) {
+      return false;
+    }
+    final OpenHashMap<K,V> o = (OpenHashMap<K,V>) obj;
+    if (o.size() != size()) {
+      return false;
+    }
+    final boolean[] equal = new boolean[1];
+    equal[0] = true;
+    forEachPair(new ObjectObjectProcedure<K,V>() {
+      @Override
+      public boolean apply(K key, V value) {
+        Object ov = o.get(key);
+        if (!value.equals(ov)) {
+          equal[0] = false;
+          return false;
+        }
+        return true;
+      }
+    });
+    return equal[0];
+  }
+
+  @Override
+  public String toString() {
+    final StringBuilder sb = new StringBuilder();
+    sb.append('{');
+    forEachPair(new ObjectObjectProcedure<K,V>() {
+      @Override
+      public boolean apply(K key, V value) {
+        sb.append('[');
+        sb.append(key);
+        sb.append(" -> ");
+        sb.append(value);
+        sb.append("] ");
+        return true;
+      }
+    });
+    sb.append('}');
+    return sb.toString();
+  }
+}

http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/map/PrimeFinder.java
----------------------------------------------------------------------
diff --git a/core/src/main/java/org/apache/mahout/math/map/PrimeFinder.java 
b/core/src/main/java/org/apache/mahout/math/map/PrimeFinder.java
new file mode 100644
index 0000000..b02611e
--- /dev/null
+++ b/core/src/main/java/org/apache/mahout/math/map/PrimeFinder.java
@@ -0,0 +1,145 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.map;
+
+import java.util.Arrays;
+
+/**
+ * Not of interest for users; only for implementors of hashtables.
+ * Used to keep hash table capacities prime numbers.
+ *
+ * <p>Choosing prime numbers as hash table capacities is a good idea to keep 
them working fast,
+ * particularly under hash table expansions.
+ *
+ * <p>However, JDK 1.2, JGL 3.1 and many other toolkits do nothing to keep 
capacities prime.
+ * This class provides efficient means to choose prime capacities.
+ *
+ * <p>Choosing a prime is <tt>O(log 300)</tt> (binary search in a list of 300 
int's).
+ * Memory requirements: 1 KB static memory.
+ *
+ */
+public final class PrimeFinder {
+
+  /** The largest prime this class can generate; currently equal to 
<tt>Integer.MAX_VALUE</tt>. */
+  public static final int LARGEST_PRIME = Integer.MAX_VALUE; //yes, it is 
prime.
+
+  /**
+   * The prime number list consists of 11 chunks. Each chunk contains prime 
numbers. A chunk starts with a prime P1. The
+   * next element is a prime P2. P2 is the smallest prime for which holds: P2 
>= 2*P1. The next element is P3, for which
+   * the same holds with respect to P2, and so on.
+   *
+   * Chunks are chosen such that for any desired capacity >= 1000 the list 
includes a prime number <= desired capacity *
+   * 1.11 (11%). For any desired capacity >= 200 the list includes a prime 
number <= desired capacity * 1.16 (16%). For
+   * any desired capacity >= 16 the list includes a prime number <= desired 
capacity * 1.21 (21%).
+   *
+   * Therefore, primes can be retrieved which are quite close to any desired 
capacity, which in turn avoids wasting
+   * memory. For example, the list includes 
1039,1117,1201,1277,1361,1439,1523,1597,1759,1907,2081. So if you need a
+   * prime >= 1040, you will find a prime <= 1040*1.11=1154.
+   *
+   * Chunks are chosen such that they are optimized for a hashtable 
growthfactor of 2.0; If your hashtable has such a
+   * growthfactor then, after initially "rounding to a prime" upon hashtable 
construction, it will later expand to prime
+   * capacities such that there exist no better primes.
+   *
+   * In total these are about 32*10=320 numbers -> 1 KB of static memory 
needed. If you are stingy, then delete every
+   * second or fourth chunk.
+   */
+
+  private static final int[] PRIME_CAPACITIES = {
+    //chunk #0
+    LARGEST_PRIME,
+
+    //chunk #1
+    5, 11, 23, 47, 97, 197, 397, 797, 1597, 3203, 6421, 12853, 25717, 51437, 
102877, 205759,
+    411527, 823117, 1646237, 3292489, 6584983, 13169977, 26339969, 52679969, 
105359939,
+    210719881, 421439783, 842879579, 1685759167,
+
+    //chunk #2
+    433, 877, 1759, 3527, 7057, 14143, 28289, 56591, 113189, 226379, 452759, 
905551, 1811107,
+    3622219, 7244441, 14488931, 28977863, 57955739, 115911563, 231823147, 
463646329, 927292699,
+    1854585413,
+
+    //chunk #3
+    953, 1907, 3821, 7643, 15287, 30577, 61169, 122347, 244703, 489407, 
978821, 1957651, 3915341,
+    7830701, 15661423, 31322867, 62645741, 125291483, 250582987, 501165979, 
1002331963,
+    2004663929,
+
+    //chunk #4
+    1039, 2081, 4177, 8363, 16729, 33461, 66923, 133853, 267713, 535481, 
1070981, 2141977, 4283963,
+    8567929, 17135863, 34271747, 68543509, 137087021, 274174111, 548348231, 
1096696463,
+
+    //chunk #5
+    31, 67, 137, 277, 557, 1117, 2237, 4481, 8963, 17929, 35863, 71741, 
143483, 286973, 573953,
+    1147921, 2295859, 4591721, 9183457, 18366923, 36733847, 73467739, 
146935499, 293871013,
+    587742049, 1175484103,
+
+    //chunk #6
+    599, 1201, 2411, 4831, 9677, 19373, 38747, 77509, 155027, 310081, 620171, 
1240361, 2480729,
+    4961459, 9922933, 19845871, 39691759, 79383533, 158767069, 317534141, 
635068283, 1270136683,
+
+    //chunk #7
+    311, 631, 1277, 2557, 5119, 10243, 20507, 41017, 82037, 164089, 328213, 
656429, 1312867,
+    2625761, 5251529, 10503061, 21006137, 42012281, 84024581, 168049163, 
336098327, 672196673,
+    1344393353,
+
+    //chunk #8
+    3, 7, 17, 37, 79, 163, 331, 673, 1361, 2729, 5471, 10949, 21911, 43853, 
87719, 175447, 350899,
+    701819, 1403641, 2807303, 5614657, 11229331, 22458671, 44917381, 89834777, 
179669557,
+    359339171, 718678369, 1437356741,
+
+    //chunk #9
+    43, 89, 179, 359, 719, 1439, 2879, 5779, 11579, 23159, 46327, 92657, 
185323, 370661, 741337,
+    1482707, 2965421, 5930887, 11861791, 23723597, 47447201, 94894427, 
189788857, 379577741,
+    759155483, 1518310967,
+
+    //chunk #10
+    379, 761, 1523, 3049, 6101, 12203, 24407, 48817, 97649, 195311, 390647, 
781301, 1562611,
+    3125257, 6250537, 12501169, 25002389, 50004791, 100009607, 200019221, 
400038451, 800076929,
+    1600153859
+  };
+
+
+  static { //initializer
+    // The above prime numbers are formatted for human readability.
+    // To find numbers fast, we sort them once and for all.
+
+    Arrays.sort(PRIME_CAPACITIES);
+  }
+
+  /** Makes this class non instantiable, but still let's others inherit from 
it. */
+  private PrimeFinder() {
+  }
+
+  /**
+   * Returns a prime number which is {@code <= desiredCapacity} and very close 
to {@code desiredCapacity}
+   * (within 11% if {@code desiredCapacity <= 1000}).
+   *
+   * @param desiredCapacity the capacity desired by the user.
+   * @return the capacity which should be used for a hashtable.
+   */
+  public static int nextPrime(int desiredCapacity) {
+    int i = java.util.Arrays.binarySearch(PRIME_CAPACITIES, desiredCapacity);
+    if (i < 0) {
+      // desired capacity not found, choose next prime greater than desired 
capacity
+      i = -i - 1; // remember the semantics of binarySearch...
+    }
+    return PRIME_CAPACITIES[i];
+  }
+
+}

http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/map/QuickOpenIntIntHashMap.java
----------------------------------------------------------------------
diff --git 
a/core/src/main/java/org/apache/mahout/math/map/QuickOpenIntIntHashMap.java 
b/core/src/main/java/org/apache/mahout/math/map/QuickOpenIntIntHashMap.java
new file mode 100644
index 0000000..6a7cef8
--- /dev/null
+++ b/core/src/main/java/org/apache/mahout/math/map/QuickOpenIntIntHashMap.java
@@ -0,0 +1,215 @@
+/*
+Copyright ï¿½ 1999 CERN - European Organization for Nuclear Research.
+Permission to use, copy, modify, distribute and sell this software and its 
documentation for any purpose 
+is hereby granted without fee, provided that the above copyright notice appear 
in all copies and 
+that both that copyright notice and this permission notice appear in 
supporting documentation. 
+CERN makes no representations about the suitability of this software for any 
purpose. 
+It is provided "as is" without expressed or implied warranty.
+*/
+package org.apache.mahout.math.map;
+
+/**
+ * Status: Experimental; Do not use for production yet. Hash map holding 
(key,value) associations of type
+ * <tt>(int-->int)</tt>; Automatically grows and shrinks as needed; 
Implemented using open addressing with double
+ * hashing. First see the <a href="package-summary.html">package summary</a> 
and javadoc <a
+ * href="package-tree.html">tree view</a> to get the broad picture.
+ *
+ * Implements open addressing with double hashing, using "Brent's variation". 
Brent's variation slows insertions a bit
+ * down (not much) but reduces probes (collisions) for successful searches, in 
particular for large load factors. (It
+ * does not improve unsuccessful searches.) See D. Knuth, Searching and 
Sorting, 3rd ed., p.533-545
+ *
+ * @author [email protected]
+ * @version 1.0, 09/24/99
+ * @see java.util.HashMap
+ */
+class QuickOpenIntIntHashMap extends OpenIntIntHashMap {
+  //public int totalProbesSaved = 0; // benchmark only
+
+  /** Constructs an empty map with default capacity and default load factors. 
*/
+  QuickOpenIntIntHashMap() {
+    this(DEFAULT_CAPACITY);
+  }
+
+  /**
+   * Constructs an empty map with the specified initial capacity and default 
load factors.
+   *
+   * @param initialCapacity the initial capacity of the map.
+   * @throws IllegalArgumentException if the initial capacity is less than 
zero.
+   */
+  QuickOpenIntIntHashMap(int initialCapacity) {
+    this(initialCapacity, DEFAULT_MIN_LOAD_FACTOR, DEFAULT_MAX_LOAD_FACTOR);
+  }
+
+  /**
+   * Constructs an empty map with the specified initial capacity and the 
specified minimum and maximum load factor.
+   *
+   * @param initialCapacity the initial capacity.
+   * @param minLoadFactor   the minimum load factor.
+   * @param maxLoadFactor   the maximum load factor.
+   * @throws IllegalArgumentException if <tt>initialCapacity < 0 || 
(minLoadFactor < 0.0 || minLoadFactor >= 1.0) ||
+   *                                  (maxLoadFactor <= 0.0 || maxLoadFactor 
>= 1.0) || (minLoadFactor >=
+   *                                  maxLoadFactor)</tt>.
+   */
+  QuickOpenIntIntHashMap(int initialCapacity, double minLoadFactor, double 
maxLoadFactor) {
+    setUp(initialCapacity, minLoadFactor, maxLoadFactor);
+  }
+
+  /**
+   * Associates the given key with the given value. Replaces any old 
<tt>(key,someOtherValue)</tt> association, if
+   * existing.
+   *
+   * @param key   the key the value shall be associated with.
+   * @param value the value to be associated.
+   * @return <tt>true</tt> if the receiver did not already contain such a key; 
<tt>false</tt> if the receiver did
+   *         already contain such a key - the new value has now replaced the 
formerly associated value.
+   */
+  @Override
+  public boolean put(int key, int value) {
+    /*
+       This is open addressing with double hashing, using "Brent's variation".
+       Brent's variation slows insertions a bit down (not much) but reduces 
probes (collisions) for successful searches,
+       in particular for large load factors.
+       (It does not improve unsuccessful searches.)
+       See D. Knuth, Searching and Sorting, 3rd ed., p.533-545
+
+       h1(key) = hash % M
+       h2(key) = decrement = Max(1, hash/M % M)
+       M is prime = capacity = table.length
+       probing positions are table[(h1-j*h2) % M] for j=0,1,...
+       (M and h2 could also be chosen differently, but h2 is required to be 
relative prime to M.)
+    */
+
+    int[] tab = table;
+    byte[] stat = state;
+    int length = tab.length;
+
+    int hash = HashFunctions.hash(key) & 0x7FFFFFFF;
+    int i = hash % length;
+    int decrement = (hash / length) % length;
+    if (decrement == 0) {
+      decrement = 1;
+    }
+
+    // stop if we find a removed or free slot, or if we find the key itself
+    // do NOT skip over removed slots (yes, open addressing is like that...)
+    //int comp = comparisons;
+    int t = 0;  // the number of probes
+    int p0 = i; // the first position to probe
+    while (stat[i] == FULL && tab[i] != key) {
+      t++;
+      i -= decrement;
+      //hashCollisions++;
+      if (i < 0) {
+        i += length;
+      }
+    }
+    if (stat[i] == FULL) {
+      // key already contained at slot i.
+      this.values[i] = value;
+      return false;
+    }
+    // not already contained, should be inserted at slot i.
+
+    if (this.distinct > this.highWaterMark) {
+      int newCapacity = chooseGrowCapacity(this.distinct + 1, 
this.minLoadFactor, this.maxLoadFactor);
+      rehash(newCapacity);
+      return put(key, value);
+    }
+
+    /*
+    Brent's variation does a local reorganization to reduce probes. It 
essentially means:
+    We test whether it is possible to move the association we probed first 
(table[p0]) out of the way.
+    If this is possible, it will reduce probes for the key to be inserted, 
since it takes its place;
+    it gets hit earlier.
+    However, future probes for the key that we move out of the way will 
increase.
+    Thus we only move it out of the way, if we have a net gain, that is, if we 
save more probes than we loose.
+    For the first probe we safe more than we loose if the number of probes we 
needed was >=2 (t>=2).
+    If the first probe cannot be moved out of the way, we try the next probe 
(p1).
+    Now we safe more than we loose if t>=3.
+    We repeat this until we find that we cannot gain or that we can indeed 
move p(x) out of the way.
+
+    Note: Under the great majority of insertions t<=1, so the loop is entered 
very infrequently.
+    */
+    while (t > 1) {
+      int key0 = tab[p0];
+      hash = HashFunctions.hash(key0) & 0x7FFFFFFF;
+      decrement = (hash / length) % length;
+      if (decrement == 0) {
+        decrement = 1;
+      }
+      int pc = p0 - decrement; // pc = (p0-j*decrement) % M, j=1,2,..
+      if (pc < 0) {
+        pc += length;
+      }
+
+      if (stat[pc] != FREE) { // not a free slot, continue searching for free 
slot to move to, or break.
+        p0 = pc;
+        t--;
+      } else { // free or removed slot found, now move...
+        tab[pc] = key0;
+        stat[pc] = FULL;
+        values[pc] = values[p0];
+        i = p0; // prepare to insert: table[p0]=key
+        t = 0; // break loop
+      }
+    }
+
+    this.table[i] = key;
+    this.values[i] = value;
+    if (this.state[i] == FREE) {
+      this.freeEntries--;
+    }
+    this.state[i] = FULL;
+    this.distinct++;
+
+    if (this.freeEntries < 1) { //delta
+      int newCapacity = chooseGrowCapacity(this.distinct + 1, 
this.minLoadFactor, this.maxLoadFactor);
+      rehash(newCapacity);
+    }
+
+    return true;
+  }
+
+  /**
+   * Rehashes the contents of the receiver into a new table with a smaller or 
larger capacity. This method is called
+   * automatically when the number of keys in the receiver exceeds the high 
water mark or falls below the low water
+   * mark.
+   */
+  @Override
+  protected void rehash(int newCapacity) {
+    int oldCapacity = table.length;
+    //if (oldCapacity == newCapacity) return;
+
+    int[] oldTable = table;
+    int[] oldValues = values;
+    byte[] oldState = state;
+
+    int[] newTable = new int[newCapacity];
+    int[] newValues = new int[newCapacity];
+    byte[] newState = new byte[newCapacity];
+
+    this.lowWaterMark = chooseLowWaterMark(newCapacity, this.minLoadFactor);
+    this.highWaterMark = chooseHighWaterMark(newCapacity, this.maxLoadFactor);
+
+    this.table = newTable;
+    this.values = newValues;
+    this.state = newState;
+    this.freeEntries = newCapacity - this.distinct; // delta
+
+    int tmp = this.distinct;
+    this.distinct = Integer.MIN_VALUE; // switch of watermarks
+    for (int i = oldCapacity; i-- > 0;) {
+      if (oldState[i] == FULL) {
+        put(oldTable[i], oldValues[i]);
+        /*
+        int element = oldTable[i];
+        int index = indexOfInsertion(element);
+        newTable[index]=element;
+        newValues[index]=oldValues[i];
+        newState[index]=FULL;
+        */
+      }
+    }
+    this.distinct = tmp;
+  }
+}

http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/map/package-info.java
----------------------------------------------------------------------
diff --git a/core/src/main/java/org/apache/mahout/math/map/package-info.java 
b/core/src/main/java/org/apache/mahout/math/map/package-info.java
new file mode 100644
index 0000000..9356f45
--- /dev/null
+++ b/core/src/main/java/org/apache/mahout/math/map/package-info.java
@@ -0,0 +1,250 @@
+/**
+ * <HTML>
+ * <BODY>
+ * Automatically growing and shrinking maps holding objects or primitive
+ * data types such as <tt>int</tt>, <tt>double</tt>, etc. Currently all maps 
are
+ * based upon hashing.
+ * <h2><a name="Overview"></a>1. Overview</h2>
+ * <p>The map package offers flexible object oriented abstractions modelling 
automatically
+ * resizing maps. It is designed to be scalable in terms of performance and 
memory
+ * requirements.</p>
+ * <p>Features include: </p>
+ * <p></p>
+ * <ul>
+ * <li>Maps operating on objects as well as all primitive data types such as 
<code>int</code>,
+ * <code>double</code>, etc.
+ * </li>
+ * <li>Compact representations</li>
+ * <li>Support for quick access to associations</li>
+ * <li>A number of general purpose map operations</li>
+ * </ul>
+ * <p>File-based I/O can be achieved through the standard Java built-in 
serialization
+ * mechanism. All classes implement the {@link java.io.Serializable} interface.
+ * However, the toolkit is entirely decoupled from advanced I/O. It provides 
data
+ * structures and algorithms only.
+ * <p> This toolkit borrows some terminology from the Javasoft <a
+ * 
href="http://www.javasoft.com/products/jdk/1.2/docs/guide/collections/index.html";>
+ * Collections framework</a> written by Josh Bloch and introduced in JDK 1.2.
+ * <h2>2. Introduction</h2>
+ * <p>A map is an associative container that manages a set of (key,value) 
pairs.
+ * It is useful for implementing a collection of one-to-one mappings. A 
(key,value)
+ * pair is called an <i>association</i>. A value can be looked up up via its 
key.
+ * Associations can quickly be set, removed and retrieved. They are stored in a
+ * hashing structure based on the hash code of their keys, which is obtained by
+ * using a hash function. </p>
+ * <p> A map can, for example, contain <tt>Name-->Location</tt> associations 
like
+ * <tt>{("Pete", "Geneva"), ("Steve", "Paris"), ("Robert", "New York")}</tt> 
used
+ * in address books or <tt>Index-->Value</tt> mappings like <tt>{(0, 100), (3,
+ * 1000), (100000, 70)}</tt> representing sparse lists or matrices. For example
+ * this could mean at index 0 we have a value of 100, at index 3 we have a 
value
+ * of 1000, at index 1000000 we have a value of 70, and at all other indexes we
+ * have a value of, say, zero. Another example is a map of IP addresses to 
domain
+ * names (DNS). Maps can also be useful to represent<i> multi sets</i>, that 
is,
+ * sets where elements can occur more than once. For multi sets one would have
+ * <tt>Value-->Frequency</tt> mappings like <tt>{(100, 1), (50, 1000), (101, 
3))}</tt>
+ * meaning element 100 occurs 1 time, element 50 occurs 1000 times, element 101
+ * occurs 3 times. Further, maps can also manage 
<tt>ObjectIdentifier-->Object</tt>
+ * mappings like <tt>{(12, obj1), (7, obj2), (10000, obj3), (9, obj4)}</tt> 
used
+ * in Object Databases.
+ * <p> A map cannot contain two or more <i>equal</i> keys; a key can map to at 
most
+ * one value. However, more than one key can map to identical values. For 
primitive
+ * data types "equality" of keys is defined as identity (operator <tt>==</tt>).
+ * For maps using <tt>Object</tt> keys, the meaning of "equality" can be 
specified
+ * by the user upon instance construction. It can either be defined to be 
identity
+ * (operator <tt>==</tt>) or to be given by the method {@link 
java.lang.Object#equals(Object)}.
+ * Associations of kind <tt>(AnyType,Object)</tt> can be of the form 
<tt>(AnyKey,null)
+ * </tt>, i.e. values can be <tt>null</tt>.
+ * <p> The classes of this package make no guarantees as to the order of the 
elements
+ * returned by iterators; in particular, they do not guarantee that the order 
will
+ * remain constant over time.
+ * <h2></h2>
+ * <h4>Copying</h4>
+ * <p>
+ * <p>Any map can be copied. A copy is <i>equal</i> to the original but 
entirely
+ * independent of the original. So changes in the copy are not reflected in the
+ * original, and vice-versa.
+ * <h2>3. Package organization</h2>
+ * <p>For most primitive data types and for objects there exists a separate 
map version.
+ * All versions are just the same, except that they operate on different data 
types.
+ * Colt includes two kinds of implementations for maps: The two different 
implementations
+ * are tagged <b>Chained</b> and <b>Open</b>.
+ * Note: Chained is no more included. Wherever it is mentioned it is of 
historic interest only.</p>
+ * <ul>
+ * <li><b>Chained</b> uses extendible separate chaining with chains holding 
unsorted
+ * dynamically linked collision lists.
+ * <li><b>Open</b> uses extendible open addressing with double hashing.
+ * </ul>
+ * <p>Class naming follows the schema 
<tt>&lt;Implementation&gt;&lt;KeyType&gt;&lt;ValueType&gt;HashMap</tt>.
+ * For example, a {@link org.apache.mahout.math.map.OpenIntDoubleHashMap} 
holds <tt>(int-->double)</tt>
+ * associations and is implemented with open addressing. A {@link 
org.apache.mahout.math.map.OpenIntObjectHashMap}
+ * holds <tt>(int-->Object)</tt> associations and is implemented with open 
addressing.
+ * </p>
+ * <p>The classes for maps of a given (key,value) type are derived from a 
common
+ * abstract base class tagged 
<tt>Abstract&lt;KeyType&gt;&lt;ValueType&gt;</tt><tt>Map</tt>.
+ * For example, all maps operating on <tt>(int-->double)</tt> associations are
+ * derived from {@link org.apache.mahout.math.map.AbstractIntDoubleMap}, which 
in turn is derived
+ * from an abstract base class tying together all maps regardless of assocation
+ * type, {@link org.apache.mahout.math.set.AbstractSet}. The abstract base 
classes provide skeleton
+ * implementations for all but few methods. Experimental layouts (such as 
chaining,
+ * open addressing, extensible hashing, red-black-trees, etc.) can easily be 
implemented
+ * and inherit a rich set of functionality. Have a look at the javadoc <a 
href="package-tree.html">tree
+ * view</a> to get the broad picture.</p>
+ * <h2>4. Example usage</h2>
+ * <TABLE>
+ * <TD CLASS="PRE">
+ * <PRE>
+ * int[]    keys   = {0    , 3     , 100000, 9   };
+ * double[] values = {100.0, 1000.0, 70.0  , 71.0};
+ * AbstractIntDoubleMap map = new OpenIntDoubleHashMap();
+ * // add several associations
+ * for (int i=0; i &lt; keys.length; i++) map.put(keys[i], values[i]);
+ * log.info("map="+map);
+ * log.info("size="+map.size());
+ * log.info(map.containsKey(3));
+ * log.info("get(3)="+map.get(3));
+ * log.info(map.containsKey(4));
+ * log.info("get(4)="+map.get(4));
+ * log.info(map.containsValue(71.0));
+ * log.info("keyOf(71.0)="+map.keyOf(71.0));
+ * // remove one association
+ * map.removeKey(3);
+ * log.info("\nmap="+map);
+ * log.info(map.containsKey(3));
+ * log.info("get(3)="+map.get(3));
+ * log.info(map.containsValue(1000.0));
+ * log.info("keyOf(1000.0)="+map.keyOf(1000.0));
+ * // clear
+ * map.clear();
+ * log.info("\nmap="+map);
+ * log.info("size="+map.size());
+ * </PRE>
+ * </TD>
+ * </TABLE>
+ * yields the following output
+ * <TABLE>
+ * <TD CLASS="PRE">
+ * <PRE>
+ * map=[0->100.0, 3->1000.0, 9->71.0, 100000->70.0]
+ * size=4
+ * true
+ * get(3)=1000.0
+ * false
+ * get(4)=0.0
+ * true
+ * keyOf(71.0)=9
+ * map=[0->100.0, 9->71.0, 100000->70.0]
+ * false
+ * get(3)=0.0
+ * false
+ * keyOf(1000.0)=-2147483648
+ * map=[]
+ * size=0
+ * </PRE>
+ * </TD>
+ * </TABLE>
+ * <h2> 5. Notes </h2>
+ * <p>
+ * Note that implementations are not synchronized.
+ * <p>
+ * Choosing efficient parameters for hash maps is not always easy.
+ * However, since parameters determine efficiency and memory requirements, 
here is a quick guide how to choose them.
+ * If your use case does not heavily operate on hash maps but uses them just 
because they provide
+ * convenient functionality, you can safely skip this section.
+ * For those of you who care, read on.
+ * <p>
+ * There are three parameters that can be customized upon map construction: 
<tt>initialCapacity</tt>,
+ * <tt>minLoadFactor</tt> and <tt>maxLoadFactor</tt>.
+ * The more memory one can afford, the faster a hash map.
+ * The hash map's capacity is the maximum number of associations that can be 
added without needing to allocate new
+ * internal memory.
+ * A larger capacity means faster adding, searching and removing.
+ * The <tt>initialCapacity</tt> corresponds to the capacity used upon instance 
construction.
+ * <p>
+ * The <tt>loadFactor</tt> of a hash map measures the degree of "fullness".
+ * It is given by the number of assocations (<tt>size()</tt>)
+ * divided by the hash map capacity <tt>(0.0 &lt;= loadFactor &lt;= 1.0)</tt>.
+ * The more associations are added, the larger the loadFactor and the more 
hash map performance degrades.
+ * Therefore, when the loadFactor exceeds a customizable threshold 
(<tt>maxLoadFactor</tt>), the hash map is
+ * automatically grown.
+ * In such a way performance degradation can be avoided.
+ * Similarly, when the loadFactor falls below a customizable threshold 
(<tt>minLoadFactor</tt>), the hash map is
+ * automatically shrinked.
+ * In such a way excessive memory consumption can be avoided.
+ * Automatic resizing (both growing and shrinking) obeys the following 
invariant:
+ * <p>
+ * <tt>capacity * minLoadFactor <= size() <= capacity * maxLoadFactor</tt>
+ * <p> The term <tt>capacity * minLoadFactor</tt> is called the <i>low water 
mark</i>,
+ * <tt>capacity * maxLoadFactor</tt> is called the <i>high water mark</i>. In 
other
+ * words, the number of associations may vary within the water mark 
constraints.
+ * When it goes out of range, the map is automatically resized and memory 
consumption
+ * changes proportionally.
+ * <ul>
+ * <li>To tune for memory at the expense of performance, both increase 
<tt>minLoadFactor</tt> and
+ * <tt>maxLoadFactor</tt>.
+ * <li>To tune for performance at the expense of memory, both decrease 
<tt>minLoadFactor</tt> and
+ * <tt>maxLoadFactor</tt>.
+ * As as special case set <tt>minLoadFactor=0</tt> to avoid any automatic 
shrinking.
+ * </ul>
+ * Resizing large hash maps can be time consuming, <tt>O(size())</tt>, and 
should be avoided if possible (maintaining
+ * primes is not the reason).
+ * Unnecessary growing operations can be avoided if the number of associations 
is known before they are added, or can be
+ * estimated.<p>
+ * In such a case good parameters are as follows:
+ * <p>
+ * <i>For chaining:</i>
+ * <br>Set the <tt>initialCapacity = 1.4*expectedSize</tt> or greater.
+ * <br>Set the <tt>maxLoadFactor = 0.8</tt> or greater.
+ * <p>
+ * <i>For open addressing:</i>
+ * <br>Set the <tt>initialCapacity = 2*expectedSize</tt> or greater. 
Alternatively call <tt>ensureCapacity(...)</tt>.
+ * <br>Set the <tt>maxLoadFactor = 0.5</tt>.
+ * <br>Never set <tt>maxLoadFactor &gt; 0.55</tt>; open addressing 
exponentially slows down beyond that point.
+ * <p>
+ * In this way the hash map will never need to grow and still stay fast.
+ * It is never a good idea to set <tt>maxLoadFactor &lt; 0.1</tt>,
+ * because the hash map would grow too often.
+ * If it is entirelly unknown how many associations the application will use,
+ * the default constructor should be used. The map will grow and shrink as 
needed.
+ * <p>
+ * <b>Comparision of chaining and open addressing</b>
+ * <p> Chaining is faster than open addressing, when assuming unconstrained 
memory
+ * consumption. Open addressing is more space efficient than chaining, because
+ * it does not create entry objects but uses primitive arrays which are 
considerably
+ * smaller. Entry objects consume significant amounts of memory compared to the
+ * information they actually hold. Open addressing also poses no problems to 
the
+ * garbage collector. In contrast, chaining can create millions of entry 
objects
+ * which are linked; a nightmare for any garbage collector. In addition, entry
+ * object creation is a bit slow. <br>
+ * Therefore, with the same amount of memory, or even less memory, hash maps 
with
+ * larger capacity can be maintained under open addressing, which yields 
smaller
+ * loadFactors, which in turn keeps performance competitive with chaining. In 
our
+ * benchmarks, using significantly less memory, open addressing usually is not
+ * more than 1.2-1.5 times slower than chaining.
+ * <p><b>Further readings</b>:
+ * <br>Knuth D., The Art of Computer Programming: Searching and Sorting, 3rd 
ed.
+ * <br>Griswold W., Townsend G., The Design and Implementation of Dynamic 
Hashing for Sets and Tables in Icon,
+ * Software - Practice and Experience, Vol. 23(4), 351-367 (April 1993).
+ * <br>Larson P., Dynamic hash tables, Comm. of the ACM, 31, (4), 1988.
+ * <p>
+ * <b>Performance:</b>
+ * <p>
+ * Time complexity:
+ * <br>The classes offer <i>expected</i> time complexity <tt>O(1)</tt> (i.e. 
constant time) for the basic operations
+ * <tt>put</tt>, <tt>get</tt>, <tt>removeKey</tt>, <tt>containsKey</tt> and 
<tt>size</tt>,
+ * assuming the hash function disperses the elements properly among the 
buckets.
+ * Otherwise, pathological cases, although highly improbable, can occur, 
degrading performance to <tt>O(N)</tt> in the
+ * worst case.
+ * Operations <tt>containsValue</tt> and <tt>keyOf</tt> are <tt>O(N)</tt>.
+ * <p>
+ * Memory requirements for <i>open addressing</i>:
+ * <br>worst case: <tt>memory [bytes] = (1/minLoadFactor) * size() * (1 + 
sizeOf(key) + sizeOf(value))</tt>.
+ * <br>best case: <tt>memory [bytes] = (1/maxLoadFactor) * size() * (1 + 
sizeOf(key) + sizeOf(value))</tt>.
+ * Where <tt>sizeOf(int) = 4</tt>, <tt>sizeOf(double) = 8</tt>, 
<tt>sizeOf(Object) = 4</tt>, etc.
+ * Thus, an <tt>OpenIntIntHashMap</tt> with minLoadFactor=0.25 and 
maxLoadFactor=0.5 and 1000000 associations uses
+ * between 17 MB and 34 MB.
+ * The same map with 1000 associations uses between 17 and 34 KB.
+ * <p>
+ * </BODY>
+ * </HTML>
+ */
+package org.apache.mahout.math.map;

http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/package-info.java
----------------------------------------------------------------------
diff --git a/core/src/main/java/org/apache/mahout/math/package-info.java 
b/core/src/main/java/org/apache/mahout/math/package-info.java
new file mode 100644
index 0000000..de664f0
--- /dev/null
+++ b/core/src/main/java/org/apache/mahout/math/package-info.java
@@ -0,0 +1,4 @@
+/**
+ * Core base classes; Operations on primitive arrays such as sorting, 
partitioning and permuting.
+ */
+package org.apache.mahout.math;

http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/random/AbstractSamplerFunction.java
----------------------------------------------------------------------
diff --git 
a/core/src/main/java/org/apache/mahout/math/random/AbstractSamplerFunction.java 
b/core/src/main/java/org/apache/mahout/math/random/AbstractSamplerFunction.java
new file mode 100644
index 0000000..d657fd9
--- /dev/null
+++ 
b/core/src/main/java/org/apache/mahout/math/random/AbstractSamplerFunction.java
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.math.random;
+
+import org.apache.mahout.math.function.DoubleFunction;
+
+/**
+ * This shim allows samplers to be used to initialize vectors.
+ */
+public abstract class AbstractSamplerFunction extends DoubleFunction 
implements Sampler<Double> {
+  /**
+   * Apply the function to the argument and return the result
+   *
+   * @param ignored Ignored argument
+   * @return A sample from this distribution.
+   */
+  @Override
+  public double apply(double ignored) {
+    return sample();
+  }
+
+  @Override
+  public abstract Double sample();
+}

http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/random/ChineseRestaurant.java
----------------------------------------------------------------------
diff --git 
a/core/src/main/java/org/apache/mahout/math/random/ChineseRestaurant.java 
b/core/src/main/java/org/apache/mahout/math/random/ChineseRestaurant.java
new file mode 100644
index 0000000..8127b92
--- /dev/null
+++ b/core/src/main/java/org/apache/mahout/math/random/ChineseRestaurant.java
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.math.random;
+
+import com.google.common.base.Preconditions;
+import org.apache.mahout.common.RandomUtils;
+import org.apache.mahout.math.list.DoubleArrayList;
+
+import java.util.Random;
+
+/**
+ *
+ * Generates samples from a generalized Chinese restaurant process (or 
Pittman-Yor process).
+ *
+ * The number of values drawn exactly once will asymptotically be equal to the 
discount parameter
+ * as the total number of draws T increases without bound.  The number of 
unique values sampled will
+ * increase as O(alpha * log T) if discount = 0 or O(alpha * T^discount) for 
discount > 0.
+ */
+public final class ChineseRestaurant implements Sampler<Integer> {
+
+  private final double alpha;
+  private double weight = 0;
+  private double discount = 0;
+  private final DoubleArrayList weights = new DoubleArrayList();
+  private final Random rand = RandomUtils.getRandom();
+
+  /**
+   * Constructs a Dirichlet process sampler.  This is done by setting discount 
= 0.
+   * @param alpha  The strength parameter for the Dirichlet process.
+   */
+  public ChineseRestaurant(double alpha) {
+    this(alpha, 0);
+  }
+
+  /**
+   * Constructs a Pitman-Yor sampler.
+   *
+   * @param alpha     The strength parameter that drives the number of unique 
values as a function of draws.
+   * @param discount  The discount parameter that drives the percentage of 
values that occur once in a large sample.
+   */
+  public ChineseRestaurant(double alpha, double discount) {
+    Preconditions.checkArgument(alpha > 0, "Strength Parameter, alpha must be 
greater then 0!");
+    Preconditions.checkArgument(discount >= 0 && discount <= 1, "Must be: 0 <= 
discount <= 1");
+    this.alpha = alpha;
+    this.discount = discount;
+  }
+
+  @Override
+  public Integer sample() {
+    double u = rand.nextDouble() * (alpha + weight);
+    for (int j = 0; j < weights.size(); j++) {
+      // select existing options with probability (w_j - d) / (alpha + w)
+      if (u < weights.get(j) - discount) {
+        weights.set(j, weights.get(j) + 1);
+        weight++;
+        return j;
+      } else {
+        u -= weights.get(j) - discount;
+      }
+    }
+
+    // if no existing item selected, pick new item with probability (alpha - 
d*t) / (alpha + w)
+    // where t is number of pre-existing cases
+    weights.add(1);
+    weight++;
+    return weights.size() - 1;
+  }
+
+  /**
+   * @return the number of unique values that have been returned.
+   */
+  public int size() {
+    return weights.size();
+  }
+
+  /**
+   * @return the number draws so far.
+   */
+  public int count() {
+    return (int) weight;
+  }
+
+  /**
+   * @param j Which value to test.
+   * @return  The number of times that j has been returned so far.
+   */
+  public int count(int j) {
+    Preconditions.checkArgument(j >= 0);
+
+    if (j < weights.size()) {
+      return (int) weights.get(j);
+    } else {
+      return 0;
+    }
+  }
+}

http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/random/Empirical.java
----------------------------------------------------------------------
diff --git a/core/src/main/java/org/apache/mahout/math/random/Empirical.java 
b/core/src/main/java/org/apache/mahout/math/random/Empirical.java
new file mode 100644
index 0000000..78bfec5
--- /dev/null
+++ b/core/src/main/java/org/apache/mahout/math/random/Empirical.java
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.math.random;
+
+import com.google.common.base.Preconditions;
+import org.apache.mahout.common.RandomUtils;
+
+import java.util.Random;
+
+/**
+ * Samples from an empirical cumulative distribution.
+ */
+public final class Empirical extends AbstractSamplerFunction {
+  private final Random gen;
+  private final boolean exceedMinimum;
+  private final boolean exceedMaximum;
+
+  private final double[] x;
+  private final double[] y;
+  private final int n;
+
+  /**
+   * Sets up a sampler for a specified empirical cumulative distribution 
function.  The distribution
+   * can have optional exponential tails on either or both ends, but otherwise 
does a linear
+   * interpolation between known points.
+   *
+   * @param exceedMinimum  Should we generate samples less than the smallest 
quantile (i.e. generate a left tail)?
+   * @param exceedMaximum  Should we generate samples greater than the largest 
observed quantile (i.e. generate a right
+   *                       tail)?
+   * @param samples        The number of samples observed to get the quantiles.
+   * @param ecdf           Alternating values that represent which percentile 
(in the [0..1] range)
+   *                       and values.  For instance, if you have the min, 
median and max of 1, 3, 10
+   *                       you should pass 0.0, 1, 0.5, 3, 1.0, 10.  Note that 
the list must include
+   *                       the 0-th (1.0-th) quantile if the left (right) tail 
is not allowed.
+   */
+  public Empirical(boolean exceedMinimum, boolean exceedMaximum, int samples, 
double... ecdf) {
+    Preconditions.checkArgument(ecdf.length % 2 == 0, "ecdf must have an even 
count of values");
+    Preconditions.checkArgument(samples >= 3, "Sample size must be >= 3");
+
+    // if we can't exceed the observed bounds, then we have to be given the 
bounds.
+    Preconditions.checkArgument(exceedMinimum || ecdf[0] == 0);
+    Preconditions.checkArgument(exceedMaximum || ecdf[ecdf.length - 2] == 1);
+
+    gen = RandomUtils.getRandom();
+
+    n = ecdf.length / 2;
+    x = new double[n];
+    y = new double[n];
+
+    double lastX = ecdf[1];
+    double lastY = ecdf[0];
+    for (int i = 0; i < ecdf.length; i += 2) {
+      // values have to be monotonic increasing
+      Preconditions.checkArgument(i == 0 || ecdf[i + 1] > lastY);
+      y[i / 2] = ecdf[i + 1];
+      lastY = y[i / 2];
+
+      // quantiles have to be in [0,1] and be monotonic increasing
+      Preconditions.checkArgument(ecdf[i] >= 0 && ecdf[i] <= 1);
+      Preconditions.checkArgument(i == 0 || ecdf[i] > lastX);
+
+      x[i / 2] = ecdf[i];
+      lastX = x[i / 2];
+    }
+
+    // squeeze a bit to allow for unobserved tails
+    double x0 = exceedMinimum ? 0.5 / samples : 0;
+    double x1 = 1 - (exceedMaximum ? 0.5 / samples : 0);
+    for (int i = 0; i < n; i++) {
+      x[i] = x[i] * (x1 - x0) + x0;
+    }
+
+    this.exceedMinimum = exceedMinimum;
+    this.exceedMaximum = exceedMaximum;
+  }
+
+  @Override
+  public Double sample() {
+    return sample(gen.nextDouble());
+  }
+
+  public double sample(double u) {
+    if (exceedMinimum && u < x[0]) {
+      // generate from left tail
+      if (u == 0) {
+        u = 1.0e-16;
+      }
+      return y[0] + Math.log(u / x[0]) * x[0] * (y[1] - y[0]) / (x[1] - x[0]);
+    } else if (exceedMaximum && u > x[n - 1]) {
+      if (u == 1) {
+        u = 1 - 1.0e-16;
+      }
+      // generate from right tail
+      double dy = y[n - 1] - y[n - 2];
+      double dx = x[n - 1] - x[n - 2];
+      return y[n - 1] - Math.log((1 - u) / (1 - x[n - 1])) * (1 - x[n - 1]) * 
dy / dx;
+    } else {
+      // linear interpolation
+      for (int i = 1; i < n; i++) {
+        if (x[i] > u) {
+          double dy = y[i] - y[i - 1];
+          double dx = x[i] - x[i - 1];
+          return y[i - 1] + (u - x[i - 1]) * dy / dx;
+        }
+      }
+      throw new RuntimeException(String.format("Can't happen (%.3f is not in 
[%.3f,%.3f]", u, x[0], x[n - 1]));
+    }
+  }
+}

http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/random/IndianBuffet.java
----------------------------------------------------------------------
diff --git a/core/src/main/java/org/apache/mahout/math/random/IndianBuffet.java 
b/core/src/main/java/org/apache/mahout/math/random/IndianBuffet.java
new file mode 100644
index 0000000..27b5d84
--- /dev/null
+++ b/core/src/main/java/org/apache/mahout/math/random/IndianBuffet.java
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.math.random;
+
+import com.google.common.base.CharMatcher;
+import com.google.common.base.Charsets;
+import com.google.common.base.Splitter;
+import com.google.common.collect.Iterables;
+import com.google.common.collect.Lists;
+import com.google.common.io.LineProcessor;
+import com.google.common.io.Resources;
+import org.apache.mahout.common.RandomUtils;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Random;
+
+/**
+ * Samples a "document" from an IndianBuffet process.
+ *
+ * See http://mlg.eng.cam.ac.uk/zoubin/talks/turin09.pdf for details
+ */
+public final class IndianBuffet<T> implements Sampler<List<T>> {
+  private final List<Integer> count = Lists.newArrayList();
+  private int documents = 0;
+  private final double alpha;
+  private WordFunction<T> converter = null;
+  private final Random gen;
+
+  public IndianBuffet(double alpha, WordFunction<T> converter) {
+    this.alpha = alpha;
+    this.converter = converter;
+    gen = RandomUtils.getRandom();
+  }
+
+  public static IndianBuffet<Integer> createIntegerDocumentSampler(double 
alpha) {
+    return new IndianBuffet<>(alpha, new IdentityConverter());
+  }
+
+  public static IndianBuffet<String> createTextDocumentSampler(double alpha) {
+    return new IndianBuffet<>(alpha, new WordConverter());
+  }
+
+  @Override
+  public List<T> sample() {
+    List<T> r = Lists.newArrayList();
+    if (documents == 0) {
+      double n = new PoissonSampler(alpha).sample();
+      for (int i = 0; i < n; i++) {
+        r.add(converter.convert(i));
+        count.add(1);
+      }
+      documents++;
+    } else {
+      documents++;
+      int i = 0;
+      for (double cnt : count) {
+        if (gen.nextDouble() < cnt / documents) {
+          r.add(converter.convert(i));
+          count.set(i, count.get(i) + 1);
+        }
+        i++;
+      }
+      int newItems = new PoissonSampler(alpha / documents).sample().intValue();
+      for (int j = 0; j < newItems; j++) {
+        r.add(converter.convert(i + j));
+        count.add(1);
+      }
+    }
+    return r;
+  }
+
+  private interface WordFunction<T> {
+    T convert(int i);
+  }
+
+  /**
+   * Just converts to an integer.
+   */
+  public static class IdentityConverter implements WordFunction<Integer> {
+    @Override
+    public Integer convert(int i) {
+      return i;
+    }
+  }
+
+  /**
+   * Converts to a string.
+   */
+  public static class StringConverter implements WordFunction<String> {
+    @Override
+    public String convert(int i) {
+      return String.valueOf(i);
+    }
+  }
+
+  /**
+   * Converts to one of a list of common English words for reasonably small 
integers and converts
+   * to a token like w_92463 for big integers.
+   */
+  public static final class WordConverter implements WordFunction<String> {
+    private final Splitter onSpace = 
Splitter.on(CharMatcher.WHITESPACE).omitEmptyStrings().trimResults();
+    private final List<String> words;
+
+    public WordConverter() {
+      try {
+        words = Resources.readLines(Resources.getResource("words.txt"), 
Charsets.UTF_8,
+                                    new LineProcessor<List<String>>() {
+            private final List<String> theWords = Lists.newArrayList();
+
+            @Override
+            public boolean processLine(String line) {
+              Iterables.addAll(theWords, onSpace.split(line));
+              return true;
+            }
+
+            @Override
+            public List<String> getResult() {
+              return theWords;
+            }
+          });
+      } catch (IOException e) {
+        throw new ImpossibleException(e);
+      }
+    }
+
+    @Override
+    public String convert(int i) {
+      if (i < words.size()) {
+        return words.get(i);
+      } else {
+        return "w_" + i;
+      }
+    }
+  }
+
+  public static class ImpossibleException extends RuntimeException {
+    public ImpossibleException(Throwable e) {
+      super(e);
+    }
+  }
+}

http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/random/Missing.java
----------------------------------------------------------------------
diff --git a/core/src/main/java/org/apache/mahout/math/random/Missing.java 
b/core/src/main/java/org/apache/mahout/math/random/Missing.java
new file mode 100644
index 0000000..8141a71
--- /dev/null
+++ b/core/src/main/java/org/apache/mahout/math/random/Missing.java
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.math.random;
+
+import java.util.Random;
+
+import org.apache.mahout.common.RandomUtils;
+
+/**
+ * Models data with missing values.  Note that all variables with the same 
fraction of missing
+ * values will have the same sequence of missing values.  Similarly, if two 
variables have
+ * missing probabilities of p1 > p2, then all of the p2 missing values will 
also be missing for
+ * p1.
+ */
+public final class Missing<T> implements Sampler<T> {
+  private final Random gen;
+  private final double p;
+  private final Sampler<T> delegate;
+  private final T missingMarker;
+
+  public Missing(int seed, double p, Sampler<T> delegate, T missingMarker) {
+    this.p = p;
+    this.delegate = delegate;
+    this.missingMarker = missingMarker;
+    gen = RandomUtils.getRandom(seed);
+  }
+
+  public Missing(double p, Sampler<T> delegate, T missingMarker) {
+    this(1, p, delegate, missingMarker);
+  }
+
+  public Missing(double p, Sampler<T> delegate) {
+    this(1, p, delegate, null);
+  }
+
+  @Override
+  public T sample() {
+    if (gen.nextDouble() >= p) {
+      return delegate.sample();
+    } else {
+      return missingMarker;
+    }
+  }
+}

http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/random/MultiNormal.java
----------------------------------------------------------------------
diff --git a/core/src/main/java/org/apache/mahout/math/random/MultiNormal.java 
b/core/src/main/java/org/apache/mahout/math/random/MultiNormal.java
new file mode 100644
index 0000000..748d4e8
--- /dev/null
+++ b/core/src/main/java/org/apache/mahout/math/random/MultiNormal.java
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.math.random;
+
+import org.apache.mahout.common.RandomUtils;
+import org.apache.mahout.math.DenseVector;
+import org.apache.mahout.math.DiagonalMatrix;
+import org.apache.mahout.math.Matrix;
+import org.apache.mahout.math.Vector;
+import org.apache.mahout.math.function.DoubleFunction;
+
+import java.util.Random;
+
+/**
+ * Samples from a multi-variate normal distribution.
+ * <p/>
+ * This is done by sampling from several independent unit normal distributions 
to get a vector u.
+ * The sample value that is returned is then A u + m where A is derived from 
the covariance matrix
+ * and m is the mean of the result.
+ * <p/>
+ * If \Sigma is the desired covariance matrix, then you can use any value of A 
such that A' A =
+ * \Sigma.  The Cholesky decomposition can be used to compute A if \Sigma is 
positive definite.
+ * Slightly more expensive is to use the SVD U S V' = \Sigma and then set A = 
U \sqrt{S}.
+ *
+ * Useful special cases occur when \Sigma is diagonal so that A = 
\sqrt(\Sigma) or where \Sigma = r I.
+ *
+ * Another special case is where m = 0.
+ */
+public class MultiNormal implements Sampler<Vector> {
+  private final Random gen;
+  private final int dimension;
+  private final Matrix scale;
+  private final Vector mean;
+
+  /**
+   * Constructs a sampler with diagonal scale matrix.
+   * @param diagonal The diagonal elements of the scale matrix.
+   */
+  public MultiNormal(Vector diagonal) {
+    this(new DiagonalMatrix(diagonal), null);
+  }
+
+  /**
+   * Constructs a sampler with diagonal scale matrix and (potentially)
+   * non-zero mean.
+   * @param diagonal The scale matrix's principal diagonal.
+   * @param mean The desired mean.  Set to null if zero mean is desired.
+   */
+  public MultiNormal(Vector diagonal, Vector mean) {
+    this(new DiagonalMatrix(diagonal), mean);
+  }
+
+  /**
+   * Constructs a sampler with non-trivial scale matrix and mean.
+   */
+  public MultiNormal(Matrix a, Vector mean) {
+    this(a, mean, a.columnSize());
+  }
+
+  public MultiNormal(int dimension) {
+    this(null, null, dimension);
+  }
+
+  public MultiNormal(double radius, Vector mean) {
+    this(new DiagonalMatrix(radius, mean.size()), mean);
+  }
+
+  private MultiNormal(Matrix scale, Vector mean, int dimension) {
+    gen = RandomUtils.getRandom();
+    this.dimension = dimension;
+    this.scale = scale;
+    this.mean = mean;
+  }
+
+  @Override
+  public Vector sample() {
+    Vector v = new DenseVector(dimension).assign(
+      new DoubleFunction() {
+        @Override
+        public double apply(double ignored) {
+          return gen.nextGaussian();
+        }
+      }
+    );
+    if (mean != null) {
+      if (scale != null) {
+        return scale.times(v).plus(mean);
+      } else {
+        return v.plus(mean);
+      }
+    } else {
+      if (scale != null) {
+        return scale.times(v);
+      } else {
+        return v;
+      }
+    }
+  }
+
+  public Vector getScale() {
+    return mean;
+  }
+}

[04/15] mahout git commit: NO-JIRA Trevors updates

Reply via email to