http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/list/package-info.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/list/package-info.java b/core/src/main/java/org/apache/mahout/math/list/package-info.java new file mode 100644 index 0000000..43b5c4b --- /dev/null +++ b/core/src/main/java/org/apache/mahout/math/list/package-info.java @@ -0,0 +1,144 @@ +/** + * <HTML> + * <BODY> + * Resizable lists holding objects or primitive data types such as <tt>int</tt>, + * <tt>double</tt>, etc. For non-resizable lists (1-dimensional matrices) see + * package <code>org.apache.mahout.math.matrix</code>.<p></p> + * <h1><a name="Overview"></a>Getting Started</h1> + * <h2>1. Overview</h2> + * <p>The list package offers flexible object oriented abstractions modelling dynamically + * resizing lists holding objects or primitive data types such as <tt>int</tt>, + * <tt>double</tt>, etc. It is designed to be scalable in terms of performance + * and memory requirements.</p> + * <p>Features include: </p> + * <p></p> + * <ul> + * <li>Lists operating on objects as well as all primitive data types such as <tt>int</tt>, + * <tt>double</tt>, etc. + * </li> + * <li>Compact representations</li> + * <li>A number of general purpose list operations including: adding, inserting, + * removing, iterating, searching, sorting, extracting ranges and copying. All + * operations are designed to perform well on mass data. + * </li> + * <li>Support for quick access to list elements. This is achieved by bounds-checking + * and non-bounds-checking accessor methods as well as zero-copy transformations + * to primitive arrays such as <tt>int[]</tt>, <tt>double[]</tt>, etc. + * </li> + * <li>Allows to use high level algorithms on primitive data types without any + * space and time overhead. Operations on primitive arrays, Colt lists and JAL + * algorithms can freely be mixed at zero copy overhead. + * </li> + * </ul> + * <p>File-based I/O can be achieved through the standard Java built-in serialization + * mechanism. All classes implement the {@link java.io.Serializable} interface. + * However, the toolkit is entirely decoupled from advanced I/O. It provides data + * structures and algorithms only. + * <p> This toolkit borrows concepts and terminology from the Javasoft <a + * href="http://www.javasoft.com/products/jdk/1.2/docs/guide/collections/index.html"> + * Collections framework</a> written by Josh Bloch and introduced in JDK 1.2. + * <h2>2. Introduction</h2> + * <p>Lists are fundamental to virtually any application. Large scale resizable lists + * are, for example, used in scientific computations, simulations database management + * systems, to name just a few.</p> + * <h2></h2> + * <p>A list is a container holding elements that can be accessed via zero-based + * indexes. Lists may be implemented in different ways (most commonly with arrays). + * A resizable list automatically grows as elements are added. The lists of this + * package do not automatically shrink. Shrinking needs to be triggered by explicitly + * calling <tt>trimToSize()</tt> methods.</p> + * <p><i>Growing policy</i>: A list implemented with arrays initially has a certain + * <tt>initialCapacity</tt> - per default 10 elements, but customizable upon instance + * construction. As elements are added, this capacity may nomore be sufficient. + * When a list is automatically grown, its capacity is expanded to <tt>1.5*currentCapacity</tt>. + * Thus, excessive resizing (involving copying) is avoided.</p> + * <h4>Copying</h4> + * <p> + * <p>Any list can be copied. A copy is <i>equal</i> to the original but entirely + * independent of the original. So changes in the copy are not reflected in the + * original, and vice-versa. + * <h2>3. Organization of this package</h2> + * <p>Class naming follows the schema <tt><ElementType><ImplementationTechnique>List</tt>. + * For example, we have a {@link org.apache.mahout.math.list.DoubleArrayList}, which is a list + * holding <tt>double</tt> elements implemented with <tt>double</tt>[] arrays. + * </p> + * <p>The classes for lists of a given value type are derived from a common abstract + * base class tagged <tt>Abstract<ElementType></tt><tt>List</tt>. For example, + * all lists operating on <tt>double</tt> elements are derived from + * {@link org.apache.mahout.math.list.AbstractDoubleList}, + * which in turn is derived from an abstract base class tying together all lists + * regardless of value type, {@link org.apache.mahout.math.list.AbstractList}. The abstract + * base classes provide skeleton implementations for all but few methods. Experimental + * data layouts (such as compressed, sparse, linked, etc.) can easily be implemented + * and inherit a rich set of functionality. Have a look at the javadoc <a href="package-tree.html">tree + * view</a> to get the broad picture.</p> + * <h2>4. Example usage</h2> + * <p>The following snippet fills a list, randomizes it, extracts the first half + * of the elements, sums them up and prints the result. It is implemented entirely + * with accessor methods.</p> + * <table> + * <td class="PRE"> + * <pre> + * int s = 1000000;<br>AbstractDoubleList list = new DoubleArrayList(); + * for (int i=0; i<s; i++) { list.add((double)i); } + * list.shuffle(); + * AbstractDoubleList part = list.partFromTo(0,list.size()/2 - 1); + * double sum = 0.0; + * for (int i=0; i<part.size(); i++) { sum += part.get(i); } + * log.info(sum); + * </pre> + * </td> + * </table> + * <p> For efficiency, all classes provide back doors to enable getting/setting the + * backing array directly. In this way, the high level operations of these classes + * can be used where appropriate, and one can switch to <tt>[]</tt>-array index + * notations where necessary. The key methods for this are <tt>public <ElementType>[] + * elements()</tt> and <tt>public void elements(<ElementType>[])</tt>. The + * former trustingly returns the array it internally keeps to store the elements. + * Holding this array in hand, we can use the <tt>[]</tt>-array operator to + * perform iteration over large lists without needing to copy the array or paying + * the performance penalty introduced by accessor methods. Alternatively any JAL + * algorithm (or other algorithm) can operate on the returned primitive array. + * The latter method forces a list to internally hold a user provided array. Using + * this approach one can avoid needing to copy the elements into the list. + * <p>As a consequence, operations on primitive arrays, Colt lists and JAL algorithms + * can freely be mixed at zero-copy overhead. + * <p> Note that such special treatment certainly breaks encapsulation. This functionality + * is provided for performance reasons only and should only be used when absolutely + * necessary. Here is the above example in mixed notation: + * <table> + * <td class="PRE"> + * <pre> + * int s = 1000000;<br>DoubleArrayList list = new DoubleArrayList(s); // list.size()==0, capacity==s + * list.setSize(s); // list.size()==s<br>double[] values = list.elements(); + * // zero copy, values.length==s<br>for (int i=0; i<s; i++) { values[i]=(double)i; } + * list.shuffle(); + * double sum = 0.0; + * int limit = values.length/2; + * for (int i=0; i<limit; i++) { sum += values[i]; } + * log.info(sum); + * </pre> + * </td> + * </table> + * <p> Or even more compact using lists as algorithm objects: + * <table> + * <td class="PRE"> + * <pre> + * int s = 1000000;<br>double[] values = new double[s]; + * for (int i=0; i<s; i++) { values[i]=(double)i; } + * new DoubleArrayList(values).shuffle(); // zero-copy, shuffle via back door + * double sum = 0.0; + * int limit = values.length/2; + * for (int i=0; i<limit; i++) { sum += values[i]; } + * log.info(sum); + * </pre> + * </td> + * </table> + * <p> + * <h2>5. Notes </h2> + * <p>The quicksorts and mergesorts are the JDK 1.2 V1.26 algorithms, modified as + * necessary to operate on the given data types. + * </BODY> + * </HTML> + */ +package org.apache.mahout.math.list;
http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/map/HashFunctions.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/map/HashFunctions.java b/core/src/main/java/org/apache/mahout/math/map/HashFunctions.java new file mode 100644 index 0000000..b749307 --- /dev/null +++ b/core/src/main/java/org/apache/mahout/math/map/HashFunctions.java @@ -0,0 +1,115 @@ +/* +Copyright 1999 CERN - European Organization for Nuclear Research. +Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose +is hereby granted without fee, provided that the above copyright notice appear in all copies and +that both that copyright notice and this permission notice appear in supporting documentation. +CERN makes no representations about the suitability of this software for any purpose. +It is provided "as is" without expressed or implied warranty. +*/ +package org.apache.mahout.math.map; + + +/** + * Provides various hash functions. + */ +public final class HashFunctions { + + /** + * Utility class pattern: all static members, no inheritance. + */ + private HashFunctions() { + } + + /** + * Returns a hashcode for the specified value. + * + * @return a hash code value for the specified value. + */ + public static int hash(char value) { + return value; + } + + /** + * Returns a hashcode for the specified value. + * + * @return a hash code value for the specified value. + */ + public static int hash(double value) { + long bits = Double.doubleToLongBits(value); + return (int) (bits ^ (bits >>> 32)); + + //return (int) Double.doubleToLongBits(value*663608941.737); + // this avoids excessive hashCollisions in the case values are of the form (1.0, 2.0, 3.0, ...) + } + + /** + * Returns a hashcode for the specified value. + * + * @return a hash code value for the specified value. + */ + public static int hash(float value) { + return Float.floatToIntBits(value * 663608941.737f); + // this avoids excessive hashCollisions in the case values are of the form (1.0, 2.0, 3.0, ...) + } + + /** + * Returns a hashcode for the specified value. + * The hashcode computation is similar to the last step + * of MurMurHash3. + * + * @return a hash code value for the specified value. + */ + public static int hash(int value) { + int h = value; + h ^= h >>> 16; + h *= 0x85ebca6b; + h ^= h >>> 13; + h *= 0xc2b2ae35; + h ^= h >>> 16; + return h; + } + + /** + * Returns a hashcode for the specified value. + * + * @return a hash code value for the specified value. + */ + public static int hash(long value) { + return (int) (value ^ (value >> 32)); + /* + value &= 0x7FFFFFFFFFFFFFFFL; // make it >=0 (0x7FFFFFFFFFFFFFFFL==Long.MAX_VALUE) + int hashCode = 0; + do hashCode = 31*hashCode + (int) (value%10); + while ((value /= 10) > 0); + + return 28629151*hashCode; // spread even further; h*31^5 + */ + } + + /** + * Returns a hashcode for the specified object. + * + * @return a hash code value for the specified object. + */ + public static int hash(Object object) { + return object == null ? 0 : object.hashCode(); + } + + /** + * Returns a hashcode for the specified value. + * + * @return a hash code value for the specified value. + */ + public static int hash(short value) { + return value; + } + + /** + * Returns a hashcode for the specified value. + * + * @return a hash code value for the specified value. + */ + public static int hash(boolean value) { + return value ? 1231 : 1237; + } +} http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/map/OpenHashMap.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/map/OpenHashMap.java b/core/src/main/java/org/apache/mahout/math/map/OpenHashMap.java new file mode 100644 index 0000000..0efca4b --- /dev/null +++ b/core/src/main/java/org/apache/mahout/math/map/OpenHashMap.java @@ -0,0 +1,652 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +/* +Copyright � 1999 CERN - European Organization for Nuclear Research. +Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose +is hereby granted without fee, provided that the above copyright notice appear in all copies and +that both that copyright notice and this permission notice appear in supporting documentation. +CERN makes no representations about the suitability of this software for any purpose. +It is provided "as is" without expressed or implied warranty. +*/ +package org.apache.mahout.math.map; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collection; +import java.util.List; +import java.util.Map; +import java.util.Set; + +import org.apache.mahout.math.function.ObjectObjectProcedure; +import org.apache.mahout.math.function.ObjectProcedure; +import org.apache.mahout.math.set.AbstractSet; +import org.apache.mahout.math.set.OpenHashSet; + +/** + * Open hash map. This implements Map, but it does not respect several aspects of the Map contract + * that impose the very sorts of performance penalities that this class exists to avoid. + * {@link #entrySet}, {@link #values}, and {@link #keySet()} do <strong>not</strong> return + * collections that share storage with the main map, and changes to those returned objects + * are <strong>not</strong> reflected in the container. + **/ +public class OpenHashMap<K,V> extends AbstractSet implements Map<K,V> { + protected static final byte FREE = 0; + protected static final byte FULL = 1; + protected static final byte REMOVED = 2; + protected static final Object NO_KEY_VALUE = null; + + /** The hash table keys. */ + protected Object[] table; + + /** The hash table values. */ + protected Object[] values; + + /** The state of each hash table entry (FREE, FULL, REMOVED). */ + protected byte[] state; + + /** The number of table entries in state==FREE. */ + protected int freeEntries; + + + /** Constructs an empty map with default capacity and default load factors. */ + public OpenHashMap() { + this(DEFAULT_CAPACITY); + } + + /** + * Constructs an empty map with the specified initial capacity and default load factors. + * + * @param initialCapacity the initial capacity of the map. + * @throws IllegalArgumentException if the initial capacity is less than zero. + */ + public OpenHashMap(int initialCapacity) { + this(initialCapacity, DEFAULT_MIN_LOAD_FACTOR, DEFAULT_MAX_LOAD_FACTOR); + } + + /** + * Constructs an empty map with the specified initial capacity and the specified minimum and maximum load factor. + * + * @param initialCapacity the initial capacity. + * @param minLoadFactor the minimum load factor. + * @param maxLoadFactor the maximum load factor. + * @throws IllegalArgumentException if <tt>initialCapacity < 0 || (minLoadFactor < 0.0 || minLoadFactor >= 1.0) || + * (maxLoadFactor <= 0.0 || maxLoadFactor >= 1.0) || (minLoadFactor >= + * maxLoadFactor)</tt>. + */ + public OpenHashMap(int initialCapacity, double minLoadFactor, double maxLoadFactor) { + setUp(initialCapacity, minLoadFactor, maxLoadFactor); + } + + /** Removes all (key,value) associations from the receiver. Implicitly calls <tt>trimToSize()</tt>. */ + @Override + public void clear() { + Arrays.fill(this.state, FREE); + distinct = 0; + freeEntries = table.length; // delta + trimToSize(); + } + + /** + * Returns a deep copy of the receiver. + * + * @return a deep copy of the receiver. + */ + @Override + @SuppressWarnings("unchecked") + public Object clone() { + OpenHashMap<K,V> copy = (OpenHashMap<K,V>) super.clone(); + copy.table = copy.table.clone(); + copy.values = copy.values.clone(); + copy.state = copy.state.clone(); + return copy; + } + + /** + * Returns <tt>true</tt> if the receiver contains the specified key. + * + * @return <tt>true</tt> if the receiver contains the specified key. + */ + @SuppressWarnings("unchecked") + @Override + public boolean containsKey(Object key) { + return indexOfKey((K)key) >= 0; + } + + /** + * Returns <tt>true</tt> if the receiver contains the specified value. + * + * @return <tt>true</tt> if the receiver contains the specified value. + */ + @SuppressWarnings("unchecked") + @Override + public boolean containsValue(Object value) { + return indexOfValue((V)value) >= 0; + } + + /** + * Ensures that the receiver can hold at least the specified number of associations without needing to allocate new + * internal memory. If necessary, allocates new internal memory and increases the capacity of the receiver. <p> This + * method never need be called; it is for performance tuning only. Calling this method before <tt>put()</tt>ing a + * large number of associations boosts performance, because the receiver will grow only once instead of potentially + * many times and hash collisions get less probable. + * + * @param minCapacity the desired minimum capacity. + */ + @Override + public void ensureCapacity(int minCapacity) { + if (table.length < minCapacity) { + int newCapacity = nextPrime(minCapacity); + rehash(newCapacity); + } + } + + /** + * Applies a procedure to each key of the receiver, if any. Note: Iterates over the keys in no particular order. + * Subclasses can define a particular order, for example, "sorted by key". All methods which <i>can</i> be expressed + * in terms of this method (most methods can) <i>must guarantee</i> to use the <i>same</i> order defined by this + * method, even if it is no particular order. This is necessary so that, for example, methods <tt>keys</tt> and + * <tt>values</tt> will yield association pairs, not two uncorrelated lists. + * + * @param procedure the procedure to be applied. Stops iteration if the procedure returns <tt>false</tt>, otherwise + * continues. + * @return <tt>false</tt> if the procedure stopped before all keys where iterated over, <tt>true</tt> otherwise. + */ + @SuppressWarnings("unchecked") + public boolean forEachKey(ObjectProcedure<K> procedure) { + for (int i = table.length; i-- > 0;) { + if (state[i] == FULL && !procedure.apply((K)table[i])) { + return false; + } + } + return true; + } + + /** + * Applies a procedure to each (key,value) pair of the receiver, if any. Iteration order is guaranteed to be + * <i>identical</i> to the order used by method {@link #forEachKey(ObjectProcedure)}. + * + * @param procedure the procedure to be applied. Stops iteration if the procedure returns <tt>false</tt>, otherwise + * continues. + * @return <tt>false</tt> if the procedure stopped before all keys where iterated over, <tt>true</tt> otherwise. + */ + @SuppressWarnings("unchecked") + public boolean forEachPair(ObjectObjectProcedure<K,V> procedure) { + for (int i = table.length; i-- > 0;) { + if (state[i] == FULL && !procedure.apply((K)table[i], (V)values[i])) { + return false; + } + } + return true; + } + + /** + * Returns the value associated with the specified key. It is often a good idea to first check with {@link + * #containsKey(Object)} whether the given key has a value associated or not, i.e. whether there exists an association + * for the given key or not. + * + * @param key the key to be searched for. + * @return the value associated with the specified key; <tt>0</tt> if no such key is present. + */ + @SuppressWarnings("unchecked") + @Override + public V get(Object key) { + int i = indexOfKey((K)key); + if (i < 0) { + return null; + } //not contained + return (V)values[i]; + } + + /** + * @param key the key to be added to the receiver. + * @return the index where the key would need to be inserted, if it is not already contained. Returns -index-1 if the + * key is already contained at slot index. Therefore, if the returned index < 0, then it is already contained + * at slot -index-1. If the returned index >= 0, then it is NOT already contained and should be inserted at + * slot index. + */ + protected int indexOfInsertion(K key) { + Object[] tab = table; + byte[] stat = state; + int length = tab.length; + + int hash = key.hashCode() & 0x7FFFFFFF; + int i = hash % length; + int decrement = hash % (length - 2); // double hashing, see http://www.eece.unm.edu/faculty/heileman/hash/node4.html + //int decrement = (hash / length) % length; + if (decrement == 0) { + decrement = 1; + } + + // stop if we find a removed or free slot, or if we find the key itself + // do NOT skip over removed slots (yes, open addressing is like that...) + while (stat[i] == FULL && !equalsMindTheNull(key, tab[i])) { + i -= decrement; + //hashCollisions++; + if (i < 0) { + i += length; + } + } + + if (stat[i] == REMOVED) { + // stop if we find a free slot, or if we find the key itself. + // do skip over removed slots (yes, open addressing is like that...) + // assertion: there is at least one FREE slot. + int j = i; + while (stat[i] != FREE && (stat[i] == REMOVED || tab[i] != key)) { + i -= decrement; + //hashCollisions++; + if (i < 0) { + i += length; + } + } + if (stat[i] == FREE) { + i = j; + } + } + + + if (stat[i] == FULL) { + // key already contained at slot i. + // return a negative number identifying the slot. + return -i - 1; + } + // not already contained, should be inserted at slot i. + // return a number >= 0 identifying the slot. + return i; + } + + /** + * @param key the key to be searched in the receiver. + * @return the index where the key is contained in the receiver, returns -1 if the key was not found. + */ + protected int indexOfKey(K key) { + Object[] tab = table; + byte[] stat = state; + int length = tab.length; + + int hash = key.hashCode() & 0x7FFFFFFF; + int i = hash % length; + int decrement = hash % (length - 2); // double hashing, see http://www.eece.unm.edu/faculty/heileman/hash/node4.html + //int decrement = (hash / length) % length; + if (decrement == 0) { + decrement = 1; + } + + // stop if we find a free slot, or if we find the key itself. + // do skip over removed slots (yes, open addressing is like that...) + while (stat[i] != FREE && (stat[i] == REMOVED || !equalsMindTheNull(key, tab[i]))) { + i -= decrement; + //hashCollisions++; + if (i < 0) { + i += length; + } + } + + if (stat[i] == FREE) { + return -1; + } // not found + return i; //found, return index where key is contained + } + + /** + * @param value the value to be searched in the receiver. + * @return the index where the value is contained in the receiver, returns -1 if the value was not found. + */ + protected int indexOfValue(V value) { + Object[] val = values; + byte[] stat = state; + + for (int i = stat.length; --i >= 0;) { + if (stat[i] == FULL && equalsMindTheNull(val[i], value)) { + return i; + } + } + + return -1; // not found + } + + /** + * Fills all keys contained in the receiver into the specified list. Fills the list, starting at index 0. After this + * call returns the specified list has a new size that equals <tt>this.size()</tt>. + * This method can be used + * to iterate over the keys of the receiver. + * + * @param list the list to be filled, can have any size. + */ + @SuppressWarnings("unchecked") + public void keys(List<K> list) { + list.clear(); + + + Object [] tab = table; + byte[] stat = state; + + for (int i = tab.length; i-- > 0;) { + if (stat[i] == FULL) { + list.add((K)tab[i]); + } + } + } + + /** + * Associates the given key with the given value. Replaces any old <tt>(key,someOtherValue)</tt> association, if + * existing. + * + * @param key the key the value shall be associated with. + * @param value the value to be associated. + * @return <tt>true</tt> if the receiver did not already contain such a key; <tt>false</tt> if the receiver did + * already contain such a key - the new value has now replaced the formerly associated value. + */ + @SuppressWarnings("unchecked") + @Override + public V put(K key, V value) { + int i = indexOfInsertion(key); + if (i < 0) { //already contained + i = -i - 1; + V previous = (V) this.values[i]; + this.values[i] = value; + return previous; + } + + if (this.distinct > this.highWaterMark) { + int newCapacity = chooseGrowCapacity(this.distinct + 1, this.minLoadFactor, this.maxLoadFactor); + rehash(newCapacity); + return put(key, value); + } + + this.table[i] = key; + this.values[i] = value; + if (this.state[i] == FREE) { + this.freeEntries--; + } + this.state[i] = FULL; + this.distinct++; + + if (this.freeEntries < 1) { //delta + int newCapacity = chooseGrowCapacity(this.distinct + 1, this.minLoadFactor, this.maxLoadFactor); + rehash(newCapacity); + } + + return null; + } + + /** + * Rehashes the contents of the receiver into a new table with a smaller or larger capacity. This method is called + * automatically when the number of keys in the receiver exceeds the high water mark or falls below the low water + * mark. + */ + @SuppressWarnings("unchecked") + protected void rehash(int newCapacity) { + int oldCapacity = table.length; + //if (oldCapacity == newCapacity) return; + + Object[] oldTable = table; + Object[] oldValues = values; + byte[] oldState = state; + + Object[] newTable = new Object[newCapacity]; + Object[] newValues = new Object[newCapacity]; + byte[] newState = new byte[newCapacity]; + + this.lowWaterMark = chooseLowWaterMark(newCapacity, this.minLoadFactor); + this.highWaterMark = chooseHighWaterMark(newCapacity, this.maxLoadFactor); + + this.table = newTable; + this.values = newValues; + this.state = newState; + this.freeEntries = newCapacity - this.distinct; // delta + + for (int i = oldCapacity; i-- > 0;) { + if (oldState[i] == FULL) { + Object element = oldTable[i]; + int index = indexOfInsertion((K)element); + newTable[index] = element; + newValues[index] = oldValues[i]; + newState[index] = FULL; + } + } + } + + /** + * Removes the given key with its associated element from the receiver, if present. + * + * @param key the key to be removed from the receiver. + * @return <tt>true</tt> if the receiver contained the specified key, <tt>false</tt> otherwise. + */ + @SuppressWarnings("unchecked") + @Override + public V remove(Object key) { + int i = indexOfKey((K)key); + if (i < 0) { + return null; + } + // key not contained + V removed = (V) values[i]; + + this.state[i] = REMOVED; + //this.values[i]=0; // delta + this.distinct--; + + if (this.distinct < this.lowWaterMark) { + int newCapacity = chooseShrinkCapacity(this.distinct, this.minLoadFactor, this.maxLoadFactor); + rehash(newCapacity); + } + + return removed; + } + + /** + * Initializes the receiver. + * + * @param initialCapacity the initial capacity of the receiver. + * @param minLoadFactor the minLoadFactor of the receiver. + * @param maxLoadFactor the maxLoadFactor of the receiver. + * @throws IllegalArgumentException if <tt>initialCapacity < 0 || (minLoadFactor < 0.0 || minLoadFactor >= 1.0) || + * (maxLoadFactor <= 0.0 || maxLoadFactor >= 1.0) || (minLoadFactor >= + * maxLoadFactor)</tt>. + */ + @Override + protected void setUp(int initialCapacity, double minLoadFactor, double maxLoadFactor) { + int capacity = initialCapacity; + super.setUp(capacity, minLoadFactor, maxLoadFactor); + capacity = nextPrime(capacity); + if (capacity == 0) { + capacity = 1; + } // open addressing needs at least one FREE slot at any time. + + this.table = new Object[capacity]; + this.values = new Object[capacity]; + this.state = new byte[capacity]; + + // memory will be exhausted long before this pathological case happens, anyway. + this.minLoadFactor = minLoadFactor; + if (capacity == PrimeFinder.LARGEST_PRIME) { + this.maxLoadFactor = 1.0; + } else { + this.maxLoadFactor = maxLoadFactor; + } + + this.distinct = 0; + this.freeEntries = capacity; // delta + + // lowWaterMark will be established upon first expansion. + // establishing it now (upon instance construction) would immediately make the table shrink upon first put(...). + // After all the idea of an "initialCapacity" implies violating lowWaterMarks when an object is young. + // See ensureCapacity(...) + this.lowWaterMark = 0; + this.highWaterMark = chooseHighWaterMark(capacity, this.maxLoadFactor); + } + + /** + * Trims the capacity of the receiver to be the receiver's current size. Releases any superfluous internal memory. An + * application can use this operation to minimize the storage of the receiver. + */ + @Override + public void trimToSize() { + // * 1.2 because open addressing's performance exponentially degrades beyond that point + // so that even rehashing the table can take very long + int newCapacity = nextPrime((int) (1 + 1.2 * size())); + if (table.length > newCapacity) { + rehash(newCapacity); + } + } + + /** + * Access for unit tests. + * @param capacity + * @param minLoadFactor + * @param maxLoadFactor + */ + void getInternalFactors(int[] capacity, + double[] minLoadFactor, + double[] maxLoadFactor) { + capacity[0] = table.length; + minLoadFactor[0] = this.minLoadFactor; + maxLoadFactor[0] = this.maxLoadFactor; + } + + private class MapEntry implements Map.Entry<K,V> { + private final K key; + private final V value; + + MapEntry(K key, V value) { + this.key = key; + this.value = value; + } + + @Override + public K getKey() { + return key; + } + + @Override + public V getValue() { + return value; + } + + @Override + public V setValue(V value) { + throw new UnsupportedOperationException("Map.Entry.setValue not supported for OpenHashMap"); + } + + } + + /** + * Allocate a set to contain Map.Entry objects for the pairs and return it. + */ + @Override + public Set<java.util.Map.Entry<K,V>> entrySet() { + final Set<Entry<K, V>> entries = new OpenHashSet<>(); + forEachPair(new ObjectObjectProcedure<K,V>() { + @Override + public boolean apply(K key, V value) { + entries.add(new MapEntry(key, value)); + return true; + } + }); + return entries; + } + + /** + * Allocate a set to contain keys and return it. + * This violates the 'backing' provisions of the map interface. + */ + @Override + public Set<K> keySet() { + final Set<K> keys = new OpenHashSet<>(); + forEachKey(new ObjectProcedure<K>() { + @Override + public boolean apply(K element) { + keys.add(element); + return true; + } + }); + return keys; + } + + @Override + public void putAll(Map<? extends K,? extends V> m) { + for (Map.Entry<? extends K, ? extends V> e : m.entrySet()) { + put(e.getKey(), e.getValue()); + } + } + + /** + * Allocate a list to contain the values and return it. + * This violates the 'backing' provision of the Map interface. + */ + @Override + public Collection<V> values() { + final List<V> valueList = new ArrayList<>(); + forEachPair(new ObjectObjectProcedure<K,V>() { + @Override + public boolean apply(K key, V value) { + valueList.add(value); + return true; + } + }); + return valueList; + } + + @SuppressWarnings("unchecked") + @Override + public boolean equals(Object obj) { + if (!(obj instanceof OpenHashMap)) { + return false; + } + final OpenHashMap<K,V> o = (OpenHashMap<K,V>) obj; + if (o.size() != size()) { + return false; + } + final boolean[] equal = new boolean[1]; + equal[0] = true; + forEachPair(new ObjectObjectProcedure<K,V>() { + @Override + public boolean apply(K key, V value) { + Object ov = o.get(key); + if (!value.equals(ov)) { + equal[0] = false; + return false; + } + return true; + } + }); + return equal[0]; + } + + @Override + public String toString() { + final StringBuilder sb = new StringBuilder(); + sb.append('{'); + forEachPair(new ObjectObjectProcedure<K,V>() { + @Override + public boolean apply(K key, V value) { + sb.append('['); + sb.append(key); + sb.append(" -> "); + sb.append(value); + sb.append("] "); + return true; + } + }); + sb.append('}'); + return sb.toString(); + } +} http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/map/PrimeFinder.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/map/PrimeFinder.java b/core/src/main/java/org/apache/mahout/math/map/PrimeFinder.java new file mode 100644 index 0000000..b02611e --- /dev/null +++ b/core/src/main/java/org/apache/mahout/math/map/PrimeFinder.java @@ -0,0 +1,145 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.mahout.math.map; + +import java.util.Arrays; + +/** + * Not of interest for users; only for implementors of hashtables. + * Used to keep hash table capacities prime numbers. + * + * <p>Choosing prime numbers as hash table capacities is a good idea to keep them working fast, + * particularly under hash table expansions. + * + * <p>However, JDK 1.2, JGL 3.1 and many other toolkits do nothing to keep capacities prime. + * This class provides efficient means to choose prime capacities. + * + * <p>Choosing a prime is <tt>O(log 300)</tt> (binary search in a list of 300 int's). + * Memory requirements: 1 KB static memory. + * + */ +public final class PrimeFinder { + + /** The largest prime this class can generate; currently equal to <tt>Integer.MAX_VALUE</tt>. */ + public static final int LARGEST_PRIME = Integer.MAX_VALUE; //yes, it is prime. + + /** + * The prime number list consists of 11 chunks. Each chunk contains prime numbers. A chunk starts with a prime P1. The + * next element is a prime P2. P2 is the smallest prime for which holds: P2 >= 2*P1. The next element is P3, for which + * the same holds with respect to P2, and so on. + * + * Chunks are chosen such that for any desired capacity >= 1000 the list includes a prime number <= desired capacity * + * 1.11 (11%). For any desired capacity >= 200 the list includes a prime number <= desired capacity * 1.16 (16%). For + * any desired capacity >= 16 the list includes a prime number <= desired capacity * 1.21 (21%). + * + * Therefore, primes can be retrieved which are quite close to any desired capacity, which in turn avoids wasting + * memory. For example, the list includes 1039,1117,1201,1277,1361,1439,1523,1597,1759,1907,2081. So if you need a + * prime >= 1040, you will find a prime <= 1040*1.11=1154. + * + * Chunks are chosen such that they are optimized for a hashtable growthfactor of 2.0; If your hashtable has such a + * growthfactor then, after initially "rounding to a prime" upon hashtable construction, it will later expand to prime + * capacities such that there exist no better primes. + * + * In total these are about 32*10=320 numbers -> 1 KB of static memory needed. If you are stingy, then delete every + * second or fourth chunk. + */ + + private static final int[] PRIME_CAPACITIES = { + //chunk #0 + LARGEST_PRIME, + + //chunk #1 + 5, 11, 23, 47, 97, 197, 397, 797, 1597, 3203, 6421, 12853, 25717, 51437, 102877, 205759, + 411527, 823117, 1646237, 3292489, 6584983, 13169977, 26339969, 52679969, 105359939, + 210719881, 421439783, 842879579, 1685759167, + + //chunk #2 + 433, 877, 1759, 3527, 7057, 14143, 28289, 56591, 113189, 226379, 452759, 905551, 1811107, + 3622219, 7244441, 14488931, 28977863, 57955739, 115911563, 231823147, 463646329, 927292699, + 1854585413, + + //chunk #3 + 953, 1907, 3821, 7643, 15287, 30577, 61169, 122347, 244703, 489407, 978821, 1957651, 3915341, + 7830701, 15661423, 31322867, 62645741, 125291483, 250582987, 501165979, 1002331963, + 2004663929, + + //chunk #4 + 1039, 2081, 4177, 8363, 16729, 33461, 66923, 133853, 267713, 535481, 1070981, 2141977, 4283963, + 8567929, 17135863, 34271747, 68543509, 137087021, 274174111, 548348231, 1096696463, + + //chunk #5 + 31, 67, 137, 277, 557, 1117, 2237, 4481, 8963, 17929, 35863, 71741, 143483, 286973, 573953, + 1147921, 2295859, 4591721, 9183457, 18366923, 36733847, 73467739, 146935499, 293871013, + 587742049, 1175484103, + + //chunk #6 + 599, 1201, 2411, 4831, 9677, 19373, 38747, 77509, 155027, 310081, 620171, 1240361, 2480729, + 4961459, 9922933, 19845871, 39691759, 79383533, 158767069, 317534141, 635068283, 1270136683, + + //chunk #7 + 311, 631, 1277, 2557, 5119, 10243, 20507, 41017, 82037, 164089, 328213, 656429, 1312867, + 2625761, 5251529, 10503061, 21006137, 42012281, 84024581, 168049163, 336098327, 672196673, + 1344393353, + + //chunk #8 + 3, 7, 17, 37, 79, 163, 331, 673, 1361, 2729, 5471, 10949, 21911, 43853, 87719, 175447, 350899, + 701819, 1403641, 2807303, 5614657, 11229331, 22458671, 44917381, 89834777, 179669557, + 359339171, 718678369, 1437356741, + + //chunk #9 + 43, 89, 179, 359, 719, 1439, 2879, 5779, 11579, 23159, 46327, 92657, 185323, 370661, 741337, + 1482707, 2965421, 5930887, 11861791, 23723597, 47447201, 94894427, 189788857, 379577741, + 759155483, 1518310967, + + //chunk #10 + 379, 761, 1523, 3049, 6101, 12203, 24407, 48817, 97649, 195311, 390647, 781301, 1562611, + 3125257, 6250537, 12501169, 25002389, 50004791, 100009607, 200019221, 400038451, 800076929, + 1600153859 + }; + + + static { //initializer + // The above prime numbers are formatted for human readability. + // To find numbers fast, we sort them once and for all. + + Arrays.sort(PRIME_CAPACITIES); + } + + /** Makes this class non instantiable, but still let's others inherit from it. */ + private PrimeFinder() { + } + + /** + * Returns a prime number which is {@code <= desiredCapacity} and very close to {@code desiredCapacity} + * (within 11% if {@code desiredCapacity <= 1000}). + * + * @param desiredCapacity the capacity desired by the user. + * @return the capacity which should be used for a hashtable. + */ + public static int nextPrime(int desiredCapacity) { + int i = java.util.Arrays.binarySearch(PRIME_CAPACITIES, desiredCapacity); + if (i < 0) { + // desired capacity not found, choose next prime greater than desired capacity + i = -i - 1; // remember the semantics of binarySearch... + } + return PRIME_CAPACITIES[i]; + } + +} http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/map/QuickOpenIntIntHashMap.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/map/QuickOpenIntIntHashMap.java b/core/src/main/java/org/apache/mahout/math/map/QuickOpenIntIntHashMap.java new file mode 100644 index 0000000..6a7cef8 --- /dev/null +++ b/core/src/main/java/org/apache/mahout/math/map/QuickOpenIntIntHashMap.java @@ -0,0 +1,215 @@ +/* +Copyright � 1999 CERN - European Organization for Nuclear Research. +Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose +is hereby granted without fee, provided that the above copyright notice appear in all copies and +that both that copyright notice and this permission notice appear in supporting documentation. +CERN makes no representations about the suitability of this software for any purpose. +It is provided "as is" without expressed or implied warranty. +*/ +package org.apache.mahout.math.map; + +/** + * Status: Experimental; Do not use for production yet. Hash map holding (key,value) associations of type + * <tt>(int-->int)</tt>; Automatically grows and shrinks as needed; Implemented using open addressing with double + * hashing. First see the <a href="package-summary.html">package summary</a> and javadoc <a + * href="package-tree.html">tree view</a> to get the broad picture. + * + * Implements open addressing with double hashing, using "Brent's variation". Brent's variation slows insertions a bit + * down (not much) but reduces probes (collisions) for successful searches, in particular for large load factors. (It + * does not improve unsuccessful searches.) See D. Knuth, Searching and Sorting, 3rd ed., p.533-545 + * + * @author [email protected] + * @version 1.0, 09/24/99 + * @see java.util.HashMap + */ +class QuickOpenIntIntHashMap extends OpenIntIntHashMap { + //public int totalProbesSaved = 0; // benchmark only + + /** Constructs an empty map with default capacity and default load factors. */ + QuickOpenIntIntHashMap() { + this(DEFAULT_CAPACITY); + } + + /** + * Constructs an empty map with the specified initial capacity and default load factors. + * + * @param initialCapacity the initial capacity of the map. + * @throws IllegalArgumentException if the initial capacity is less than zero. + */ + QuickOpenIntIntHashMap(int initialCapacity) { + this(initialCapacity, DEFAULT_MIN_LOAD_FACTOR, DEFAULT_MAX_LOAD_FACTOR); + } + + /** + * Constructs an empty map with the specified initial capacity and the specified minimum and maximum load factor. + * + * @param initialCapacity the initial capacity. + * @param minLoadFactor the minimum load factor. + * @param maxLoadFactor the maximum load factor. + * @throws IllegalArgumentException if <tt>initialCapacity < 0 || (minLoadFactor < 0.0 || minLoadFactor >= 1.0) || + * (maxLoadFactor <= 0.0 || maxLoadFactor >= 1.0) || (minLoadFactor >= + * maxLoadFactor)</tt>. + */ + QuickOpenIntIntHashMap(int initialCapacity, double minLoadFactor, double maxLoadFactor) { + setUp(initialCapacity, minLoadFactor, maxLoadFactor); + } + + /** + * Associates the given key with the given value. Replaces any old <tt>(key,someOtherValue)</tt> association, if + * existing. + * + * @param key the key the value shall be associated with. + * @param value the value to be associated. + * @return <tt>true</tt> if the receiver did not already contain such a key; <tt>false</tt> if the receiver did + * already contain such a key - the new value has now replaced the formerly associated value. + */ + @Override + public boolean put(int key, int value) { + /* + This is open addressing with double hashing, using "Brent's variation". + Brent's variation slows insertions a bit down (not much) but reduces probes (collisions) for successful searches, + in particular for large load factors. + (It does not improve unsuccessful searches.) + See D. Knuth, Searching and Sorting, 3rd ed., p.533-545 + + h1(key) = hash % M + h2(key) = decrement = Max(1, hash/M % M) + M is prime = capacity = table.length + probing positions are table[(h1-j*h2) % M] for j=0,1,... + (M and h2 could also be chosen differently, but h2 is required to be relative prime to M.) + */ + + int[] tab = table; + byte[] stat = state; + int length = tab.length; + + int hash = HashFunctions.hash(key) & 0x7FFFFFFF; + int i = hash % length; + int decrement = (hash / length) % length; + if (decrement == 0) { + decrement = 1; + } + + // stop if we find a removed or free slot, or if we find the key itself + // do NOT skip over removed slots (yes, open addressing is like that...) + //int comp = comparisons; + int t = 0; // the number of probes + int p0 = i; // the first position to probe + while (stat[i] == FULL && tab[i] != key) { + t++; + i -= decrement; + //hashCollisions++; + if (i < 0) { + i += length; + } + } + if (stat[i] == FULL) { + // key already contained at slot i. + this.values[i] = value; + return false; + } + // not already contained, should be inserted at slot i. + + if (this.distinct > this.highWaterMark) { + int newCapacity = chooseGrowCapacity(this.distinct + 1, this.minLoadFactor, this.maxLoadFactor); + rehash(newCapacity); + return put(key, value); + } + + /* + Brent's variation does a local reorganization to reduce probes. It essentially means: + We test whether it is possible to move the association we probed first (table[p0]) out of the way. + If this is possible, it will reduce probes for the key to be inserted, since it takes its place; + it gets hit earlier. + However, future probes for the key that we move out of the way will increase. + Thus we only move it out of the way, if we have a net gain, that is, if we save more probes than we loose. + For the first probe we safe more than we loose if the number of probes we needed was >=2 (t>=2). + If the first probe cannot be moved out of the way, we try the next probe (p1). + Now we safe more than we loose if t>=3. + We repeat this until we find that we cannot gain or that we can indeed move p(x) out of the way. + + Note: Under the great majority of insertions t<=1, so the loop is entered very infrequently. + */ + while (t > 1) { + int key0 = tab[p0]; + hash = HashFunctions.hash(key0) & 0x7FFFFFFF; + decrement = (hash / length) % length; + if (decrement == 0) { + decrement = 1; + } + int pc = p0 - decrement; // pc = (p0-j*decrement) % M, j=1,2,.. + if (pc < 0) { + pc += length; + } + + if (stat[pc] != FREE) { // not a free slot, continue searching for free slot to move to, or break. + p0 = pc; + t--; + } else { // free or removed slot found, now move... + tab[pc] = key0; + stat[pc] = FULL; + values[pc] = values[p0]; + i = p0; // prepare to insert: table[p0]=key + t = 0; // break loop + } + } + + this.table[i] = key; + this.values[i] = value; + if (this.state[i] == FREE) { + this.freeEntries--; + } + this.state[i] = FULL; + this.distinct++; + + if (this.freeEntries < 1) { //delta + int newCapacity = chooseGrowCapacity(this.distinct + 1, this.minLoadFactor, this.maxLoadFactor); + rehash(newCapacity); + } + + return true; + } + + /** + * Rehashes the contents of the receiver into a new table with a smaller or larger capacity. This method is called + * automatically when the number of keys in the receiver exceeds the high water mark or falls below the low water + * mark. + */ + @Override + protected void rehash(int newCapacity) { + int oldCapacity = table.length; + //if (oldCapacity == newCapacity) return; + + int[] oldTable = table; + int[] oldValues = values; + byte[] oldState = state; + + int[] newTable = new int[newCapacity]; + int[] newValues = new int[newCapacity]; + byte[] newState = new byte[newCapacity]; + + this.lowWaterMark = chooseLowWaterMark(newCapacity, this.minLoadFactor); + this.highWaterMark = chooseHighWaterMark(newCapacity, this.maxLoadFactor); + + this.table = newTable; + this.values = newValues; + this.state = newState; + this.freeEntries = newCapacity - this.distinct; // delta + + int tmp = this.distinct; + this.distinct = Integer.MIN_VALUE; // switch of watermarks + for (int i = oldCapacity; i-- > 0;) { + if (oldState[i] == FULL) { + put(oldTable[i], oldValues[i]); + /* + int element = oldTable[i]; + int index = indexOfInsertion(element); + newTable[index]=element; + newValues[index]=oldValues[i]; + newState[index]=FULL; + */ + } + } + this.distinct = tmp; + } +} http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/map/package-info.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/map/package-info.java b/core/src/main/java/org/apache/mahout/math/map/package-info.java new file mode 100644 index 0000000..9356f45 --- /dev/null +++ b/core/src/main/java/org/apache/mahout/math/map/package-info.java @@ -0,0 +1,250 @@ +/** + * <HTML> + * <BODY> + * Automatically growing and shrinking maps holding objects or primitive + * data types such as <tt>int</tt>, <tt>double</tt>, etc. Currently all maps are + * based upon hashing. + * <h2><a name="Overview"></a>1. Overview</h2> + * <p>The map package offers flexible object oriented abstractions modelling automatically + * resizing maps. It is designed to be scalable in terms of performance and memory + * requirements.</p> + * <p>Features include: </p> + * <p></p> + * <ul> + * <li>Maps operating on objects as well as all primitive data types such as <code>int</code>, + * <code>double</code>, etc. + * </li> + * <li>Compact representations</li> + * <li>Support for quick access to associations</li> + * <li>A number of general purpose map operations</li> + * </ul> + * <p>File-based I/O can be achieved through the standard Java built-in serialization + * mechanism. All classes implement the {@link java.io.Serializable} interface. + * However, the toolkit is entirely decoupled from advanced I/O. It provides data + * structures and algorithms only. + * <p> This toolkit borrows some terminology from the Javasoft <a + * href="http://www.javasoft.com/products/jdk/1.2/docs/guide/collections/index.html"> + * Collections framework</a> written by Josh Bloch and introduced in JDK 1.2. + * <h2>2. Introduction</h2> + * <p>A map is an associative container that manages a set of (key,value) pairs. + * It is useful for implementing a collection of one-to-one mappings. A (key,value) + * pair is called an <i>association</i>. A value can be looked up up via its key. + * Associations can quickly be set, removed and retrieved. They are stored in a + * hashing structure based on the hash code of their keys, which is obtained by + * using a hash function. </p> + * <p> A map can, for example, contain <tt>Name-->Location</tt> associations like + * <tt>{("Pete", "Geneva"), ("Steve", "Paris"), ("Robert", "New York")}</tt> used + * in address books or <tt>Index-->Value</tt> mappings like <tt>{(0, 100), (3, + * 1000), (100000, 70)}</tt> representing sparse lists or matrices. For example + * this could mean at index 0 we have a value of 100, at index 3 we have a value + * of 1000, at index 1000000 we have a value of 70, and at all other indexes we + * have a value of, say, zero. Another example is a map of IP addresses to domain + * names (DNS). Maps can also be useful to represent<i> multi sets</i>, that is, + * sets where elements can occur more than once. For multi sets one would have + * <tt>Value-->Frequency</tt> mappings like <tt>{(100, 1), (50, 1000), (101, 3))}</tt> + * meaning element 100 occurs 1 time, element 50 occurs 1000 times, element 101 + * occurs 3 times. Further, maps can also manage <tt>ObjectIdentifier-->Object</tt> + * mappings like <tt>{(12, obj1), (7, obj2), (10000, obj3), (9, obj4)}</tt> used + * in Object Databases. + * <p> A map cannot contain two or more <i>equal</i> keys; a key can map to at most + * one value. However, more than one key can map to identical values. For primitive + * data types "equality" of keys is defined as identity (operator <tt>==</tt>). + * For maps using <tt>Object</tt> keys, the meaning of "equality" can be specified + * by the user upon instance construction. It can either be defined to be identity + * (operator <tt>==</tt>) or to be given by the method {@link java.lang.Object#equals(Object)}. + * Associations of kind <tt>(AnyType,Object)</tt> can be of the form <tt>(AnyKey,null) + * </tt>, i.e. values can be <tt>null</tt>. + * <p> The classes of this package make no guarantees as to the order of the elements + * returned by iterators; in particular, they do not guarantee that the order will + * remain constant over time. + * <h2></h2> + * <h4>Copying</h4> + * <p> + * <p>Any map can be copied. A copy is <i>equal</i> to the original but entirely + * independent of the original. So changes in the copy are not reflected in the + * original, and vice-versa. + * <h2>3. Package organization</h2> + * <p>For most primitive data types and for objects there exists a separate map version. + * All versions are just the same, except that they operate on different data types. + * Colt includes two kinds of implementations for maps: The two different implementations + * are tagged <b>Chained</b> and <b>Open</b>. + * Note: Chained is no more included. Wherever it is mentioned it is of historic interest only.</p> + * <ul> + * <li><b>Chained</b> uses extendible separate chaining with chains holding unsorted + * dynamically linked collision lists. + * <li><b>Open</b> uses extendible open addressing with double hashing. + * </ul> + * <p>Class naming follows the schema <tt><Implementation><KeyType><ValueType>HashMap</tt>. + * For example, a {@link org.apache.mahout.math.map.OpenIntDoubleHashMap} holds <tt>(int-->double)</tt> + * associations and is implemented with open addressing. A {@link org.apache.mahout.math.map.OpenIntObjectHashMap} + * holds <tt>(int-->Object)</tt> associations and is implemented with open addressing. + * </p> + * <p>The classes for maps of a given (key,value) type are derived from a common + * abstract base class tagged <tt>Abstract<KeyType><ValueType></tt><tt>Map</tt>. + * For example, all maps operating on <tt>(int-->double)</tt> associations are + * derived from {@link org.apache.mahout.math.map.AbstractIntDoubleMap}, which in turn is derived + * from an abstract base class tying together all maps regardless of assocation + * type, {@link org.apache.mahout.math.set.AbstractSet}. The abstract base classes provide skeleton + * implementations for all but few methods. Experimental layouts (such as chaining, + * open addressing, extensible hashing, red-black-trees, etc.) can easily be implemented + * and inherit a rich set of functionality. Have a look at the javadoc <a href="package-tree.html">tree + * view</a> to get the broad picture.</p> + * <h2>4. Example usage</h2> + * <TABLE> + * <TD CLASS="PRE"> + * <PRE> + * int[] keys = {0 , 3 , 100000, 9 }; + * double[] values = {100.0, 1000.0, 70.0 , 71.0}; + * AbstractIntDoubleMap map = new OpenIntDoubleHashMap(); + * // add several associations + * for (int i=0; i < keys.length; i++) map.put(keys[i], values[i]); + * log.info("map="+map); + * log.info("size="+map.size()); + * log.info(map.containsKey(3)); + * log.info("get(3)="+map.get(3)); + * log.info(map.containsKey(4)); + * log.info("get(4)="+map.get(4)); + * log.info(map.containsValue(71.0)); + * log.info("keyOf(71.0)="+map.keyOf(71.0)); + * // remove one association + * map.removeKey(3); + * log.info("\nmap="+map); + * log.info(map.containsKey(3)); + * log.info("get(3)="+map.get(3)); + * log.info(map.containsValue(1000.0)); + * log.info("keyOf(1000.0)="+map.keyOf(1000.0)); + * // clear + * map.clear(); + * log.info("\nmap="+map); + * log.info("size="+map.size()); + * </PRE> + * </TD> + * </TABLE> + * yields the following output + * <TABLE> + * <TD CLASS="PRE"> + * <PRE> + * map=[0->100.0, 3->1000.0, 9->71.0, 100000->70.0] + * size=4 + * true + * get(3)=1000.0 + * false + * get(4)=0.0 + * true + * keyOf(71.0)=9 + * map=[0->100.0, 9->71.0, 100000->70.0] + * false + * get(3)=0.0 + * false + * keyOf(1000.0)=-2147483648 + * map=[] + * size=0 + * </PRE> + * </TD> + * </TABLE> + * <h2> 5. Notes </h2> + * <p> + * Note that implementations are not synchronized. + * <p> + * Choosing efficient parameters for hash maps is not always easy. + * However, since parameters determine efficiency and memory requirements, here is a quick guide how to choose them. + * If your use case does not heavily operate on hash maps but uses them just because they provide + * convenient functionality, you can safely skip this section. + * For those of you who care, read on. + * <p> + * There are three parameters that can be customized upon map construction: <tt>initialCapacity</tt>, + * <tt>minLoadFactor</tt> and <tt>maxLoadFactor</tt>. + * The more memory one can afford, the faster a hash map. + * The hash map's capacity is the maximum number of associations that can be added without needing to allocate new + * internal memory. + * A larger capacity means faster adding, searching and removing. + * The <tt>initialCapacity</tt> corresponds to the capacity used upon instance construction. + * <p> + * The <tt>loadFactor</tt> of a hash map measures the degree of "fullness". + * It is given by the number of assocations (<tt>size()</tt>) + * divided by the hash map capacity <tt>(0.0 <= loadFactor <= 1.0)</tt>. + * The more associations are added, the larger the loadFactor and the more hash map performance degrades. + * Therefore, when the loadFactor exceeds a customizable threshold (<tt>maxLoadFactor</tt>), the hash map is + * automatically grown. + * In such a way performance degradation can be avoided. + * Similarly, when the loadFactor falls below a customizable threshold (<tt>minLoadFactor</tt>), the hash map is + * automatically shrinked. + * In such a way excessive memory consumption can be avoided. + * Automatic resizing (both growing and shrinking) obeys the following invariant: + * <p> + * <tt>capacity * minLoadFactor <= size() <= capacity * maxLoadFactor</tt> + * <p> The term <tt>capacity * minLoadFactor</tt> is called the <i>low water mark</i>, + * <tt>capacity * maxLoadFactor</tt> is called the <i>high water mark</i>. In other + * words, the number of associations may vary within the water mark constraints. + * When it goes out of range, the map is automatically resized and memory consumption + * changes proportionally. + * <ul> + * <li>To tune for memory at the expense of performance, both increase <tt>minLoadFactor</tt> and + * <tt>maxLoadFactor</tt>. + * <li>To tune for performance at the expense of memory, both decrease <tt>minLoadFactor</tt> and + * <tt>maxLoadFactor</tt>. + * As as special case set <tt>minLoadFactor=0</tt> to avoid any automatic shrinking. + * </ul> + * Resizing large hash maps can be time consuming, <tt>O(size())</tt>, and should be avoided if possible (maintaining + * primes is not the reason). + * Unnecessary growing operations can be avoided if the number of associations is known before they are added, or can be + * estimated.<p> + * In such a case good parameters are as follows: + * <p> + * <i>For chaining:</i> + * <br>Set the <tt>initialCapacity = 1.4*expectedSize</tt> or greater. + * <br>Set the <tt>maxLoadFactor = 0.8</tt> or greater. + * <p> + * <i>For open addressing:</i> + * <br>Set the <tt>initialCapacity = 2*expectedSize</tt> or greater. Alternatively call <tt>ensureCapacity(...)</tt>. + * <br>Set the <tt>maxLoadFactor = 0.5</tt>. + * <br>Never set <tt>maxLoadFactor > 0.55</tt>; open addressing exponentially slows down beyond that point. + * <p> + * In this way the hash map will never need to grow and still stay fast. + * It is never a good idea to set <tt>maxLoadFactor < 0.1</tt>, + * because the hash map would grow too often. + * If it is entirelly unknown how many associations the application will use, + * the default constructor should be used. The map will grow and shrink as needed. + * <p> + * <b>Comparision of chaining and open addressing</b> + * <p> Chaining is faster than open addressing, when assuming unconstrained memory + * consumption. Open addressing is more space efficient than chaining, because + * it does not create entry objects but uses primitive arrays which are considerably + * smaller. Entry objects consume significant amounts of memory compared to the + * information they actually hold. Open addressing also poses no problems to the + * garbage collector. In contrast, chaining can create millions of entry objects + * which are linked; a nightmare for any garbage collector. In addition, entry + * object creation is a bit slow. <br> + * Therefore, with the same amount of memory, or even less memory, hash maps with + * larger capacity can be maintained under open addressing, which yields smaller + * loadFactors, which in turn keeps performance competitive with chaining. In our + * benchmarks, using significantly less memory, open addressing usually is not + * more than 1.2-1.5 times slower than chaining. + * <p><b>Further readings</b>: + * <br>Knuth D., The Art of Computer Programming: Searching and Sorting, 3rd ed. + * <br>Griswold W., Townsend G., The Design and Implementation of Dynamic Hashing for Sets and Tables in Icon, + * Software - Practice and Experience, Vol. 23(4), 351-367 (April 1993). + * <br>Larson P., Dynamic hash tables, Comm. of the ACM, 31, (4), 1988. + * <p> + * <b>Performance:</b> + * <p> + * Time complexity: + * <br>The classes offer <i>expected</i> time complexity <tt>O(1)</tt> (i.e. constant time) for the basic operations + * <tt>put</tt>, <tt>get</tt>, <tt>removeKey</tt>, <tt>containsKey</tt> and <tt>size</tt>, + * assuming the hash function disperses the elements properly among the buckets. + * Otherwise, pathological cases, although highly improbable, can occur, degrading performance to <tt>O(N)</tt> in the + * worst case. + * Operations <tt>containsValue</tt> and <tt>keyOf</tt> are <tt>O(N)</tt>. + * <p> + * Memory requirements for <i>open addressing</i>: + * <br>worst case: <tt>memory [bytes] = (1/minLoadFactor) * size() * (1 + sizeOf(key) + sizeOf(value))</tt>. + * <br>best case: <tt>memory [bytes] = (1/maxLoadFactor) * size() * (1 + sizeOf(key) + sizeOf(value))</tt>. + * Where <tt>sizeOf(int) = 4</tt>, <tt>sizeOf(double) = 8</tt>, <tt>sizeOf(Object) = 4</tt>, etc. + * Thus, an <tt>OpenIntIntHashMap</tt> with minLoadFactor=0.25 and maxLoadFactor=0.5 and 1000000 associations uses + * between 17 MB and 34 MB. + * The same map with 1000 associations uses between 17 and 34 KB. + * <p> + * </BODY> + * </HTML> + */ +package org.apache.mahout.math.map; http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/package-info.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/package-info.java b/core/src/main/java/org/apache/mahout/math/package-info.java new file mode 100644 index 0000000..de664f0 --- /dev/null +++ b/core/src/main/java/org/apache/mahout/math/package-info.java @@ -0,0 +1,4 @@ +/** + * Core base classes; Operations on primitive arrays such as sorting, partitioning and permuting. + */ +package org.apache.mahout.math; http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/random/AbstractSamplerFunction.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/random/AbstractSamplerFunction.java b/core/src/main/java/org/apache/mahout/math/random/AbstractSamplerFunction.java new file mode 100644 index 0000000..d657fd9 --- /dev/null +++ b/core/src/main/java/org/apache/mahout/math/random/AbstractSamplerFunction.java @@ -0,0 +1,39 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.mahout.math.random; + +import org.apache.mahout.math.function.DoubleFunction; + +/** + * This shim allows samplers to be used to initialize vectors. + */ +public abstract class AbstractSamplerFunction extends DoubleFunction implements Sampler<Double> { + /** + * Apply the function to the argument and return the result + * + * @param ignored Ignored argument + * @return A sample from this distribution. + */ + @Override + public double apply(double ignored) { + return sample(); + } + + @Override + public abstract Double sample(); +} http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/random/ChineseRestaurant.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/random/ChineseRestaurant.java b/core/src/main/java/org/apache/mahout/math/random/ChineseRestaurant.java new file mode 100644 index 0000000..8127b92 --- /dev/null +++ b/core/src/main/java/org/apache/mahout/math/random/ChineseRestaurant.java @@ -0,0 +1,111 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.mahout.math.random; + +import com.google.common.base.Preconditions; +import org.apache.mahout.common.RandomUtils; +import org.apache.mahout.math.list.DoubleArrayList; + +import java.util.Random; + +/** + * + * Generates samples from a generalized Chinese restaurant process (or Pittman-Yor process). + * + * The number of values drawn exactly once will asymptotically be equal to the discount parameter + * as the total number of draws T increases without bound. The number of unique values sampled will + * increase as O(alpha * log T) if discount = 0 or O(alpha * T^discount) for discount > 0. + */ +public final class ChineseRestaurant implements Sampler<Integer> { + + private final double alpha; + private double weight = 0; + private double discount = 0; + private final DoubleArrayList weights = new DoubleArrayList(); + private final Random rand = RandomUtils.getRandom(); + + /** + * Constructs a Dirichlet process sampler. This is done by setting discount = 0. + * @param alpha The strength parameter for the Dirichlet process. + */ + public ChineseRestaurant(double alpha) { + this(alpha, 0); + } + + /** + * Constructs a Pitman-Yor sampler. + * + * @param alpha The strength parameter that drives the number of unique values as a function of draws. + * @param discount The discount parameter that drives the percentage of values that occur once in a large sample. + */ + public ChineseRestaurant(double alpha, double discount) { + Preconditions.checkArgument(alpha > 0, "Strength Parameter, alpha must be greater then 0!"); + Preconditions.checkArgument(discount >= 0 && discount <= 1, "Must be: 0 <= discount <= 1"); + this.alpha = alpha; + this.discount = discount; + } + + @Override + public Integer sample() { + double u = rand.nextDouble() * (alpha + weight); + for (int j = 0; j < weights.size(); j++) { + // select existing options with probability (w_j - d) / (alpha + w) + if (u < weights.get(j) - discount) { + weights.set(j, weights.get(j) + 1); + weight++; + return j; + } else { + u -= weights.get(j) - discount; + } + } + + // if no existing item selected, pick new item with probability (alpha - d*t) / (alpha + w) + // where t is number of pre-existing cases + weights.add(1); + weight++; + return weights.size() - 1; + } + + /** + * @return the number of unique values that have been returned. + */ + public int size() { + return weights.size(); + } + + /** + * @return the number draws so far. + */ + public int count() { + return (int) weight; + } + + /** + * @param j Which value to test. + * @return The number of times that j has been returned so far. + */ + public int count(int j) { + Preconditions.checkArgument(j >= 0); + + if (j < weights.size()) { + return (int) weights.get(j); + } else { + return 0; + } + } +} http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/random/Empirical.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/random/Empirical.java b/core/src/main/java/org/apache/mahout/math/random/Empirical.java new file mode 100644 index 0000000..78bfec5 --- /dev/null +++ b/core/src/main/java/org/apache/mahout/math/random/Empirical.java @@ -0,0 +1,124 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.mahout.math.random; + +import com.google.common.base.Preconditions; +import org.apache.mahout.common.RandomUtils; + +import java.util.Random; + +/** + * Samples from an empirical cumulative distribution. + */ +public final class Empirical extends AbstractSamplerFunction { + private final Random gen; + private final boolean exceedMinimum; + private final boolean exceedMaximum; + + private final double[] x; + private final double[] y; + private final int n; + + /** + * Sets up a sampler for a specified empirical cumulative distribution function. The distribution + * can have optional exponential tails on either or both ends, but otherwise does a linear + * interpolation between known points. + * + * @param exceedMinimum Should we generate samples less than the smallest quantile (i.e. generate a left tail)? + * @param exceedMaximum Should we generate samples greater than the largest observed quantile (i.e. generate a right + * tail)? + * @param samples The number of samples observed to get the quantiles. + * @param ecdf Alternating values that represent which percentile (in the [0..1] range) + * and values. For instance, if you have the min, median and max of 1, 3, 10 + * you should pass 0.0, 1, 0.5, 3, 1.0, 10. Note that the list must include + * the 0-th (1.0-th) quantile if the left (right) tail is not allowed. + */ + public Empirical(boolean exceedMinimum, boolean exceedMaximum, int samples, double... ecdf) { + Preconditions.checkArgument(ecdf.length % 2 == 0, "ecdf must have an even count of values"); + Preconditions.checkArgument(samples >= 3, "Sample size must be >= 3"); + + // if we can't exceed the observed bounds, then we have to be given the bounds. + Preconditions.checkArgument(exceedMinimum || ecdf[0] == 0); + Preconditions.checkArgument(exceedMaximum || ecdf[ecdf.length - 2] == 1); + + gen = RandomUtils.getRandom(); + + n = ecdf.length / 2; + x = new double[n]; + y = new double[n]; + + double lastX = ecdf[1]; + double lastY = ecdf[0]; + for (int i = 0; i < ecdf.length; i += 2) { + // values have to be monotonic increasing + Preconditions.checkArgument(i == 0 || ecdf[i + 1] > lastY); + y[i / 2] = ecdf[i + 1]; + lastY = y[i / 2]; + + // quantiles have to be in [0,1] and be monotonic increasing + Preconditions.checkArgument(ecdf[i] >= 0 && ecdf[i] <= 1); + Preconditions.checkArgument(i == 0 || ecdf[i] > lastX); + + x[i / 2] = ecdf[i]; + lastX = x[i / 2]; + } + + // squeeze a bit to allow for unobserved tails + double x0 = exceedMinimum ? 0.5 / samples : 0; + double x1 = 1 - (exceedMaximum ? 0.5 / samples : 0); + for (int i = 0; i < n; i++) { + x[i] = x[i] * (x1 - x0) + x0; + } + + this.exceedMinimum = exceedMinimum; + this.exceedMaximum = exceedMaximum; + } + + @Override + public Double sample() { + return sample(gen.nextDouble()); + } + + public double sample(double u) { + if (exceedMinimum && u < x[0]) { + // generate from left tail + if (u == 0) { + u = 1.0e-16; + } + return y[0] + Math.log(u / x[0]) * x[0] * (y[1] - y[0]) / (x[1] - x[0]); + } else if (exceedMaximum && u > x[n - 1]) { + if (u == 1) { + u = 1 - 1.0e-16; + } + // generate from right tail + double dy = y[n - 1] - y[n - 2]; + double dx = x[n - 1] - x[n - 2]; + return y[n - 1] - Math.log((1 - u) / (1 - x[n - 1])) * (1 - x[n - 1]) * dy / dx; + } else { + // linear interpolation + for (int i = 1; i < n; i++) { + if (x[i] > u) { + double dy = y[i] - y[i - 1]; + double dx = x[i] - x[i - 1]; + return y[i - 1] + (u - x[i - 1]) * dy / dx; + } + } + throw new RuntimeException(String.format("Can't happen (%.3f is not in [%.3f,%.3f]", u, x[0], x[n - 1])); + } + } +} http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/random/IndianBuffet.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/random/IndianBuffet.java b/core/src/main/java/org/apache/mahout/math/random/IndianBuffet.java new file mode 100644 index 0000000..27b5d84 --- /dev/null +++ b/core/src/main/java/org/apache/mahout/math/random/IndianBuffet.java @@ -0,0 +1,157 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.mahout.math.random; + +import com.google.common.base.CharMatcher; +import com.google.common.base.Charsets; +import com.google.common.base.Splitter; +import com.google.common.collect.Iterables; +import com.google.common.collect.Lists; +import com.google.common.io.LineProcessor; +import com.google.common.io.Resources; +import org.apache.mahout.common.RandomUtils; + +import java.io.IOException; +import java.util.List; +import java.util.Random; + +/** + * Samples a "document" from an IndianBuffet process. + * + * See http://mlg.eng.cam.ac.uk/zoubin/talks/turin09.pdf for details + */ +public final class IndianBuffet<T> implements Sampler<List<T>> { + private final List<Integer> count = Lists.newArrayList(); + private int documents = 0; + private final double alpha; + private WordFunction<T> converter = null; + private final Random gen; + + public IndianBuffet(double alpha, WordFunction<T> converter) { + this.alpha = alpha; + this.converter = converter; + gen = RandomUtils.getRandom(); + } + + public static IndianBuffet<Integer> createIntegerDocumentSampler(double alpha) { + return new IndianBuffet<>(alpha, new IdentityConverter()); + } + + public static IndianBuffet<String> createTextDocumentSampler(double alpha) { + return new IndianBuffet<>(alpha, new WordConverter()); + } + + @Override + public List<T> sample() { + List<T> r = Lists.newArrayList(); + if (documents == 0) { + double n = new PoissonSampler(alpha).sample(); + for (int i = 0; i < n; i++) { + r.add(converter.convert(i)); + count.add(1); + } + documents++; + } else { + documents++; + int i = 0; + for (double cnt : count) { + if (gen.nextDouble() < cnt / documents) { + r.add(converter.convert(i)); + count.set(i, count.get(i) + 1); + } + i++; + } + int newItems = new PoissonSampler(alpha / documents).sample().intValue(); + for (int j = 0; j < newItems; j++) { + r.add(converter.convert(i + j)); + count.add(1); + } + } + return r; + } + + private interface WordFunction<T> { + T convert(int i); + } + + /** + * Just converts to an integer. + */ + public static class IdentityConverter implements WordFunction<Integer> { + @Override + public Integer convert(int i) { + return i; + } + } + + /** + * Converts to a string. + */ + public static class StringConverter implements WordFunction<String> { + @Override + public String convert(int i) { + return String.valueOf(i); + } + } + + /** + * Converts to one of a list of common English words for reasonably small integers and converts + * to a token like w_92463 for big integers. + */ + public static final class WordConverter implements WordFunction<String> { + private final Splitter onSpace = Splitter.on(CharMatcher.WHITESPACE).omitEmptyStrings().trimResults(); + private final List<String> words; + + public WordConverter() { + try { + words = Resources.readLines(Resources.getResource("words.txt"), Charsets.UTF_8, + new LineProcessor<List<String>>() { + private final List<String> theWords = Lists.newArrayList(); + + @Override + public boolean processLine(String line) { + Iterables.addAll(theWords, onSpace.split(line)); + return true; + } + + @Override + public List<String> getResult() { + return theWords; + } + }); + } catch (IOException e) { + throw new ImpossibleException(e); + } + } + + @Override + public String convert(int i) { + if (i < words.size()) { + return words.get(i); + } else { + return "w_" + i; + } + } + } + + public static class ImpossibleException extends RuntimeException { + public ImpossibleException(Throwable e) { + super(e); + } + } +} http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/random/Missing.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/random/Missing.java b/core/src/main/java/org/apache/mahout/math/random/Missing.java new file mode 100644 index 0000000..8141a71 --- /dev/null +++ b/core/src/main/java/org/apache/mahout/math/random/Missing.java @@ -0,0 +1,59 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.mahout.math.random; + +import java.util.Random; + +import org.apache.mahout.common.RandomUtils; + +/** + * Models data with missing values. Note that all variables with the same fraction of missing + * values will have the same sequence of missing values. Similarly, if two variables have + * missing probabilities of p1 > p2, then all of the p2 missing values will also be missing for + * p1. + */ +public final class Missing<T> implements Sampler<T> { + private final Random gen; + private final double p; + private final Sampler<T> delegate; + private final T missingMarker; + + public Missing(int seed, double p, Sampler<T> delegate, T missingMarker) { + this.p = p; + this.delegate = delegate; + this.missingMarker = missingMarker; + gen = RandomUtils.getRandom(seed); + } + + public Missing(double p, Sampler<T> delegate, T missingMarker) { + this(1, p, delegate, missingMarker); + } + + public Missing(double p, Sampler<T> delegate) { + this(1, p, delegate, null); + } + + @Override + public T sample() { + if (gen.nextDouble() >= p) { + return delegate.sample(); + } else { + return missingMarker; + } + } +} http://git-wip-us.apache.org/repos/asf/mahout/blob/545648f6/core/src/main/java/org/apache/mahout/math/random/MultiNormal.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/random/MultiNormal.java b/core/src/main/java/org/apache/mahout/math/random/MultiNormal.java new file mode 100644 index 0000000..748d4e8 --- /dev/null +++ b/core/src/main/java/org/apache/mahout/math/random/MultiNormal.java @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.mahout.math.random; + +import org.apache.mahout.common.RandomUtils; +import org.apache.mahout.math.DenseVector; +import org.apache.mahout.math.DiagonalMatrix; +import org.apache.mahout.math.Matrix; +import org.apache.mahout.math.Vector; +import org.apache.mahout.math.function.DoubleFunction; + +import java.util.Random; + +/** + * Samples from a multi-variate normal distribution. + * <p/> + * This is done by sampling from several independent unit normal distributions to get a vector u. + * The sample value that is returned is then A u + m where A is derived from the covariance matrix + * and m is the mean of the result. + * <p/> + * If \Sigma is the desired covariance matrix, then you can use any value of A such that A' A = + * \Sigma. The Cholesky decomposition can be used to compute A if \Sigma is positive definite. + * Slightly more expensive is to use the SVD U S V' = \Sigma and then set A = U \sqrt{S}. + * + * Useful special cases occur when \Sigma is diagonal so that A = \sqrt(\Sigma) or where \Sigma = r I. + * + * Another special case is where m = 0. + */ +public class MultiNormal implements Sampler<Vector> { + private final Random gen; + private final int dimension; + private final Matrix scale; + private final Vector mean; + + /** + * Constructs a sampler with diagonal scale matrix. + * @param diagonal The diagonal elements of the scale matrix. + */ + public MultiNormal(Vector diagonal) { + this(new DiagonalMatrix(diagonal), null); + } + + /** + * Constructs a sampler with diagonal scale matrix and (potentially) + * non-zero mean. + * @param diagonal The scale matrix's principal diagonal. + * @param mean The desired mean. Set to null if zero mean is desired. + */ + public MultiNormal(Vector diagonal, Vector mean) { + this(new DiagonalMatrix(diagonal), mean); + } + + /** + * Constructs a sampler with non-trivial scale matrix and mean. + */ + public MultiNormal(Matrix a, Vector mean) { + this(a, mean, a.columnSize()); + } + + public MultiNormal(int dimension) { + this(null, null, dimension); + } + + public MultiNormal(double radius, Vector mean) { + this(new DiagonalMatrix(radius, mean.size()), mean); + } + + private MultiNormal(Matrix scale, Vector mean, int dimension) { + gen = RandomUtils.getRandom(); + this.dimension = dimension; + this.scale = scale; + this.mean = mean; + } + + @Override + public Vector sample() { + Vector v = new DenseVector(dimension).assign( + new DoubleFunction() { + @Override + public double apply(double ignored) { + return gen.nextGaussian(); + } + } + ); + if (mean != null) { + if (scale != null) { + return scale.times(v).plus(mean); + } else { + return v.plus(mean); + } + } else { + if (scale != null) { + return scale.times(v); + } else { + return v; + } + } + } + + public Vector getScale() { + return mean; + } +}
