http://git-wip-us.apache.org/repos/asf/mahout/blob/e0573de3/math/src/main/java/org/apache/mahout/math/list/package-info.java ---------------------------------------------------------------------- diff --git a/math/src/main/java/org/apache/mahout/math/list/package-info.java b/math/src/main/java/org/apache/mahout/math/list/package-info.java deleted file mode 100644 index 43b5c4b..0000000 --- a/math/src/main/java/org/apache/mahout/math/list/package-info.java +++ /dev/null @@ -1,144 +0,0 @@ -/** - * <HTML> - * <BODY> - * Resizable lists holding objects or primitive data types such as <tt>int</tt>, - * <tt>double</tt>, etc. For non-resizable lists (1-dimensional matrices) see - * package <code>org.apache.mahout.math.matrix</code>.<p></p> - * <h1><a name="Overview"></a>Getting Started</h1> - * <h2>1. Overview</h2> - * <p>The list package offers flexible object oriented abstractions modelling dynamically - * resizing lists holding objects or primitive data types such as <tt>int</tt>, - * <tt>double</tt>, etc. It is designed to be scalable in terms of performance - * and memory requirements.</p> - * <p>Features include: </p> - * <p></p> - * <ul> - * <li>Lists operating on objects as well as all primitive data types such as <tt>int</tt>, - * <tt>double</tt>, etc. - * </li> - * <li>Compact representations</li> - * <li>A number of general purpose list operations including: adding, inserting, - * removing, iterating, searching, sorting, extracting ranges and copying. All - * operations are designed to perform well on mass data. - * </li> - * <li>Support for quick access to list elements. This is achieved by bounds-checking - * and non-bounds-checking accessor methods as well as zero-copy transformations - * to primitive arrays such as <tt>int[]</tt>, <tt>double[]</tt>, etc. - * </li> - * <li>Allows to use high level algorithms on primitive data types without any - * space and time overhead. Operations on primitive arrays, Colt lists and JAL - * algorithms can freely be mixed at zero copy overhead. - * </li> - * </ul> - * <p>File-based I/O can be achieved through the standard Java built-in serialization - * mechanism. All classes implement the {@link java.io.Serializable} interface. - * However, the toolkit is entirely decoupled from advanced I/O. It provides data - * structures and algorithms only. - * <p> This toolkit borrows concepts and terminology from the Javasoft <a - * href="http://www.javasoft.com/products/jdk/1.2/docs/guide/collections/index.html"> - * Collections framework</a> written by Josh Bloch and introduced in JDK 1.2. - * <h2>2. Introduction</h2> - * <p>Lists are fundamental to virtually any application. Large scale resizable lists - * are, for example, used in scientific computations, simulations database management - * systems, to name just a few.</p> - * <h2></h2> - * <p>A list is a container holding elements that can be accessed via zero-based - * indexes. Lists may be implemented in different ways (most commonly with arrays). - * A resizable list automatically grows as elements are added. The lists of this - * package do not automatically shrink. Shrinking needs to be triggered by explicitly - * calling <tt>trimToSize()</tt> methods.</p> - * <p><i>Growing policy</i>: A list implemented with arrays initially has a certain - * <tt>initialCapacity</tt> - per default 10 elements, but customizable upon instance - * construction. As elements are added, this capacity may nomore be sufficient. - * When a list is automatically grown, its capacity is expanded to <tt>1.5*currentCapacity</tt>. - * Thus, excessive resizing (involving copying) is avoided.</p> - * <h4>Copying</h4> - * <p> - * <p>Any list can be copied. A copy is <i>equal</i> to the original but entirely - * independent of the original. So changes in the copy are not reflected in the - * original, and vice-versa. - * <h2>3. Organization of this package</h2> - * <p>Class naming follows the schema <tt><ElementType><ImplementationTechnique>List</tt>. - * For example, we have a {@link org.apache.mahout.math.list.DoubleArrayList}, which is a list - * holding <tt>double</tt> elements implemented with <tt>double</tt>[] arrays. - * </p> - * <p>The classes for lists of a given value type are derived from a common abstract - * base class tagged <tt>Abstract<ElementType></tt><tt>List</tt>. For example, - * all lists operating on <tt>double</tt> elements are derived from - * {@link org.apache.mahout.math.list.AbstractDoubleList}, - * which in turn is derived from an abstract base class tying together all lists - * regardless of value type, {@link org.apache.mahout.math.list.AbstractList}. The abstract - * base classes provide skeleton implementations for all but few methods. Experimental - * data layouts (such as compressed, sparse, linked, etc.) can easily be implemented - * and inherit a rich set of functionality. Have a look at the javadoc <a href="package-tree.html">tree - * view</a> to get the broad picture.</p> - * <h2>4. Example usage</h2> - * <p>The following snippet fills a list, randomizes it, extracts the first half - * of the elements, sums them up and prints the result. It is implemented entirely - * with accessor methods.</p> - * <table> - * <td class="PRE"> - * <pre> - * int s = 1000000;<br>AbstractDoubleList list = new DoubleArrayList(); - * for (int i=0; i<s; i++) { list.add((double)i); } - * list.shuffle(); - * AbstractDoubleList part = list.partFromTo(0,list.size()/2 - 1); - * double sum = 0.0; - * for (int i=0; i<part.size(); i++) { sum += part.get(i); } - * log.info(sum); - * </pre> - * </td> - * </table> - * <p> For efficiency, all classes provide back doors to enable getting/setting the - * backing array directly. In this way, the high level operations of these classes - * can be used where appropriate, and one can switch to <tt>[]</tt>-array index - * notations where necessary. The key methods for this are <tt>public <ElementType>[] - * elements()</tt> and <tt>public void elements(<ElementType>[])</tt>. The - * former trustingly returns the array it internally keeps to store the elements. - * Holding this array in hand, we can use the <tt>[]</tt>-array operator to - * perform iteration over large lists without needing to copy the array or paying - * the performance penalty introduced by accessor methods. Alternatively any JAL - * algorithm (or other algorithm) can operate on the returned primitive array. - * The latter method forces a list to internally hold a user provided array. Using - * this approach one can avoid needing to copy the elements into the list. - * <p>As a consequence, operations on primitive arrays, Colt lists and JAL algorithms - * can freely be mixed at zero-copy overhead. - * <p> Note that such special treatment certainly breaks encapsulation. This functionality - * is provided for performance reasons only and should only be used when absolutely - * necessary. Here is the above example in mixed notation: - * <table> - * <td class="PRE"> - * <pre> - * int s = 1000000;<br>DoubleArrayList list = new DoubleArrayList(s); // list.size()==0, capacity==s - * list.setSize(s); // list.size()==s<br>double[] values = list.elements(); - * // zero copy, values.length==s<br>for (int i=0; i<s; i++) { values[i]=(double)i; } - * list.shuffle(); - * double sum = 0.0; - * int limit = values.length/2; - * for (int i=0; i<limit; i++) { sum += values[i]; } - * log.info(sum); - * </pre> - * </td> - * </table> - * <p> Or even more compact using lists as algorithm objects: - * <table> - * <td class="PRE"> - * <pre> - * int s = 1000000;<br>double[] values = new double[s]; - * for (int i=0; i<s; i++) { values[i]=(double)i; } - * new DoubleArrayList(values).shuffle(); // zero-copy, shuffle via back door - * double sum = 0.0; - * int limit = values.length/2; - * for (int i=0; i<limit; i++) { sum += values[i]; } - * log.info(sum); - * </pre> - * </td> - * </table> - * <p> - * <h2>5. Notes </h2> - * <p>The quicksorts and mergesorts are the JDK 1.2 V1.26 algorithms, modified as - * necessary to operate on the given data types. - * </BODY> - * </HTML> - */ -package org.apache.mahout.math.list;
http://git-wip-us.apache.org/repos/asf/mahout/blob/e0573de3/math/src/main/java/org/apache/mahout/math/map/HashFunctions.java ---------------------------------------------------------------------- diff --git a/math/src/main/java/org/apache/mahout/math/map/HashFunctions.java b/math/src/main/java/org/apache/mahout/math/map/HashFunctions.java deleted file mode 100644 index b749307..0000000 --- a/math/src/main/java/org/apache/mahout/math/map/HashFunctions.java +++ /dev/null @@ -1,115 +0,0 @@ -/* -Copyright 1999 CERN - European Organization for Nuclear Research. -Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose -is hereby granted without fee, provided that the above copyright notice appear in all copies and -that both that copyright notice and this permission notice appear in supporting documentation. -CERN makes no representations about the suitability of this software for any purpose. -It is provided "as is" without expressed or implied warranty. -*/ -package org.apache.mahout.math.map; - - -/** - * Provides various hash functions. - */ -public final class HashFunctions { - - /** - * Utility class pattern: all static members, no inheritance. - */ - private HashFunctions() { - } - - /** - * Returns a hashcode for the specified value. - * - * @return a hash code value for the specified value. - */ - public static int hash(char value) { - return value; - } - - /** - * Returns a hashcode for the specified value. - * - * @return a hash code value for the specified value. - */ - public static int hash(double value) { - long bits = Double.doubleToLongBits(value); - return (int) (bits ^ (bits >>> 32)); - - //return (int) Double.doubleToLongBits(value*663608941.737); - // this avoids excessive hashCollisions in the case values are of the form (1.0, 2.0, 3.0, ...) - } - - /** - * Returns a hashcode for the specified value. - * - * @return a hash code value for the specified value. - */ - public static int hash(float value) { - return Float.floatToIntBits(value * 663608941.737f); - // this avoids excessive hashCollisions in the case values are of the form (1.0, 2.0, 3.0, ...) - } - - /** - * Returns a hashcode for the specified value. - * The hashcode computation is similar to the last step - * of MurMurHash3. - * - * @return a hash code value for the specified value. - */ - public static int hash(int value) { - int h = value; - h ^= h >>> 16; - h *= 0x85ebca6b; - h ^= h >>> 13; - h *= 0xc2b2ae35; - h ^= h >>> 16; - return h; - } - - /** - * Returns a hashcode for the specified value. - * - * @return a hash code value for the specified value. - */ - public static int hash(long value) { - return (int) (value ^ (value >> 32)); - /* - value &= 0x7FFFFFFFFFFFFFFFL; // make it >=0 (0x7FFFFFFFFFFFFFFFL==Long.MAX_VALUE) - int hashCode = 0; - do hashCode = 31*hashCode + (int) (value%10); - while ((value /= 10) > 0); - - return 28629151*hashCode; // spread even further; h*31^5 - */ - } - - /** - * Returns a hashcode for the specified object. - * - * @return a hash code value for the specified object. - */ - public static int hash(Object object) { - return object == null ? 0 : object.hashCode(); - } - - /** - * Returns a hashcode for the specified value. - * - * @return a hash code value for the specified value. - */ - public static int hash(short value) { - return value; - } - - /** - * Returns a hashcode for the specified value. - * - * @return a hash code value for the specified value. - */ - public static int hash(boolean value) { - return value ? 1231 : 1237; - } -} http://git-wip-us.apache.org/repos/asf/mahout/blob/e0573de3/math/src/main/java/org/apache/mahout/math/map/OpenHashMap.java ---------------------------------------------------------------------- diff --git a/math/src/main/java/org/apache/mahout/math/map/OpenHashMap.java b/math/src/main/java/org/apache/mahout/math/map/OpenHashMap.java deleted file mode 100644 index 0efca4b..0000000 --- a/math/src/main/java/org/apache/mahout/math/map/OpenHashMap.java +++ /dev/null @@ -1,652 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -/* -Copyright � 1999 CERN - European Organization for Nuclear Research. -Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose -is hereby granted without fee, provided that the above copyright notice appear in all copies and -that both that copyright notice and this permission notice appear in supporting documentation. -CERN makes no representations about the suitability of this software for any purpose. -It is provided "as is" without expressed or implied warranty. -*/ -package org.apache.mahout.math.map; - -import java.util.ArrayList; -import java.util.Arrays; -import java.util.Collection; -import java.util.List; -import java.util.Map; -import java.util.Set; - -import org.apache.mahout.math.function.ObjectObjectProcedure; -import org.apache.mahout.math.function.ObjectProcedure; -import org.apache.mahout.math.set.AbstractSet; -import org.apache.mahout.math.set.OpenHashSet; - -/** - * Open hash map. This implements Map, but it does not respect several aspects of the Map contract - * that impose the very sorts of performance penalities that this class exists to avoid. - * {@link #entrySet}, {@link #values}, and {@link #keySet()} do <strong>not</strong> return - * collections that share storage with the main map, and changes to those returned objects - * are <strong>not</strong> reflected in the container. - **/ -public class OpenHashMap<K,V> extends AbstractSet implements Map<K,V> { - protected static final byte FREE = 0; - protected static final byte FULL = 1; - protected static final byte REMOVED = 2; - protected static final Object NO_KEY_VALUE = null; - - /** The hash table keys. */ - protected Object[] table; - - /** The hash table values. */ - protected Object[] values; - - /** The state of each hash table entry (FREE, FULL, REMOVED). */ - protected byte[] state; - - /** The number of table entries in state==FREE. */ - protected int freeEntries; - - - /** Constructs an empty map with default capacity and default load factors. */ - public OpenHashMap() { - this(DEFAULT_CAPACITY); - } - - /** - * Constructs an empty map with the specified initial capacity and default load factors. - * - * @param initialCapacity the initial capacity of the map. - * @throws IllegalArgumentException if the initial capacity is less than zero. - */ - public OpenHashMap(int initialCapacity) { - this(initialCapacity, DEFAULT_MIN_LOAD_FACTOR, DEFAULT_MAX_LOAD_FACTOR); - } - - /** - * Constructs an empty map with the specified initial capacity and the specified minimum and maximum load factor. - * - * @param initialCapacity the initial capacity. - * @param minLoadFactor the minimum load factor. - * @param maxLoadFactor the maximum load factor. - * @throws IllegalArgumentException if <tt>initialCapacity < 0 || (minLoadFactor < 0.0 || minLoadFactor >= 1.0) || - * (maxLoadFactor <= 0.0 || maxLoadFactor >= 1.0) || (minLoadFactor >= - * maxLoadFactor)</tt>. - */ - public OpenHashMap(int initialCapacity, double minLoadFactor, double maxLoadFactor) { - setUp(initialCapacity, minLoadFactor, maxLoadFactor); - } - - /** Removes all (key,value) associations from the receiver. Implicitly calls <tt>trimToSize()</tt>. */ - @Override - public void clear() { - Arrays.fill(this.state, FREE); - distinct = 0; - freeEntries = table.length; // delta - trimToSize(); - } - - /** - * Returns a deep copy of the receiver. - * - * @return a deep copy of the receiver. - */ - @Override - @SuppressWarnings("unchecked") - public Object clone() { - OpenHashMap<K,V> copy = (OpenHashMap<K,V>) super.clone(); - copy.table = copy.table.clone(); - copy.values = copy.values.clone(); - copy.state = copy.state.clone(); - return copy; - } - - /** - * Returns <tt>true</tt> if the receiver contains the specified key. - * - * @return <tt>true</tt> if the receiver contains the specified key. - */ - @SuppressWarnings("unchecked") - @Override - public boolean containsKey(Object key) { - return indexOfKey((K)key) >= 0; - } - - /** - * Returns <tt>true</tt> if the receiver contains the specified value. - * - * @return <tt>true</tt> if the receiver contains the specified value. - */ - @SuppressWarnings("unchecked") - @Override - public boolean containsValue(Object value) { - return indexOfValue((V)value) >= 0; - } - - /** - * Ensures that the receiver can hold at least the specified number of associations without needing to allocate new - * internal memory. If necessary, allocates new internal memory and increases the capacity of the receiver. <p> This - * method never need be called; it is for performance tuning only. Calling this method before <tt>put()</tt>ing a - * large number of associations boosts performance, because the receiver will grow only once instead of potentially - * many times and hash collisions get less probable. - * - * @param minCapacity the desired minimum capacity. - */ - @Override - public void ensureCapacity(int minCapacity) { - if (table.length < minCapacity) { - int newCapacity = nextPrime(minCapacity); - rehash(newCapacity); - } - } - - /** - * Applies a procedure to each key of the receiver, if any. Note: Iterates over the keys in no particular order. - * Subclasses can define a particular order, for example, "sorted by key". All methods which <i>can</i> be expressed - * in terms of this method (most methods can) <i>must guarantee</i> to use the <i>same</i> order defined by this - * method, even if it is no particular order. This is necessary so that, for example, methods <tt>keys</tt> and - * <tt>values</tt> will yield association pairs, not two uncorrelated lists. - * - * @param procedure the procedure to be applied. Stops iteration if the procedure returns <tt>false</tt>, otherwise - * continues. - * @return <tt>false</tt> if the procedure stopped before all keys where iterated over, <tt>true</tt> otherwise. - */ - @SuppressWarnings("unchecked") - public boolean forEachKey(ObjectProcedure<K> procedure) { - for (int i = table.length; i-- > 0;) { - if (state[i] == FULL && !procedure.apply((K)table[i])) { - return false; - } - } - return true; - } - - /** - * Applies a procedure to each (key,value) pair of the receiver, if any. Iteration order is guaranteed to be - * <i>identical</i> to the order used by method {@link #forEachKey(ObjectProcedure)}. - * - * @param procedure the procedure to be applied. Stops iteration if the procedure returns <tt>false</tt>, otherwise - * continues. - * @return <tt>false</tt> if the procedure stopped before all keys where iterated over, <tt>true</tt> otherwise. - */ - @SuppressWarnings("unchecked") - public boolean forEachPair(ObjectObjectProcedure<K,V> procedure) { - for (int i = table.length; i-- > 0;) { - if (state[i] == FULL && !procedure.apply((K)table[i], (V)values[i])) { - return false; - } - } - return true; - } - - /** - * Returns the value associated with the specified key. It is often a good idea to first check with {@link - * #containsKey(Object)} whether the given key has a value associated or not, i.e. whether there exists an association - * for the given key or not. - * - * @param key the key to be searched for. - * @return the value associated with the specified key; <tt>0</tt> if no such key is present. - */ - @SuppressWarnings("unchecked") - @Override - public V get(Object key) { - int i = indexOfKey((K)key); - if (i < 0) { - return null; - } //not contained - return (V)values[i]; - } - - /** - * @param key the key to be added to the receiver. - * @return the index where the key would need to be inserted, if it is not already contained. Returns -index-1 if the - * key is already contained at slot index. Therefore, if the returned index < 0, then it is already contained - * at slot -index-1. If the returned index >= 0, then it is NOT already contained and should be inserted at - * slot index. - */ - protected int indexOfInsertion(K key) { - Object[] tab = table; - byte[] stat = state; - int length = tab.length; - - int hash = key.hashCode() & 0x7FFFFFFF; - int i = hash % length; - int decrement = hash % (length - 2); // double hashing, see http://www.eece.unm.edu/faculty/heileman/hash/node4.html - //int decrement = (hash / length) % length; - if (decrement == 0) { - decrement = 1; - } - - // stop if we find a removed or free slot, or if we find the key itself - // do NOT skip over removed slots (yes, open addressing is like that...) - while (stat[i] == FULL && !equalsMindTheNull(key, tab[i])) { - i -= decrement; - //hashCollisions++; - if (i < 0) { - i += length; - } - } - - if (stat[i] == REMOVED) { - // stop if we find a free slot, or if we find the key itself. - // do skip over removed slots (yes, open addressing is like that...) - // assertion: there is at least one FREE slot. - int j = i; - while (stat[i] != FREE && (stat[i] == REMOVED || tab[i] != key)) { - i -= decrement; - //hashCollisions++; - if (i < 0) { - i += length; - } - } - if (stat[i] == FREE) { - i = j; - } - } - - - if (stat[i] == FULL) { - // key already contained at slot i. - // return a negative number identifying the slot. - return -i - 1; - } - // not already contained, should be inserted at slot i. - // return a number >= 0 identifying the slot. - return i; - } - - /** - * @param key the key to be searched in the receiver. - * @return the index where the key is contained in the receiver, returns -1 if the key was not found. - */ - protected int indexOfKey(K key) { - Object[] tab = table; - byte[] stat = state; - int length = tab.length; - - int hash = key.hashCode() & 0x7FFFFFFF; - int i = hash % length; - int decrement = hash % (length - 2); // double hashing, see http://www.eece.unm.edu/faculty/heileman/hash/node4.html - //int decrement = (hash / length) % length; - if (decrement == 0) { - decrement = 1; - } - - // stop if we find a free slot, or if we find the key itself. - // do skip over removed slots (yes, open addressing is like that...) - while (stat[i] != FREE && (stat[i] == REMOVED || !equalsMindTheNull(key, tab[i]))) { - i -= decrement; - //hashCollisions++; - if (i < 0) { - i += length; - } - } - - if (stat[i] == FREE) { - return -1; - } // not found - return i; //found, return index where key is contained - } - - /** - * @param value the value to be searched in the receiver. - * @return the index where the value is contained in the receiver, returns -1 if the value was not found. - */ - protected int indexOfValue(V value) { - Object[] val = values; - byte[] stat = state; - - for (int i = stat.length; --i >= 0;) { - if (stat[i] == FULL && equalsMindTheNull(val[i], value)) { - return i; - } - } - - return -1; // not found - } - - /** - * Fills all keys contained in the receiver into the specified list. Fills the list, starting at index 0. After this - * call returns the specified list has a new size that equals <tt>this.size()</tt>. - * This method can be used - * to iterate over the keys of the receiver. - * - * @param list the list to be filled, can have any size. - */ - @SuppressWarnings("unchecked") - public void keys(List<K> list) { - list.clear(); - - - Object [] tab = table; - byte[] stat = state; - - for (int i = tab.length; i-- > 0;) { - if (stat[i] == FULL) { - list.add((K)tab[i]); - } - } - } - - /** - * Associates the given key with the given value. Replaces any old <tt>(key,someOtherValue)</tt> association, if - * existing. - * - * @param key the key the value shall be associated with. - * @param value the value to be associated. - * @return <tt>true</tt> if the receiver did not already contain such a key; <tt>false</tt> if the receiver did - * already contain such a key - the new value has now replaced the formerly associated value. - */ - @SuppressWarnings("unchecked") - @Override - public V put(K key, V value) { - int i = indexOfInsertion(key); - if (i < 0) { //already contained - i = -i - 1; - V previous = (V) this.values[i]; - this.values[i] = value; - return previous; - } - - if (this.distinct > this.highWaterMark) { - int newCapacity = chooseGrowCapacity(this.distinct + 1, this.minLoadFactor, this.maxLoadFactor); - rehash(newCapacity); - return put(key, value); - } - - this.table[i] = key; - this.values[i] = value; - if (this.state[i] == FREE) { - this.freeEntries--; - } - this.state[i] = FULL; - this.distinct++; - - if (this.freeEntries < 1) { //delta - int newCapacity = chooseGrowCapacity(this.distinct + 1, this.minLoadFactor, this.maxLoadFactor); - rehash(newCapacity); - } - - return null; - } - - /** - * Rehashes the contents of the receiver into a new table with a smaller or larger capacity. This method is called - * automatically when the number of keys in the receiver exceeds the high water mark or falls below the low water - * mark. - */ - @SuppressWarnings("unchecked") - protected void rehash(int newCapacity) { - int oldCapacity = table.length; - //if (oldCapacity == newCapacity) return; - - Object[] oldTable = table; - Object[] oldValues = values; - byte[] oldState = state; - - Object[] newTable = new Object[newCapacity]; - Object[] newValues = new Object[newCapacity]; - byte[] newState = new byte[newCapacity]; - - this.lowWaterMark = chooseLowWaterMark(newCapacity, this.minLoadFactor); - this.highWaterMark = chooseHighWaterMark(newCapacity, this.maxLoadFactor); - - this.table = newTable; - this.values = newValues; - this.state = newState; - this.freeEntries = newCapacity - this.distinct; // delta - - for (int i = oldCapacity; i-- > 0;) { - if (oldState[i] == FULL) { - Object element = oldTable[i]; - int index = indexOfInsertion((K)element); - newTable[index] = element; - newValues[index] = oldValues[i]; - newState[index] = FULL; - } - } - } - - /** - * Removes the given key with its associated element from the receiver, if present. - * - * @param key the key to be removed from the receiver. - * @return <tt>true</tt> if the receiver contained the specified key, <tt>false</tt> otherwise. - */ - @SuppressWarnings("unchecked") - @Override - public V remove(Object key) { - int i = indexOfKey((K)key); - if (i < 0) { - return null; - } - // key not contained - V removed = (V) values[i]; - - this.state[i] = REMOVED; - //this.values[i]=0; // delta - this.distinct--; - - if (this.distinct < this.lowWaterMark) { - int newCapacity = chooseShrinkCapacity(this.distinct, this.minLoadFactor, this.maxLoadFactor); - rehash(newCapacity); - } - - return removed; - } - - /** - * Initializes the receiver. - * - * @param initialCapacity the initial capacity of the receiver. - * @param minLoadFactor the minLoadFactor of the receiver. - * @param maxLoadFactor the maxLoadFactor of the receiver. - * @throws IllegalArgumentException if <tt>initialCapacity < 0 || (minLoadFactor < 0.0 || minLoadFactor >= 1.0) || - * (maxLoadFactor <= 0.0 || maxLoadFactor >= 1.0) || (minLoadFactor >= - * maxLoadFactor)</tt>. - */ - @Override - protected void setUp(int initialCapacity, double minLoadFactor, double maxLoadFactor) { - int capacity = initialCapacity; - super.setUp(capacity, minLoadFactor, maxLoadFactor); - capacity = nextPrime(capacity); - if (capacity == 0) { - capacity = 1; - } // open addressing needs at least one FREE slot at any time. - - this.table = new Object[capacity]; - this.values = new Object[capacity]; - this.state = new byte[capacity]; - - // memory will be exhausted long before this pathological case happens, anyway. - this.minLoadFactor = minLoadFactor; - if (capacity == PrimeFinder.LARGEST_PRIME) { - this.maxLoadFactor = 1.0; - } else { - this.maxLoadFactor = maxLoadFactor; - } - - this.distinct = 0; - this.freeEntries = capacity; // delta - - // lowWaterMark will be established upon first expansion. - // establishing it now (upon instance construction) would immediately make the table shrink upon first put(...). - // After all the idea of an "initialCapacity" implies violating lowWaterMarks when an object is young. - // See ensureCapacity(...) - this.lowWaterMark = 0; - this.highWaterMark = chooseHighWaterMark(capacity, this.maxLoadFactor); - } - - /** - * Trims the capacity of the receiver to be the receiver's current size. Releases any superfluous internal memory. An - * application can use this operation to minimize the storage of the receiver. - */ - @Override - public void trimToSize() { - // * 1.2 because open addressing's performance exponentially degrades beyond that point - // so that even rehashing the table can take very long - int newCapacity = nextPrime((int) (1 + 1.2 * size())); - if (table.length > newCapacity) { - rehash(newCapacity); - } - } - - /** - * Access for unit tests. - * @param capacity - * @param minLoadFactor - * @param maxLoadFactor - */ - void getInternalFactors(int[] capacity, - double[] minLoadFactor, - double[] maxLoadFactor) { - capacity[0] = table.length; - minLoadFactor[0] = this.minLoadFactor; - maxLoadFactor[0] = this.maxLoadFactor; - } - - private class MapEntry implements Map.Entry<K,V> { - private final K key; - private final V value; - - MapEntry(K key, V value) { - this.key = key; - this.value = value; - } - - @Override - public K getKey() { - return key; - } - - @Override - public V getValue() { - return value; - } - - @Override - public V setValue(V value) { - throw new UnsupportedOperationException("Map.Entry.setValue not supported for OpenHashMap"); - } - - } - - /** - * Allocate a set to contain Map.Entry objects for the pairs and return it. - */ - @Override - public Set<java.util.Map.Entry<K,V>> entrySet() { - final Set<Entry<K, V>> entries = new OpenHashSet<>(); - forEachPair(new ObjectObjectProcedure<K,V>() { - @Override - public boolean apply(K key, V value) { - entries.add(new MapEntry(key, value)); - return true; - } - }); - return entries; - } - - /** - * Allocate a set to contain keys and return it. - * This violates the 'backing' provisions of the map interface. - */ - @Override - public Set<K> keySet() { - final Set<K> keys = new OpenHashSet<>(); - forEachKey(new ObjectProcedure<K>() { - @Override - public boolean apply(K element) { - keys.add(element); - return true; - } - }); - return keys; - } - - @Override - public void putAll(Map<? extends K,? extends V> m) { - for (Map.Entry<? extends K, ? extends V> e : m.entrySet()) { - put(e.getKey(), e.getValue()); - } - } - - /** - * Allocate a list to contain the values and return it. - * This violates the 'backing' provision of the Map interface. - */ - @Override - public Collection<V> values() { - final List<V> valueList = new ArrayList<>(); - forEachPair(new ObjectObjectProcedure<K,V>() { - @Override - public boolean apply(K key, V value) { - valueList.add(value); - return true; - } - }); - return valueList; - } - - @SuppressWarnings("unchecked") - @Override - public boolean equals(Object obj) { - if (!(obj instanceof OpenHashMap)) { - return false; - } - final OpenHashMap<K,V> o = (OpenHashMap<K,V>) obj; - if (o.size() != size()) { - return false; - } - final boolean[] equal = new boolean[1]; - equal[0] = true; - forEachPair(new ObjectObjectProcedure<K,V>() { - @Override - public boolean apply(K key, V value) { - Object ov = o.get(key); - if (!value.equals(ov)) { - equal[0] = false; - return false; - } - return true; - } - }); - return equal[0]; - } - - @Override - public String toString() { - final StringBuilder sb = new StringBuilder(); - sb.append('{'); - forEachPair(new ObjectObjectProcedure<K,V>() { - @Override - public boolean apply(K key, V value) { - sb.append('['); - sb.append(key); - sb.append(" -> "); - sb.append(value); - sb.append("] "); - return true; - } - }); - sb.append('}'); - return sb.toString(); - } -} http://git-wip-us.apache.org/repos/asf/mahout/blob/e0573de3/math/src/main/java/org/apache/mahout/math/map/PrimeFinder.java ---------------------------------------------------------------------- diff --git a/math/src/main/java/org/apache/mahout/math/map/PrimeFinder.java b/math/src/main/java/org/apache/mahout/math/map/PrimeFinder.java deleted file mode 100644 index b02611e..0000000 --- a/math/src/main/java/org/apache/mahout/math/map/PrimeFinder.java +++ /dev/null @@ -1,145 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -package org.apache.mahout.math.map; - -import java.util.Arrays; - -/** - * Not of interest for users; only for implementors of hashtables. - * Used to keep hash table capacities prime numbers. - * - * <p>Choosing prime numbers as hash table capacities is a good idea to keep them working fast, - * particularly under hash table expansions. - * - * <p>However, JDK 1.2, JGL 3.1 and many other toolkits do nothing to keep capacities prime. - * This class provides efficient means to choose prime capacities. - * - * <p>Choosing a prime is <tt>O(log 300)</tt> (binary search in a list of 300 int's). - * Memory requirements: 1 KB static memory. - * - */ -public final class PrimeFinder { - - /** The largest prime this class can generate; currently equal to <tt>Integer.MAX_VALUE</tt>. */ - public static final int LARGEST_PRIME = Integer.MAX_VALUE; //yes, it is prime. - - /** - * The prime number list consists of 11 chunks. Each chunk contains prime numbers. A chunk starts with a prime P1. The - * next element is a prime P2. P2 is the smallest prime for which holds: P2 >= 2*P1. The next element is P3, for which - * the same holds with respect to P2, and so on. - * - * Chunks are chosen such that for any desired capacity >= 1000 the list includes a prime number <= desired capacity * - * 1.11 (11%). For any desired capacity >= 200 the list includes a prime number <= desired capacity * 1.16 (16%). For - * any desired capacity >= 16 the list includes a prime number <= desired capacity * 1.21 (21%). - * - * Therefore, primes can be retrieved which are quite close to any desired capacity, which in turn avoids wasting - * memory. For example, the list includes 1039,1117,1201,1277,1361,1439,1523,1597,1759,1907,2081. So if you need a - * prime >= 1040, you will find a prime <= 1040*1.11=1154. - * - * Chunks are chosen such that they are optimized for a hashtable growthfactor of 2.0; If your hashtable has such a - * growthfactor then, after initially "rounding to a prime" upon hashtable construction, it will later expand to prime - * capacities such that there exist no better primes. - * - * In total these are about 32*10=320 numbers -> 1 KB of static memory needed. If you are stingy, then delete every - * second or fourth chunk. - */ - - private static final int[] PRIME_CAPACITIES = { - //chunk #0 - LARGEST_PRIME, - - //chunk #1 - 5, 11, 23, 47, 97, 197, 397, 797, 1597, 3203, 6421, 12853, 25717, 51437, 102877, 205759, - 411527, 823117, 1646237, 3292489, 6584983, 13169977, 26339969, 52679969, 105359939, - 210719881, 421439783, 842879579, 1685759167, - - //chunk #2 - 433, 877, 1759, 3527, 7057, 14143, 28289, 56591, 113189, 226379, 452759, 905551, 1811107, - 3622219, 7244441, 14488931, 28977863, 57955739, 115911563, 231823147, 463646329, 927292699, - 1854585413, - - //chunk #3 - 953, 1907, 3821, 7643, 15287, 30577, 61169, 122347, 244703, 489407, 978821, 1957651, 3915341, - 7830701, 15661423, 31322867, 62645741, 125291483, 250582987, 501165979, 1002331963, - 2004663929, - - //chunk #4 - 1039, 2081, 4177, 8363, 16729, 33461, 66923, 133853, 267713, 535481, 1070981, 2141977, 4283963, - 8567929, 17135863, 34271747, 68543509, 137087021, 274174111, 548348231, 1096696463, - - //chunk #5 - 31, 67, 137, 277, 557, 1117, 2237, 4481, 8963, 17929, 35863, 71741, 143483, 286973, 573953, - 1147921, 2295859, 4591721, 9183457, 18366923, 36733847, 73467739, 146935499, 293871013, - 587742049, 1175484103, - - //chunk #6 - 599, 1201, 2411, 4831, 9677, 19373, 38747, 77509, 155027, 310081, 620171, 1240361, 2480729, - 4961459, 9922933, 19845871, 39691759, 79383533, 158767069, 317534141, 635068283, 1270136683, - - //chunk #7 - 311, 631, 1277, 2557, 5119, 10243, 20507, 41017, 82037, 164089, 328213, 656429, 1312867, - 2625761, 5251529, 10503061, 21006137, 42012281, 84024581, 168049163, 336098327, 672196673, - 1344393353, - - //chunk #8 - 3, 7, 17, 37, 79, 163, 331, 673, 1361, 2729, 5471, 10949, 21911, 43853, 87719, 175447, 350899, - 701819, 1403641, 2807303, 5614657, 11229331, 22458671, 44917381, 89834777, 179669557, - 359339171, 718678369, 1437356741, - - //chunk #9 - 43, 89, 179, 359, 719, 1439, 2879, 5779, 11579, 23159, 46327, 92657, 185323, 370661, 741337, - 1482707, 2965421, 5930887, 11861791, 23723597, 47447201, 94894427, 189788857, 379577741, - 759155483, 1518310967, - - //chunk #10 - 379, 761, 1523, 3049, 6101, 12203, 24407, 48817, 97649, 195311, 390647, 781301, 1562611, - 3125257, 6250537, 12501169, 25002389, 50004791, 100009607, 200019221, 400038451, 800076929, - 1600153859 - }; - - - static { //initializer - // The above prime numbers are formatted for human readability. - // To find numbers fast, we sort them once and for all. - - Arrays.sort(PRIME_CAPACITIES); - } - - /** Makes this class non instantiable, but still let's others inherit from it. */ - private PrimeFinder() { - } - - /** - * Returns a prime number which is {@code <= desiredCapacity} and very close to {@code desiredCapacity} - * (within 11% if {@code desiredCapacity <= 1000}). - * - * @param desiredCapacity the capacity desired by the user. - * @return the capacity which should be used for a hashtable. - */ - public static int nextPrime(int desiredCapacity) { - int i = java.util.Arrays.binarySearch(PRIME_CAPACITIES, desiredCapacity); - if (i < 0) { - // desired capacity not found, choose next prime greater than desired capacity - i = -i - 1; // remember the semantics of binarySearch... - } - return PRIME_CAPACITIES[i]; - } - -} http://git-wip-us.apache.org/repos/asf/mahout/blob/e0573de3/math/src/main/java/org/apache/mahout/math/map/QuickOpenIntIntHashMap.java ---------------------------------------------------------------------- diff --git a/math/src/main/java/org/apache/mahout/math/map/QuickOpenIntIntHashMap.java b/math/src/main/java/org/apache/mahout/math/map/QuickOpenIntIntHashMap.java deleted file mode 100644 index 6a7cef8..0000000 --- a/math/src/main/java/org/apache/mahout/math/map/QuickOpenIntIntHashMap.java +++ /dev/null @@ -1,215 +0,0 @@ -/* -Copyright � 1999 CERN - European Organization for Nuclear Research. -Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose -is hereby granted without fee, provided that the above copyright notice appear in all copies and -that both that copyright notice and this permission notice appear in supporting documentation. -CERN makes no representations about the suitability of this software for any purpose. -It is provided "as is" without expressed or implied warranty. -*/ -package org.apache.mahout.math.map; - -/** - * Status: Experimental; Do not use for production yet. Hash map holding (key,value) associations of type - * <tt>(int-->int)</tt>; Automatically grows and shrinks as needed; Implemented using open addressing with double - * hashing. First see the <a href="package-summary.html">package summary</a> and javadoc <a - * href="package-tree.html">tree view</a> to get the broad picture. - * - * Implements open addressing with double hashing, using "Brent's variation". Brent's variation slows insertions a bit - * down (not much) but reduces probes (collisions) for successful searches, in particular for large load factors. (It - * does not improve unsuccessful searches.) See D. Knuth, Searching and Sorting, 3rd ed., p.533-545 - * - * @author [email protected] - * @version 1.0, 09/24/99 - * @see java.util.HashMap - */ -class QuickOpenIntIntHashMap extends OpenIntIntHashMap { - //public int totalProbesSaved = 0; // benchmark only - - /** Constructs an empty map with default capacity and default load factors. */ - QuickOpenIntIntHashMap() { - this(DEFAULT_CAPACITY); - } - - /** - * Constructs an empty map with the specified initial capacity and default load factors. - * - * @param initialCapacity the initial capacity of the map. - * @throws IllegalArgumentException if the initial capacity is less than zero. - */ - QuickOpenIntIntHashMap(int initialCapacity) { - this(initialCapacity, DEFAULT_MIN_LOAD_FACTOR, DEFAULT_MAX_LOAD_FACTOR); - } - - /** - * Constructs an empty map with the specified initial capacity and the specified minimum and maximum load factor. - * - * @param initialCapacity the initial capacity. - * @param minLoadFactor the minimum load factor. - * @param maxLoadFactor the maximum load factor. - * @throws IllegalArgumentException if <tt>initialCapacity < 0 || (minLoadFactor < 0.0 || minLoadFactor >= 1.0) || - * (maxLoadFactor <= 0.0 || maxLoadFactor >= 1.0) || (minLoadFactor >= - * maxLoadFactor)</tt>. - */ - QuickOpenIntIntHashMap(int initialCapacity, double minLoadFactor, double maxLoadFactor) { - setUp(initialCapacity, minLoadFactor, maxLoadFactor); - } - - /** - * Associates the given key with the given value. Replaces any old <tt>(key,someOtherValue)</tt> association, if - * existing. - * - * @param key the key the value shall be associated with. - * @param value the value to be associated. - * @return <tt>true</tt> if the receiver did not already contain such a key; <tt>false</tt> if the receiver did - * already contain such a key - the new value has now replaced the formerly associated value. - */ - @Override - public boolean put(int key, int value) { - /* - This is open addressing with double hashing, using "Brent's variation". - Brent's variation slows insertions a bit down (not much) but reduces probes (collisions) for successful searches, - in particular for large load factors. - (It does not improve unsuccessful searches.) - See D. Knuth, Searching and Sorting, 3rd ed., p.533-545 - - h1(key) = hash % M - h2(key) = decrement = Max(1, hash/M % M) - M is prime = capacity = table.length - probing positions are table[(h1-j*h2) % M] for j=0,1,... - (M and h2 could also be chosen differently, but h2 is required to be relative prime to M.) - */ - - int[] tab = table; - byte[] stat = state; - int length = tab.length; - - int hash = HashFunctions.hash(key) & 0x7FFFFFFF; - int i = hash % length; - int decrement = (hash / length) % length; - if (decrement == 0) { - decrement = 1; - } - - // stop if we find a removed or free slot, or if we find the key itself - // do NOT skip over removed slots (yes, open addressing is like that...) - //int comp = comparisons; - int t = 0; // the number of probes - int p0 = i; // the first position to probe - while (stat[i] == FULL && tab[i] != key) { - t++; - i -= decrement; - //hashCollisions++; - if (i < 0) { - i += length; - } - } - if (stat[i] == FULL) { - // key already contained at slot i. - this.values[i] = value; - return false; - } - // not already contained, should be inserted at slot i. - - if (this.distinct > this.highWaterMark) { - int newCapacity = chooseGrowCapacity(this.distinct + 1, this.minLoadFactor, this.maxLoadFactor); - rehash(newCapacity); - return put(key, value); - } - - /* - Brent's variation does a local reorganization to reduce probes. It essentially means: - We test whether it is possible to move the association we probed first (table[p0]) out of the way. - If this is possible, it will reduce probes for the key to be inserted, since it takes its place; - it gets hit earlier. - However, future probes for the key that we move out of the way will increase. - Thus we only move it out of the way, if we have a net gain, that is, if we save more probes than we loose. - For the first probe we safe more than we loose if the number of probes we needed was >=2 (t>=2). - If the first probe cannot be moved out of the way, we try the next probe (p1). - Now we safe more than we loose if t>=3. - We repeat this until we find that we cannot gain or that we can indeed move p(x) out of the way. - - Note: Under the great majority of insertions t<=1, so the loop is entered very infrequently. - */ - while (t > 1) { - int key0 = tab[p0]; - hash = HashFunctions.hash(key0) & 0x7FFFFFFF; - decrement = (hash / length) % length; - if (decrement == 0) { - decrement = 1; - } - int pc = p0 - decrement; // pc = (p0-j*decrement) % M, j=1,2,.. - if (pc < 0) { - pc += length; - } - - if (stat[pc] != FREE) { // not a free slot, continue searching for free slot to move to, or break. - p0 = pc; - t--; - } else { // free or removed slot found, now move... - tab[pc] = key0; - stat[pc] = FULL; - values[pc] = values[p0]; - i = p0; // prepare to insert: table[p0]=key - t = 0; // break loop - } - } - - this.table[i] = key; - this.values[i] = value; - if (this.state[i] == FREE) { - this.freeEntries--; - } - this.state[i] = FULL; - this.distinct++; - - if (this.freeEntries < 1) { //delta - int newCapacity = chooseGrowCapacity(this.distinct + 1, this.minLoadFactor, this.maxLoadFactor); - rehash(newCapacity); - } - - return true; - } - - /** - * Rehashes the contents of the receiver into a new table with a smaller or larger capacity. This method is called - * automatically when the number of keys in the receiver exceeds the high water mark or falls below the low water - * mark. - */ - @Override - protected void rehash(int newCapacity) { - int oldCapacity = table.length; - //if (oldCapacity == newCapacity) return; - - int[] oldTable = table; - int[] oldValues = values; - byte[] oldState = state; - - int[] newTable = new int[newCapacity]; - int[] newValues = new int[newCapacity]; - byte[] newState = new byte[newCapacity]; - - this.lowWaterMark = chooseLowWaterMark(newCapacity, this.minLoadFactor); - this.highWaterMark = chooseHighWaterMark(newCapacity, this.maxLoadFactor); - - this.table = newTable; - this.values = newValues; - this.state = newState; - this.freeEntries = newCapacity - this.distinct; // delta - - int tmp = this.distinct; - this.distinct = Integer.MIN_VALUE; // switch of watermarks - for (int i = oldCapacity; i-- > 0;) { - if (oldState[i] == FULL) { - put(oldTable[i], oldValues[i]); - /* - int element = oldTable[i]; - int index = indexOfInsertion(element); - newTable[index]=element; - newValues[index]=oldValues[i]; - newState[index]=FULL; - */ - } - } - this.distinct = tmp; - } -} http://git-wip-us.apache.org/repos/asf/mahout/blob/e0573de3/math/src/main/java/org/apache/mahout/math/map/package-info.java ---------------------------------------------------------------------- diff --git a/math/src/main/java/org/apache/mahout/math/map/package-info.java b/math/src/main/java/org/apache/mahout/math/map/package-info.java deleted file mode 100644 index 9356f45..0000000 --- a/math/src/main/java/org/apache/mahout/math/map/package-info.java +++ /dev/null @@ -1,250 +0,0 @@ -/** - * <HTML> - * <BODY> - * Automatically growing and shrinking maps holding objects or primitive - * data types such as <tt>int</tt>, <tt>double</tt>, etc. Currently all maps are - * based upon hashing. - * <h2><a name="Overview"></a>1. Overview</h2> - * <p>The map package offers flexible object oriented abstractions modelling automatically - * resizing maps. It is designed to be scalable in terms of performance and memory - * requirements.</p> - * <p>Features include: </p> - * <p></p> - * <ul> - * <li>Maps operating on objects as well as all primitive data types such as <code>int</code>, - * <code>double</code>, etc. - * </li> - * <li>Compact representations</li> - * <li>Support for quick access to associations</li> - * <li>A number of general purpose map operations</li> - * </ul> - * <p>File-based I/O can be achieved through the standard Java built-in serialization - * mechanism. All classes implement the {@link java.io.Serializable} interface. - * However, the toolkit is entirely decoupled from advanced I/O. It provides data - * structures and algorithms only. - * <p> This toolkit borrows some terminology from the Javasoft <a - * href="http://www.javasoft.com/products/jdk/1.2/docs/guide/collections/index.html"> - * Collections framework</a> written by Josh Bloch and introduced in JDK 1.2. - * <h2>2. Introduction</h2> - * <p>A map is an associative container that manages a set of (key,value) pairs. - * It is useful for implementing a collection of one-to-one mappings. A (key,value) - * pair is called an <i>association</i>. A value can be looked up up via its key. - * Associations can quickly be set, removed and retrieved. They are stored in a - * hashing structure based on the hash code of their keys, which is obtained by - * using a hash function. </p> - * <p> A map can, for example, contain <tt>Name-->Location</tt> associations like - * <tt>{("Pete", "Geneva"), ("Steve", "Paris"), ("Robert", "New York")}</tt> used - * in address books or <tt>Index-->Value</tt> mappings like <tt>{(0, 100), (3, - * 1000), (100000, 70)}</tt> representing sparse lists or matrices. For example - * this could mean at index 0 we have a value of 100, at index 3 we have a value - * of 1000, at index 1000000 we have a value of 70, and at all other indexes we - * have a value of, say, zero. Another example is a map of IP addresses to domain - * names (DNS). Maps can also be useful to represent<i> multi sets</i>, that is, - * sets where elements can occur more than once. For multi sets one would have - * <tt>Value-->Frequency</tt> mappings like <tt>{(100, 1), (50, 1000), (101, 3))}</tt> - * meaning element 100 occurs 1 time, element 50 occurs 1000 times, element 101 - * occurs 3 times. Further, maps can also manage <tt>ObjectIdentifier-->Object</tt> - * mappings like <tt>{(12, obj1), (7, obj2), (10000, obj3), (9, obj4)}</tt> used - * in Object Databases. - * <p> A map cannot contain two or more <i>equal</i> keys; a key can map to at most - * one value. However, more than one key can map to identical values. For primitive - * data types "equality" of keys is defined as identity (operator <tt>==</tt>). - * For maps using <tt>Object</tt> keys, the meaning of "equality" can be specified - * by the user upon instance construction. It can either be defined to be identity - * (operator <tt>==</tt>) or to be given by the method {@link java.lang.Object#equals(Object)}. - * Associations of kind <tt>(AnyType,Object)</tt> can be of the form <tt>(AnyKey,null) - * </tt>, i.e. values can be <tt>null</tt>. - * <p> The classes of this package make no guarantees as to the order of the elements - * returned by iterators; in particular, they do not guarantee that the order will - * remain constant over time. - * <h2></h2> - * <h4>Copying</h4> - * <p> - * <p>Any map can be copied. A copy is <i>equal</i> to the original but entirely - * independent of the original. So changes in the copy are not reflected in the - * original, and vice-versa. - * <h2>3. Package organization</h2> - * <p>For most primitive data types and for objects there exists a separate map version. - * All versions are just the same, except that they operate on different data types. - * Colt includes two kinds of implementations for maps: The two different implementations - * are tagged <b>Chained</b> and <b>Open</b>. - * Note: Chained is no more included. Wherever it is mentioned it is of historic interest only.</p> - * <ul> - * <li><b>Chained</b> uses extendible separate chaining with chains holding unsorted - * dynamically linked collision lists. - * <li><b>Open</b> uses extendible open addressing with double hashing. - * </ul> - * <p>Class naming follows the schema <tt><Implementation><KeyType><ValueType>HashMap</tt>. - * For example, a {@link org.apache.mahout.math.map.OpenIntDoubleHashMap} holds <tt>(int-->double)</tt> - * associations and is implemented with open addressing. A {@link org.apache.mahout.math.map.OpenIntObjectHashMap} - * holds <tt>(int-->Object)</tt> associations and is implemented with open addressing. - * </p> - * <p>The classes for maps of a given (key,value) type are derived from a common - * abstract base class tagged <tt>Abstract<KeyType><ValueType></tt><tt>Map</tt>. - * For example, all maps operating on <tt>(int-->double)</tt> associations are - * derived from {@link org.apache.mahout.math.map.AbstractIntDoubleMap}, which in turn is derived - * from an abstract base class tying together all maps regardless of assocation - * type, {@link org.apache.mahout.math.set.AbstractSet}. The abstract base classes provide skeleton - * implementations for all but few methods. Experimental layouts (such as chaining, - * open addressing, extensible hashing, red-black-trees, etc.) can easily be implemented - * and inherit a rich set of functionality. Have a look at the javadoc <a href="package-tree.html">tree - * view</a> to get the broad picture.</p> - * <h2>4. Example usage</h2> - * <TABLE> - * <TD CLASS="PRE"> - * <PRE> - * int[] keys = {0 , 3 , 100000, 9 }; - * double[] values = {100.0, 1000.0, 70.0 , 71.0}; - * AbstractIntDoubleMap map = new OpenIntDoubleHashMap(); - * // add several associations - * for (int i=0; i < keys.length; i++) map.put(keys[i], values[i]); - * log.info("map="+map); - * log.info("size="+map.size()); - * log.info(map.containsKey(3)); - * log.info("get(3)="+map.get(3)); - * log.info(map.containsKey(4)); - * log.info("get(4)="+map.get(4)); - * log.info(map.containsValue(71.0)); - * log.info("keyOf(71.0)="+map.keyOf(71.0)); - * // remove one association - * map.removeKey(3); - * log.info("\nmap="+map); - * log.info(map.containsKey(3)); - * log.info("get(3)="+map.get(3)); - * log.info(map.containsValue(1000.0)); - * log.info("keyOf(1000.0)="+map.keyOf(1000.0)); - * // clear - * map.clear(); - * log.info("\nmap="+map); - * log.info("size="+map.size()); - * </PRE> - * </TD> - * </TABLE> - * yields the following output - * <TABLE> - * <TD CLASS="PRE"> - * <PRE> - * map=[0->100.0, 3->1000.0, 9->71.0, 100000->70.0] - * size=4 - * true - * get(3)=1000.0 - * false - * get(4)=0.0 - * true - * keyOf(71.0)=9 - * map=[0->100.0, 9->71.0, 100000->70.0] - * false - * get(3)=0.0 - * false - * keyOf(1000.0)=-2147483648 - * map=[] - * size=0 - * </PRE> - * </TD> - * </TABLE> - * <h2> 5. Notes </h2> - * <p> - * Note that implementations are not synchronized. - * <p> - * Choosing efficient parameters for hash maps is not always easy. - * However, since parameters determine efficiency and memory requirements, here is a quick guide how to choose them. - * If your use case does not heavily operate on hash maps but uses them just because they provide - * convenient functionality, you can safely skip this section. - * For those of you who care, read on. - * <p> - * There are three parameters that can be customized upon map construction: <tt>initialCapacity</tt>, - * <tt>minLoadFactor</tt> and <tt>maxLoadFactor</tt>. - * The more memory one can afford, the faster a hash map. - * The hash map's capacity is the maximum number of associations that can be added without needing to allocate new - * internal memory. - * A larger capacity means faster adding, searching and removing. - * The <tt>initialCapacity</tt> corresponds to the capacity used upon instance construction. - * <p> - * The <tt>loadFactor</tt> of a hash map measures the degree of "fullness". - * It is given by the number of assocations (<tt>size()</tt>) - * divided by the hash map capacity <tt>(0.0 <= loadFactor <= 1.0)</tt>. - * The more associations are added, the larger the loadFactor and the more hash map performance degrades. - * Therefore, when the loadFactor exceeds a customizable threshold (<tt>maxLoadFactor</tt>), the hash map is - * automatically grown. - * In such a way performance degradation can be avoided. - * Similarly, when the loadFactor falls below a customizable threshold (<tt>minLoadFactor</tt>), the hash map is - * automatically shrinked. - * In such a way excessive memory consumption can be avoided. - * Automatic resizing (both growing and shrinking) obeys the following invariant: - * <p> - * <tt>capacity * minLoadFactor <= size() <= capacity * maxLoadFactor</tt> - * <p> The term <tt>capacity * minLoadFactor</tt> is called the <i>low water mark</i>, - * <tt>capacity * maxLoadFactor</tt> is called the <i>high water mark</i>. In other - * words, the number of associations may vary within the water mark constraints. - * When it goes out of range, the map is automatically resized and memory consumption - * changes proportionally. - * <ul> - * <li>To tune for memory at the expense of performance, both increase <tt>minLoadFactor</tt> and - * <tt>maxLoadFactor</tt>. - * <li>To tune for performance at the expense of memory, both decrease <tt>minLoadFactor</tt> and - * <tt>maxLoadFactor</tt>. - * As as special case set <tt>minLoadFactor=0</tt> to avoid any automatic shrinking. - * </ul> - * Resizing large hash maps can be time consuming, <tt>O(size())</tt>, and should be avoided if possible (maintaining - * primes is not the reason). - * Unnecessary growing operations can be avoided if the number of associations is known before they are added, or can be - * estimated.<p> - * In such a case good parameters are as follows: - * <p> - * <i>For chaining:</i> - * <br>Set the <tt>initialCapacity = 1.4*expectedSize</tt> or greater. - * <br>Set the <tt>maxLoadFactor = 0.8</tt> or greater. - * <p> - * <i>For open addressing:</i> - * <br>Set the <tt>initialCapacity = 2*expectedSize</tt> or greater. Alternatively call <tt>ensureCapacity(...)</tt>. - * <br>Set the <tt>maxLoadFactor = 0.5</tt>. - * <br>Never set <tt>maxLoadFactor > 0.55</tt>; open addressing exponentially slows down beyond that point. - * <p> - * In this way the hash map will never need to grow and still stay fast. - * It is never a good idea to set <tt>maxLoadFactor < 0.1</tt>, - * because the hash map would grow too often. - * If it is entirelly unknown how many associations the application will use, - * the default constructor should be used. The map will grow and shrink as needed. - * <p> - * <b>Comparision of chaining and open addressing</b> - * <p> Chaining is faster than open addressing, when assuming unconstrained memory - * consumption. Open addressing is more space efficient than chaining, because - * it does not create entry objects but uses primitive arrays which are considerably - * smaller. Entry objects consume significant amounts of memory compared to the - * information they actually hold. Open addressing also poses no problems to the - * garbage collector. In contrast, chaining can create millions of entry objects - * which are linked; a nightmare for any garbage collector. In addition, entry - * object creation is a bit slow. <br> - * Therefore, with the same amount of memory, or even less memory, hash maps with - * larger capacity can be maintained under open addressing, which yields smaller - * loadFactors, which in turn keeps performance competitive with chaining. In our - * benchmarks, using significantly less memory, open addressing usually is not - * more than 1.2-1.5 times slower than chaining. - * <p><b>Further readings</b>: - * <br>Knuth D., The Art of Computer Programming: Searching and Sorting, 3rd ed. - * <br>Griswold W., Townsend G., The Design and Implementation of Dynamic Hashing for Sets and Tables in Icon, - * Software - Practice and Experience, Vol. 23(4), 351-367 (April 1993). - * <br>Larson P., Dynamic hash tables, Comm. of the ACM, 31, (4), 1988. - * <p> - * <b>Performance:</b> - * <p> - * Time complexity: - * <br>The classes offer <i>expected</i> time complexity <tt>O(1)</tt> (i.e. constant time) for the basic operations - * <tt>put</tt>, <tt>get</tt>, <tt>removeKey</tt>, <tt>containsKey</tt> and <tt>size</tt>, - * assuming the hash function disperses the elements properly among the buckets. - * Otherwise, pathological cases, although highly improbable, can occur, degrading performance to <tt>O(N)</tt> in the - * worst case. - * Operations <tt>containsValue</tt> and <tt>keyOf</tt> are <tt>O(N)</tt>. - * <p> - * Memory requirements for <i>open addressing</i>: - * <br>worst case: <tt>memory [bytes] = (1/minLoadFactor) * size() * (1 + sizeOf(key) + sizeOf(value))</tt>. - * <br>best case: <tt>memory [bytes] = (1/maxLoadFactor) * size() * (1 + sizeOf(key) + sizeOf(value))</tt>. - * Where <tt>sizeOf(int) = 4</tt>, <tt>sizeOf(double) = 8</tt>, <tt>sizeOf(Object) = 4</tt>, etc. - * Thus, an <tt>OpenIntIntHashMap</tt> with minLoadFactor=0.25 and maxLoadFactor=0.5 and 1000000 associations uses - * between 17 MB and 34 MB. - * The same map with 1000 associations uses between 17 and 34 KB. - * <p> - * </BODY> - * </HTML> - */ -package org.apache.mahout.math.map; http://git-wip-us.apache.org/repos/asf/mahout/blob/e0573de3/math/src/main/java/org/apache/mahout/math/package-info.java ---------------------------------------------------------------------- diff --git a/math/src/main/java/org/apache/mahout/math/package-info.java b/math/src/main/java/org/apache/mahout/math/package-info.java deleted file mode 100644 index de664f0..0000000 --- a/math/src/main/java/org/apache/mahout/math/package-info.java +++ /dev/null @@ -1,4 +0,0 @@ -/** - * Core base classes; Operations on primitive arrays such as sorting, partitioning and permuting. - */ -package org.apache.mahout.math; http://git-wip-us.apache.org/repos/asf/mahout/blob/e0573de3/math/src/main/java/org/apache/mahout/math/random/AbstractSamplerFunction.java ---------------------------------------------------------------------- diff --git a/math/src/main/java/org/apache/mahout/math/random/AbstractSamplerFunction.java b/math/src/main/java/org/apache/mahout/math/random/AbstractSamplerFunction.java deleted file mode 100644 index d657fd9..0000000 --- a/math/src/main/java/org/apache/mahout/math/random/AbstractSamplerFunction.java +++ /dev/null @@ -1,39 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.mahout.math.random; - -import org.apache.mahout.math.function.DoubleFunction; - -/** - * This shim allows samplers to be used to initialize vectors. - */ -public abstract class AbstractSamplerFunction extends DoubleFunction implements Sampler<Double> { - /** - * Apply the function to the argument and return the result - * - * @param ignored Ignored argument - * @return A sample from this distribution. - */ - @Override - public double apply(double ignored) { - return sample(); - } - - @Override - public abstract Double sample(); -} http://git-wip-us.apache.org/repos/asf/mahout/blob/e0573de3/math/src/main/java/org/apache/mahout/math/random/ChineseRestaurant.java ---------------------------------------------------------------------- diff --git a/math/src/main/java/org/apache/mahout/math/random/ChineseRestaurant.java b/math/src/main/java/org/apache/mahout/math/random/ChineseRestaurant.java deleted file mode 100644 index 8127b92..0000000 --- a/math/src/main/java/org/apache/mahout/math/random/ChineseRestaurant.java +++ /dev/null @@ -1,111 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.mahout.math.random; - -import com.google.common.base.Preconditions; -import org.apache.mahout.common.RandomUtils; -import org.apache.mahout.math.list.DoubleArrayList; - -import java.util.Random; - -/** - * - * Generates samples from a generalized Chinese restaurant process (or Pittman-Yor process). - * - * The number of values drawn exactly once will asymptotically be equal to the discount parameter - * as the total number of draws T increases without bound. The number of unique values sampled will - * increase as O(alpha * log T) if discount = 0 or O(alpha * T^discount) for discount > 0. - */ -public final class ChineseRestaurant implements Sampler<Integer> { - - private final double alpha; - private double weight = 0; - private double discount = 0; - private final DoubleArrayList weights = new DoubleArrayList(); - private final Random rand = RandomUtils.getRandom(); - - /** - * Constructs a Dirichlet process sampler. This is done by setting discount = 0. - * @param alpha The strength parameter for the Dirichlet process. - */ - public ChineseRestaurant(double alpha) { - this(alpha, 0); - } - - /** - * Constructs a Pitman-Yor sampler. - * - * @param alpha The strength parameter that drives the number of unique values as a function of draws. - * @param discount The discount parameter that drives the percentage of values that occur once in a large sample. - */ - public ChineseRestaurant(double alpha, double discount) { - Preconditions.checkArgument(alpha > 0, "Strength Parameter, alpha must be greater then 0!"); - Preconditions.checkArgument(discount >= 0 && discount <= 1, "Must be: 0 <= discount <= 1"); - this.alpha = alpha; - this.discount = discount; - } - - @Override - public Integer sample() { - double u = rand.nextDouble() * (alpha + weight); - for (int j = 0; j < weights.size(); j++) { - // select existing options with probability (w_j - d) / (alpha + w) - if (u < weights.get(j) - discount) { - weights.set(j, weights.get(j) + 1); - weight++; - return j; - } else { - u -= weights.get(j) - discount; - } - } - - // if no existing item selected, pick new item with probability (alpha - d*t) / (alpha + w) - // where t is number of pre-existing cases - weights.add(1); - weight++; - return weights.size() - 1; - } - - /** - * @return the number of unique values that have been returned. - */ - public int size() { - return weights.size(); - } - - /** - * @return the number draws so far. - */ - public int count() { - return (int) weight; - } - - /** - * @param j Which value to test. - * @return The number of times that j has been returned so far. - */ - public int count(int j) { - Preconditions.checkArgument(j >= 0); - - if (j < weights.size()) { - return (int) weights.get(j); - } else { - return 0; - } - } -} http://git-wip-us.apache.org/repos/asf/mahout/blob/e0573de3/math/src/main/java/org/apache/mahout/math/random/Empirical.java ---------------------------------------------------------------------- diff --git a/math/src/main/java/org/apache/mahout/math/random/Empirical.java b/math/src/main/java/org/apache/mahout/math/random/Empirical.java deleted file mode 100644 index 78bfec5..0000000 --- a/math/src/main/java/org/apache/mahout/math/random/Empirical.java +++ /dev/null @@ -1,124 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.mahout.math.random; - -import com.google.common.base.Preconditions; -import org.apache.mahout.common.RandomUtils; - -import java.util.Random; - -/** - * Samples from an empirical cumulative distribution. - */ -public final class Empirical extends AbstractSamplerFunction { - private final Random gen; - private final boolean exceedMinimum; - private final boolean exceedMaximum; - - private final double[] x; - private final double[] y; - private final int n; - - /** - * Sets up a sampler for a specified empirical cumulative distribution function. The distribution - * can have optional exponential tails on either or both ends, but otherwise does a linear - * interpolation between known points. - * - * @param exceedMinimum Should we generate samples less than the smallest quantile (i.e. generate a left tail)? - * @param exceedMaximum Should we generate samples greater than the largest observed quantile (i.e. generate a right - * tail)? - * @param samples The number of samples observed to get the quantiles. - * @param ecdf Alternating values that represent which percentile (in the [0..1] range) - * and values. For instance, if you have the min, median and max of 1, 3, 10 - * you should pass 0.0, 1, 0.5, 3, 1.0, 10. Note that the list must include - * the 0-th (1.0-th) quantile if the left (right) tail is not allowed. - */ - public Empirical(boolean exceedMinimum, boolean exceedMaximum, int samples, double... ecdf) { - Preconditions.checkArgument(ecdf.length % 2 == 0, "ecdf must have an even count of values"); - Preconditions.checkArgument(samples >= 3, "Sample size must be >= 3"); - - // if we can't exceed the observed bounds, then we have to be given the bounds. - Preconditions.checkArgument(exceedMinimum || ecdf[0] == 0); - Preconditions.checkArgument(exceedMaximum || ecdf[ecdf.length - 2] == 1); - - gen = RandomUtils.getRandom(); - - n = ecdf.length / 2; - x = new double[n]; - y = new double[n]; - - double lastX = ecdf[1]; - double lastY = ecdf[0]; - for (int i = 0; i < ecdf.length; i += 2) { - // values have to be monotonic increasing - Preconditions.checkArgument(i == 0 || ecdf[i + 1] > lastY); - y[i / 2] = ecdf[i + 1]; - lastY = y[i / 2]; - - // quantiles have to be in [0,1] and be monotonic increasing - Preconditions.checkArgument(ecdf[i] >= 0 && ecdf[i] <= 1); - Preconditions.checkArgument(i == 0 || ecdf[i] > lastX); - - x[i / 2] = ecdf[i]; - lastX = x[i / 2]; - } - - // squeeze a bit to allow for unobserved tails - double x0 = exceedMinimum ? 0.5 / samples : 0; - double x1 = 1 - (exceedMaximum ? 0.5 / samples : 0); - for (int i = 0; i < n; i++) { - x[i] = x[i] * (x1 - x0) + x0; - } - - this.exceedMinimum = exceedMinimum; - this.exceedMaximum = exceedMaximum; - } - - @Override - public Double sample() { - return sample(gen.nextDouble()); - } - - public double sample(double u) { - if (exceedMinimum && u < x[0]) { - // generate from left tail - if (u == 0) { - u = 1.0e-16; - } - return y[0] + Math.log(u / x[0]) * x[0] * (y[1] - y[0]) / (x[1] - x[0]); - } else if (exceedMaximum && u > x[n - 1]) { - if (u == 1) { - u = 1 - 1.0e-16; - } - // generate from right tail - double dy = y[n - 1] - y[n - 2]; - double dx = x[n - 1] - x[n - 2]; - return y[n - 1] - Math.log((1 - u) / (1 - x[n - 1])) * (1 - x[n - 1]) * dy / dx; - } else { - // linear interpolation - for (int i = 1; i < n; i++) { - if (x[i] > u) { - double dy = y[i] - y[i - 1]; - double dx = x[i] - x[i - 1]; - return y[i - 1] + (u - x[i - 1]) * dy / dx; - } - } - throw new RuntimeException(String.format("Can't happen (%.3f is not in [%.3f,%.3f]", u, x[0], x[n - 1])); - } - } -} http://git-wip-us.apache.org/repos/asf/mahout/blob/e0573de3/math/src/main/java/org/apache/mahout/math/random/IndianBuffet.java ---------------------------------------------------------------------- diff --git a/math/src/main/java/org/apache/mahout/math/random/IndianBuffet.java b/math/src/main/java/org/apache/mahout/math/random/IndianBuffet.java deleted file mode 100644 index 27b5d84..0000000 --- a/math/src/main/java/org/apache/mahout/math/random/IndianBuffet.java +++ /dev/null @@ -1,157 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.mahout.math.random; - -import com.google.common.base.CharMatcher; -import com.google.common.base.Charsets; -import com.google.common.base.Splitter; -import com.google.common.collect.Iterables; -import com.google.common.collect.Lists; -import com.google.common.io.LineProcessor; -import com.google.common.io.Resources; -import org.apache.mahout.common.RandomUtils; - -import java.io.IOException; -import java.util.List; -import java.util.Random; - -/** - * Samples a "document" from an IndianBuffet process. - * - * See http://mlg.eng.cam.ac.uk/zoubin/talks/turin09.pdf for details - */ -public final class IndianBuffet<T> implements Sampler<List<T>> { - private final List<Integer> count = Lists.newArrayList(); - private int documents = 0; - private final double alpha; - private WordFunction<T> converter = null; - private final Random gen; - - public IndianBuffet(double alpha, WordFunction<T> converter) { - this.alpha = alpha; - this.converter = converter; - gen = RandomUtils.getRandom(); - } - - public static IndianBuffet<Integer> createIntegerDocumentSampler(double alpha) { - return new IndianBuffet<>(alpha, new IdentityConverter()); - } - - public static IndianBuffet<String> createTextDocumentSampler(double alpha) { - return new IndianBuffet<>(alpha, new WordConverter()); - } - - @Override - public List<T> sample() { - List<T> r = Lists.newArrayList(); - if (documents == 0) { - double n = new PoissonSampler(alpha).sample(); - for (int i = 0; i < n; i++) { - r.add(converter.convert(i)); - count.add(1); - } - documents++; - } else { - documents++; - int i = 0; - for (double cnt : count) { - if (gen.nextDouble() < cnt / documents) { - r.add(converter.convert(i)); - count.set(i, count.get(i) + 1); - } - i++; - } - int newItems = new PoissonSampler(alpha / documents).sample().intValue(); - for (int j = 0; j < newItems; j++) { - r.add(converter.convert(i + j)); - count.add(1); - } - } - return r; - } - - private interface WordFunction<T> { - T convert(int i); - } - - /** - * Just converts to an integer. - */ - public static class IdentityConverter implements WordFunction<Integer> { - @Override - public Integer convert(int i) { - return i; - } - } - - /** - * Converts to a string. - */ - public static class StringConverter implements WordFunction<String> { - @Override - public String convert(int i) { - return String.valueOf(i); - } - } - - /** - * Converts to one of a list of common English words for reasonably small integers and converts - * to a token like w_92463 for big integers. - */ - public static final class WordConverter implements WordFunction<String> { - private final Splitter onSpace = Splitter.on(CharMatcher.WHITESPACE).omitEmptyStrings().trimResults(); - private final List<String> words; - - public WordConverter() { - try { - words = Resources.readLines(Resources.getResource("words.txt"), Charsets.UTF_8, - new LineProcessor<List<String>>() { - private final List<String> theWords = Lists.newArrayList(); - - @Override - public boolean processLine(String line) { - Iterables.addAll(theWords, onSpace.split(line)); - return true; - } - - @Override - public List<String> getResult() { - return theWords; - } - }); - } catch (IOException e) { - throw new ImpossibleException(e); - } - } - - @Override - public String convert(int i) { - if (i < words.size()) { - return words.get(i); - } else { - return "w_" + i; - } - } - } - - public static class ImpossibleException extends RuntimeException { - public ImpossibleException(Throwable e) { - super(e); - } - } -} http://git-wip-us.apache.org/repos/asf/mahout/blob/e0573de3/math/src/main/java/org/apache/mahout/math/random/Missing.java ---------------------------------------------------------------------- diff --git a/math/src/main/java/org/apache/mahout/math/random/Missing.java b/math/src/main/java/org/apache/mahout/math/random/Missing.java deleted file mode 100644 index 8141a71..0000000 --- a/math/src/main/java/org/apache/mahout/math/random/Missing.java +++ /dev/null @@ -1,59 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.mahout.math.random; - -import java.util.Random; - -import org.apache.mahout.common.RandomUtils; - -/** - * Models data with missing values. Note that all variables with the same fraction of missing - * values will have the same sequence of missing values. Similarly, if two variables have - * missing probabilities of p1 > p2, then all of the p2 missing values will also be missing for - * p1. - */ -public final class Missing<T> implements Sampler<T> { - private final Random gen; - private final double p; - private final Sampler<T> delegate; - private final T missingMarker; - - public Missing(int seed, double p, Sampler<T> delegate, T missingMarker) { - this.p = p; - this.delegate = delegate; - this.missingMarker = missingMarker; - gen = RandomUtils.getRandom(seed); - } - - public Missing(double p, Sampler<T> delegate, T missingMarker) { - this(1, p, delegate, missingMarker); - } - - public Missing(double p, Sampler<T> delegate) { - this(1, p, delegate, null); - } - - @Override - public T sample() { - if (gen.nextDouble() >= p) { - return delegate.sample(); - } else { - return missingMarker; - } - } -} http://git-wip-us.apache.org/repos/asf/mahout/blob/e0573de3/math/src/main/java/org/apache/mahout/math/random/MultiNormal.java ---------------------------------------------------------------------- diff --git a/math/src/main/java/org/apache/mahout/math/random/MultiNormal.java b/math/src/main/java/org/apache/mahout/math/random/MultiNormal.java deleted file mode 100644 index 748d4e8..0000000 --- a/math/src/main/java/org/apache/mahout/math/random/MultiNormal.java +++ /dev/null @@ -1,118 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.mahout.math.random; - -import org.apache.mahout.common.RandomUtils; -import org.apache.mahout.math.DenseVector; -import org.apache.mahout.math.DiagonalMatrix; -import org.apache.mahout.math.Matrix; -import org.apache.mahout.math.Vector; -import org.apache.mahout.math.function.DoubleFunction; - -import java.util.Random; - -/** - * Samples from a multi-variate normal distribution. - * <p/> - * This is done by sampling from several independent unit normal distributions to get a vector u. - * The sample value that is returned is then A u + m where A is derived from the covariance matrix - * and m is the mean of the result. - * <p/> - * If \Sigma is the desired covariance matrix, then you can use any value of A such that A' A = - * \Sigma. The Cholesky decomposition can be used to compute A if \Sigma is positive definite. - * Slightly more expensive is to use the SVD U S V' = \Sigma and then set A = U \sqrt{S}. - * - * Useful special cases occur when \Sigma is diagonal so that A = \sqrt(\Sigma) or where \Sigma = r I. - * - * Another special case is where m = 0. - */ -public class MultiNormal implements Sampler<Vector> { - private final Random gen; - private final int dimension; - private final Matrix scale; - private final Vector mean; - - /** - * Constructs a sampler with diagonal scale matrix. - * @param diagonal The diagonal elements of the scale matrix. - */ - public MultiNormal(Vector diagonal) { - this(new DiagonalMatrix(diagonal), null); - } - - /** - * Constructs a sampler with diagonal scale matrix and (potentially) - * non-zero mean. - * @param diagonal The scale matrix's principal diagonal. - * @param mean The desired mean. Set to null if zero mean is desired. - */ - public MultiNormal(Vector diagonal, Vector mean) { - this(new DiagonalMatrix(diagonal), mean); - } - - /** - * Constructs a sampler with non-trivial scale matrix and mean. - */ - public MultiNormal(Matrix a, Vector mean) { - this(a, mean, a.columnSize()); - } - - public MultiNormal(int dimension) { - this(null, null, dimension); - } - - public MultiNormal(double radius, Vector mean) { - this(new DiagonalMatrix(radius, mean.size()), mean); - } - - private MultiNormal(Matrix scale, Vector mean, int dimension) { - gen = RandomUtils.getRandom(); - this.dimension = dimension; - this.scale = scale; - this.mean = mean; - } - - @Override - public Vector sample() { - Vector v = new DenseVector(dimension).assign( - new DoubleFunction() { - @Override - public double apply(double ignored) { - return gen.nextGaussian(); - } - } - ); - if (mean != null) { - if (scale != null) { - return scale.times(v).plus(mean); - } else { - return v.plus(mean); - } - } else { - if (scale != null) { - return scale.times(v); - } else { - return v; - } - } - } - - public Vector getScale() { - return mean; - } -}
