http://git-wip-us.apache.org/repos/asf/mahout/blob/49ad8cb4/core/src/main/java/org/apache/mahout/math/list/package-info.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/list/package-info.java b/core/src/main/java/org/apache/mahout/math/list/package-info.java deleted file mode 100644 index 43b5c4b..0000000 --- a/core/src/main/java/org/apache/mahout/math/list/package-info.java +++ /dev/null @@ -1,144 +0,0 @@ -/** - * <HTML> - * <BODY> - * Resizable lists holding objects or primitive data types such as <tt>int</tt>, - * <tt>double</tt>, etc. For non-resizable lists (1-dimensional matrices) see - * package <code>org.apache.mahout.math.matrix</code>.<p></p> - * <h1><a name="Overview"></a>Getting Started</h1> - * <h2>1. Overview</h2> - * <p>The list package offers flexible object oriented abstractions modelling dynamically - * resizing lists holding objects or primitive data types such as <tt>int</tt>, - * <tt>double</tt>, etc. It is designed to be scalable in terms of performance - * and memory requirements.</p> - * <p>Features include: </p> - * <p></p> - * <ul> - * <li>Lists operating on objects as well as all primitive data types such as <tt>int</tt>, - * <tt>double</tt>, etc. - * </li> - * <li>Compact representations</li> - * <li>A number of general purpose list operations including: adding, inserting, - * removing, iterating, searching, sorting, extracting ranges and copying. All - * operations are designed to perform well on mass data. - * </li> - * <li>Support for quick access to list elements. This is achieved by bounds-checking - * and non-bounds-checking accessor methods as well as zero-copy transformations - * to primitive arrays such as <tt>int[]</tt>, <tt>double[]</tt>, etc. - * </li> - * <li>Allows to use high level algorithms on primitive data types without any - * space and time overhead. Operations on primitive arrays, Colt lists and JAL - * algorithms can freely be mixed at zero copy overhead. - * </li> - * </ul> - * <p>File-based I/O can be achieved through the standard Java built-in serialization - * mechanism. All classes implement the {@link java.io.Serializable} interface. - * However, the toolkit is entirely decoupled from advanced I/O. It provides data - * structures and algorithms only. - * <p> This toolkit borrows concepts and terminology from the Javasoft <a - * href="http://www.javasoft.com/products/jdk/1.2/docs/guide/collections/index.html"> - * Collections framework</a> written by Josh Bloch and introduced in JDK 1.2. - * <h2>2. Introduction</h2> - * <p>Lists are fundamental to virtually any application. Large scale resizable lists - * are, for example, used in scientific computations, simulations database management - * systems, to name just a few.</p> - * <h2></h2> - * <p>A list is a container holding elements that can be accessed via zero-based - * indexes. Lists may be implemented in different ways (most commonly with arrays). - * A resizable list automatically grows as elements are added. The lists of this - * package do not automatically shrink. Shrinking needs to be triggered by explicitly - * calling <tt>trimToSize()</tt> methods.</p> - * <p><i>Growing policy</i>: A list implemented with arrays initially has a certain - * <tt>initialCapacity</tt> - per default 10 elements, but customizable upon instance - * construction. As elements are added, this capacity may nomore be sufficient. - * When a list is automatically grown, its capacity is expanded to <tt>1.5*currentCapacity</tt>. - * Thus, excessive resizing (involving copying) is avoided.</p> - * <h4>Copying</h4> - * <p> - * <p>Any list can be copied. A copy is <i>equal</i> to the original but entirely - * independent of the original. So changes in the copy are not reflected in the - * original, and vice-versa. - * <h2>3. Organization of this package</h2> - * <p>Class naming follows the schema <tt><ElementType><ImplementationTechnique>List</tt>. - * For example, we have a {@link org.apache.mahout.math.list.DoubleArrayList}, which is a list - * holding <tt>double</tt> elements implemented with <tt>double</tt>[] arrays. - * </p> - * <p>The classes for lists of a given value type are derived from a common abstract - * base class tagged <tt>Abstract<ElementType></tt><tt>List</tt>. For example, - * all lists operating on <tt>double</tt> elements are derived from - * {@link org.apache.mahout.math.list.AbstractDoubleList}, - * which in turn is derived from an abstract base class tying together all lists - * regardless of value type, {@link org.apache.mahout.math.list.AbstractList}. The abstract - * base classes provide skeleton implementations for all but few methods. Experimental - * data layouts (such as compressed, sparse, linked, etc.) can easily be implemented - * and inherit a rich set of functionality. Have a look at the javadoc <a href="package-tree.html">tree - * view</a> to get the broad picture.</p> - * <h2>4. Example usage</h2> - * <p>The following snippet fills a list, randomizes it, extracts the first half - * of the elements, sums them up and prints the result. It is implemented entirely - * with accessor methods.</p> - * <table> - * <td class="PRE"> - * <pre> - * int s = 1000000;<br>AbstractDoubleList list = new DoubleArrayList(); - * for (int i=0; i<s; i++) { list.add((double)i); } - * list.shuffle(); - * AbstractDoubleList part = list.partFromTo(0,list.size()/2 - 1); - * double sum = 0.0; - * for (int i=0; i<part.size(); i++) { sum += part.get(i); } - * log.info(sum); - * </pre> - * </td> - * </table> - * <p> For efficiency, all classes provide back doors to enable getting/setting the - * backing array directly. In this way, the high level operations of these classes - * can be used where appropriate, and one can switch to <tt>[]</tt>-array index - * notations where necessary. The key methods for this are <tt>public <ElementType>[] - * elements()</tt> and <tt>public void elements(<ElementType>[])</tt>. The - * former trustingly returns the array it internally keeps to store the elements. - * Holding this array in hand, we can use the <tt>[]</tt>-array operator to - * perform iteration over large lists without needing to copy the array or paying - * the performance penalty introduced by accessor methods. Alternatively any JAL - * algorithm (or other algorithm) can operate on the returned primitive array. - * The latter method forces a list to internally hold a user provided array. Using - * this approach one can avoid needing to copy the elements into the list. - * <p>As a consequence, operations on primitive arrays, Colt lists and JAL algorithms - * can freely be mixed at zero-copy overhead. - * <p> Note that such special treatment certainly breaks encapsulation. This functionality - * is provided for performance reasons only and should only be used when absolutely - * necessary. Here is the above example in mixed notation: - * <table> - * <td class="PRE"> - * <pre> - * int s = 1000000;<br>DoubleArrayList list = new DoubleArrayList(s); // list.size()==0, capacity==s - * list.setSize(s); // list.size()==s<br>double[] values = list.elements(); - * // zero copy, values.length==s<br>for (int i=0; i<s; i++) { values[i]=(double)i; } - * list.shuffle(); - * double sum = 0.0; - * int limit = values.length/2; - * for (int i=0; i<limit; i++) { sum += values[i]; } - * log.info(sum); - * </pre> - * </td> - * </table> - * <p> Or even more compact using lists as algorithm objects: - * <table> - * <td class="PRE"> - * <pre> - * int s = 1000000;<br>double[] values = new double[s]; - * for (int i=0; i<s; i++) { values[i]=(double)i; } - * new DoubleArrayList(values).shuffle(); // zero-copy, shuffle via back door - * double sum = 0.0; - * int limit = values.length/2; - * for (int i=0; i<limit; i++) { sum += values[i]; } - * log.info(sum); - * </pre> - * </td> - * </table> - * <p> - * <h2>5. Notes </h2> - * <p>The quicksorts and mergesorts are the JDK 1.2 V1.26 algorithms, modified as - * necessary to operate on the given data types. - * </BODY> - * </HTML> - */ -package org.apache.mahout.math.list;
http://git-wip-us.apache.org/repos/asf/mahout/blob/49ad8cb4/core/src/main/java/org/apache/mahout/math/map/HashFunctions.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/map/HashFunctions.java b/core/src/main/java/org/apache/mahout/math/map/HashFunctions.java deleted file mode 100644 index b749307..0000000 --- a/core/src/main/java/org/apache/mahout/math/map/HashFunctions.java +++ /dev/null @@ -1,115 +0,0 @@ -/* -Copyright 1999 CERN - European Organization for Nuclear Research. -Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose -is hereby granted without fee, provided that the above copyright notice appear in all copies and -that both that copyright notice and this permission notice appear in supporting documentation. -CERN makes no representations about the suitability of this software for any purpose. -It is provided "as is" without expressed or implied warranty. -*/ -package org.apache.mahout.math.map; - - -/** - * Provides various hash functions. - */ -public final class HashFunctions { - - /** - * Utility class pattern: all static members, no inheritance. - */ - private HashFunctions() { - } - - /** - * Returns a hashcode for the specified value. - * - * @return a hash code value for the specified value. - */ - public static int hash(char value) { - return value; - } - - /** - * Returns a hashcode for the specified value. - * - * @return a hash code value for the specified value. - */ - public static int hash(double value) { - long bits = Double.doubleToLongBits(value); - return (int) (bits ^ (bits >>> 32)); - - //return (int) Double.doubleToLongBits(value*663608941.737); - // this avoids excessive hashCollisions in the case values are of the form (1.0, 2.0, 3.0, ...) - } - - /** - * Returns a hashcode for the specified value. - * - * @return a hash code value for the specified value. - */ - public static int hash(float value) { - return Float.floatToIntBits(value * 663608941.737f); - // this avoids excessive hashCollisions in the case values are of the form (1.0, 2.0, 3.0, ...) - } - - /** - * Returns a hashcode for the specified value. - * The hashcode computation is similar to the last step - * of MurMurHash3. - * - * @return a hash code value for the specified value. - */ - public static int hash(int value) { - int h = value; - h ^= h >>> 16; - h *= 0x85ebca6b; - h ^= h >>> 13; - h *= 0xc2b2ae35; - h ^= h >>> 16; - return h; - } - - /** - * Returns a hashcode for the specified value. - * - * @return a hash code value for the specified value. - */ - public static int hash(long value) { - return (int) (value ^ (value >> 32)); - /* - value &= 0x7FFFFFFFFFFFFFFFL; // make it >=0 (0x7FFFFFFFFFFFFFFFL==Long.MAX_VALUE) - int hashCode = 0; - do hashCode = 31*hashCode + (int) (value%10); - while ((value /= 10) > 0); - - return 28629151*hashCode; // spread even further; h*31^5 - */ - } - - /** - * Returns a hashcode for the specified object. - * - * @return a hash code value for the specified object. - */ - public static int hash(Object object) { - return object == null ? 0 : object.hashCode(); - } - - /** - * Returns a hashcode for the specified value. - * - * @return a hash code value for the specified value. - */ - public static int hash(short value) { - return value; - } - - /** - * Returns a hashcode for the specified value. - * - * @return a hash code value for the specified value. - */ - public static int hash(boolean value) { - return value ? 1231 : 1237; - } -} http://git-wip-us.apache.org/repos/asf/mahout/blob/49ad8cb4/core/src/main/java/org/apache/mahout/math/map/OpenHashMap.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/map/OpenHashMap.java b/core/src/main/java/org/apache/mahout/math/map/OpenHashMap.java deleted file mode 100644 index 271abc1..0000000 --- a/core/src/main/java/org/apache/mahout/math/map/OpenHashMap.java +++ /dev/null @@ -1,654 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -/* -Copyright � 1999 CERN - European Organization for Nuclear Research. -Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose -is hereby granted without fee, provided that the above copyright notice appear in all copies and -that both that copyright notice and this permission notice appear in supporting documentation. -CERN makes no representations about the suitability of this software for any purpose. -It is provided "as is" without expressed or implied warranty. -*/ -package org.apache.mahout.math.map; - -import java.util.ArrayList; -import java.util.Arrays; -import java.util.Collection; -import java.util.List; -import java.util.Map; -import java.util.Set; -// Error building Javadocs if java.util.Map.Entry not explictly included... (?) -import java.util.Map.Entry; - -import org.apache.mahout.math.function.ObjectObjectProcedure; -import org.apache.mahout.math.function.ObjectProcedure; -import org.apache.mahout.math.set.AbstractSet; -import org.apache.mahout.math.set.OpenHashSet; - -/** - * Open hash map. This implements Map, but it does not respect several aspects of the Map contract - * that impose the very sorts of performance penalities that this class exists to avoid. - * {@link #entrySet}, {@link #values}, and {@link #keySet()} do <strong>not</strong> return - * collections that share storage with the main map, and changes to those returned objects - * are <strong>not</strong> reflected in the container. - **/ -public class OpenHashMap<K,V> extends AbstractSet implements Map<K,V> { - protected static final byte FREE = 0; - protected static final byte FULL = 1; - protected static final byte REMOVED = 2; - protected static final Object NO_KEY_VALUE = null; - - /** The hash table keys. */ - protected Object[] table; - - /** The hash table values. */ - protected Object[] values; - - /** The state of each hash table entry (FREE, FULL, REMOVED). */ - protected byte[] state; - - /** The number of table entries in state==FREE. */ - protected int freeEntries; - - - /** Constructs an empty map with default capacity and default load factors. */ - public OpenHashMap() { - this(DEFAULT_CAPACITY); - } - - /** - * Constructs an empty map with the specified initial capacity and default load factors. - * - * @param initialCapacity the initial capacity of the map. - * @throws IllegalArgumentException if the initial capacity is less than zero. - */ - public OpenHashMap(int initialCapacity) { - this(initialCapacity, DEFAULT_MIN_LOAD_FACTOR, DEFAULT_MAX_LOAD_FACTOR); - } - - /** - * Constructs an empty map with the specified initial capacity and the specified minimum and maximum load factor. - * - * @param initialCapacity the initial capacity. - * @param minLoadFactor the minimum load factor. - * @param maxLoadFactor the maximum load factor. - * @throws IllegalArgumentException if <tt>initialCapacity < 0 || (minLoadFactor < 0.0 || minLoadFactor >= 1.0) || - * (maxLoadFactor <= 0.0 || maxLoadFactor >= 1.0) || (minLoadFactor >= - * maxLoadFactor)</tt>. - */ - public OpenHashMap(int initialCapacity, double minLoadFactor, double maxLoadFactor) { - setUp(initialCapacity, minLoadFactor, maxLoadFactor); - } - - /** Removes all (key,value) associations from the receiver. Implicitly calls <tt>trimToSize()</tt>. */ - @Override - public void clear() { - Arrays.fill(this.state, FREE); - distinct = 0; - freeEntries = table.length; // delta - trimToSize(); - } - - /** - * Returns a deep copy of the receiver. - * - * @return a deep copy of the receiver. - */ - @Override - @SuppressWarnings("unchecked") - public Object clone() { - OpenHashMap<K,V> copy = (OpenHashMap<K,V>) super.clone(); - copy.table = copy.table.clone(); - copy.values = copy.values.clone(); - copy.state = copy.state.clone(); - return copy; - } - - /** - * Returns <tt>true</tt> if the receiver contains the specified key. - * - * @return <tt>true</tt> if the receiver contains the specified key. - */ - @SuppressWarnings("unchecked") - @Override - public boolean containsKey(Object key) { - return indexOfKey((K)key) >= 0; - } - - /** - * Returns <tt>true</tt> if the receiver contains the specified value. - * - * @return <tt>true</tt> if the receiver contains the specified value. - */ - @SuppressWarnings("unchecked") - @Override - public boolean containsValue(Object value) { - return indexOfValue((V)value) >= 0; - } - - /** - * Ensures that the receiver can hold at least the specified number of associations without needing to allocate new - * internal memory. If necessary, allocates new internal memory and increases the capacity of the receiver. <p> This - * method never need be called; it is for performance tuning only. Calling this method before <tt>put()</tt>ing a - * large number of associations boosts performance, because the receiver will grow only once instead of potentially - * many times and hash collisions get less probable. - * - * @param minCapacity the desired minimum capacity. - */ - @Override - public void ensureCapacity(int minCapacity) { - if (table.length < minCapacity) { - int newCapacity = nextPrime(minCapacity); - rehash(newCapacity); - } - } - - /** - * Applies a procedure to each key of the receiver, if any. Note: Iterates over the keys in no particular order. - * Subclasses can define a particular order, for example, "sorted by key". All methods which <i>can</i> be expressed - * in terms of this method (most methods can) <i>must guarantee</i> to use the <i>same</i> order defined by this - * method, even if it is no particular order. This is necessary so that, for example, methods <tt>keys</tt> and - * <tt>values</tt> will yield association pairs, not two uncorrelated lists. - * - * @param procedure the procedure to be applied. Stops iteration if the procedure returns <tt>false</tt>, otherwise - * continues. - * @return <tt>false</tt> if the procedure stopped before all keys where iterated over, <tt>true</tt> otherwise. - */ - @SuppressWarnings("unchecked") - public boolean forEachKey(ObjectProcedure<K> procedure) { - for (int i = table.length; i-- > 0;) { - if (state[i] == FULL && !procedure.apply((K)table[i])) { - return false; - } - } - return true; - } - - /** - * Applies a procedure to each (key,value) pair of the receiver, if any. Iteration order is guaranteed to be - * <i>identical</i> to the order used by method {@link #forEachKey(ObjectProcedure)}. - * - * @param procedure the procedure to be applied. Stops iteration if the procedure returns <tt>false</tt>, otherwise - * continues. - * @return <tt>false</tt> if the procedure stopped before all keys where iterated over, <tt>true</tt> otherwise. - */ - @SuppressWarnings("unchecked") - public boolean forEachPair(ObjectObjectProcedure<K,V> procedure) { - for (int i = table.length; i-- > 0;) { - if (state[i] == FULL && !procedure.apply((K)table[i], (V)values[i])) { - return false; - } - } - return true; - } - - /** - * Returns the value associated with the specified key. It is often a good idea to first check with {@link - * #containsKey(Object)} whether the given key has a value associated or not, i.e. whether there exists an association - * for the given key or not. - * - * @param key the key to be searched for. - * @return the value associated with the specified key; <tt>0</tt> if no such key is present. - */ - @SuppressWarnings("unchecked") - @Override - public V get(Object key) { - int i = indexOfKey((K)key); - if (i < 0) { - return null; - } //not contained - return (V)values[i]; - } - - /** - * @param key the key to be added to the receiver. - * @return the index where the key would need to be inserted, if it is not already contained. Returns -index-1 if the - * key is already contained at slot index. Therefore, if the returned index < 0, then it is already contained - * at slot -index-1. If the returned index >= 0, then it is NOT already contained and should be inserted at - * slot index. - */ - protected int indexOfInsertion(K key) { - Object[] tab = table; - byte[] stat = state; - int length = tab.length; - - int hash = key.hashCode() & 0x7FFFFFFF; - int i = hash % length; - int decrement = hash % (length - 2); // double hashing, see http://www.eece.unm.edu/faculty/heileman/hash/node4.html - //int decrement = (hash / length) % length; - if (decrement == 0) { - decrement = 1; - } - - // stop if we find a removed or free slot, or if we find the key itself - // do NOT skip over removed slots (yes, open addressing is like that...) - while (stat[i] == FULL && !equalsMindTheNull(key, tab[i])) { - i -= decrement; - //hashCollisions++; - if (i < 0) { - i += length; - } - } - - if (stat[i] == REMOVED) { - // stop if we find a free slot, or if we find the key itself. - // do skip over removed slots (yes, open addressing is like that...) - // assertion: there is at least one FREE slot. - int j = i; - while (stat[i] != FREE && (stat[i] == REMOVED || tab[i] != key)) { - i -= decrement; - //hashCollisions++; - if (i < 0) { - i += length; - } - } - if (stat[i] == FREE) { - i = j; - } - } - - - if (stat[i] == FULL) { - // key already contained at slot i. - // return a negative number identifying the slot. - return -i - 1; - } - // not already contained, should be inserted at slot i. - // return a number >= 0 identifying the slot. - return i; - } - - /** - * @param key the key to be searched in the receiver. - * @return the index where the key is contained in the receiver, returns -1 if the key was not found. - */ - protected int indexOfKey(K key) { - Object[] tab = table; - byte[] stat = state; - int length = tab.length; - - int hash = key.hashCode() & 0x7FFFFFFF; - int i = hash % length; - int decrement = hash % (length - 2); // double hashing, see http://www.eece.unm.edu/faculty/heileman/hash/node4.html - //int decrement = (hash / length) % length; - if (decrement == 0) { - decrement = 1; - } - - // stop if we find a free slot, or if we find the key itself. - // do skip over removed slots (yes, open addressing is like that...) - while (stat[i] != FREE && (stat[i] == REMOVED || !equalsMindTheNull(key, tab[i]))) { - i -= decrement; - //hashCollisions++; - if (i < 0) { - i += length; - } - } - - if (stat[i] == FREE) { - return -1; - } // not found - return i; //found, return index where key is contained - } - - /** - * @param value the value to be searched in the receiver. - * @return the index where the value is contained in the receiver, returns -1 if the value was not found. - */ - protected int indexOfValue(V value) { - Object[] val = values; - byte[] stat = state; - - for (int i = stat.length; --i >= 0;) { - if (stat[i] == FULL && equalsMindTheNull(val[i], value)) { - return i; - } - } - - return -1; // not found - } - - /** - * Fills all keys contained in the receiver into the specified list. Fills the list, starting at index 0. After this - * call returns the specified list has a new size that equals <tt>this.size()</tt>. - * This method can be used - * to iterate over the keys of the receiver. - * - * @param list the list to be filled, can have any size. - */ - @SuppressWarnings("unchecked") - public void keys(List<K> list) { - list.clear(); - - - Object [] tab = table; - byte[] stat = state; - - for (int i = tab.length; i-- > 0;) { - if (stat[i] == FULL) { - list.add((K)tab[i]); - } - } - } - - /** - * Associates the given key with the given value. Replaces any old <tt>(key,someOtherValue)</tt> association, if - * existing. - * - * @param key the key the value shall be associated with. - * @param value the value to be associated. - * @return <tt>true</tt> if the receiver did not already contain such a key; <tt>false</tt> if the receiver did - * already contain such a key - the new value has now replaced the formerly associated value. - */ - @SuppressWarnings("unchecked") - @Override - public V put(K key, V value) { - int i = indexOfInsertion(key); - if (i < 0) { //already contained - i = -i - 1; - V previous = (V) this.values[i]; - this.values[i] = value; - return previous; - } - - if (this.distinct > this.highWaterMark) { - int newCapacity = chooseGrowCapacity(this.distinct + 1, this.minLoadFactor, this.maxLoadFactor); - rehash(newCapacity); - return put(key, value); - } - - this.table[i] = key; - this.values[i] = value; - if (this.state[i] == FREE) { - this.freeEntries--; - } - this.state[i] = FULL; - this.distinct++; - - if (this.freeEntries < 1) { //delta - int newCapacity = chooseGrowCapacity(this.distinct + 1, this.minLoadFactor, this.maxLoadFactor); - rehash(newCapacity); - } - - return null; - } - - /** - * Rehashes the contents of the receiver into a new table with a smaller or larger capacity. This method is called - * automatically when the number of keys in the receiver exceeds the high water mark or falls below the low water - * mark. - */ - @SuppressWarnings("unchecked") - protected void rehash(int newCapacity) { - int oldCapacity = table.length; - //if (oldCapacity == newCapacity) return; - - Object[] oldTable = table; - Object[] oldValues = values; - byte[] oldState = state; - - Object[] newTable = new Object[newCapacity]; - Object[] newValues = new Object[newCapacity]; - byte[] newState = new byte[newCapacity]; - - this.lowWaterMark = chooseLowWaterMark(newCapacity, this.minLoadFactor); - this.highWaterMark = chooseHighWaterMark(newCapacity, this.maxLoadFactor); - - this.table = newTable; - this.values = newValues; - this.state = newState; - this.freeEntries = newCapacity - this.distinct; // delta - - for (int i = oldCapacity; i-- > 0;) { - if (oldState[i] == FULL) { - Object element = oldTable[i]; - int index = indexOfInsertion((K)element); - newTable[index] = element; - newValues[index] = oldValues[i]; - newState[index] = FULL; - } - } - } - - /** - * Removes the given key with its associated element from the receiver, if present. - * - * @param key the key to be removed from the receiver. - * @return <tt>true</tt> if the receiver contained the specified key, <tt>false</tt> otherwise. - */ - @SuppressWarnings("unchecked") - @Override - public V remove(Object key) { - int i = indexOfKey((K)key); - if (i < 0) { - return null; - } - // key not contained - V removed = (V) values[i]; - - this.state[i] = REMOVED; - //this.values[i]=0; // delta - this.distinct--; - - if (this.distinct < this.lowWaterMark) { - int newCapacity = chooseShrinkCapacity(this.distinct, this.minLoadFactor, this.maxLoadFactor); - rehash(newCapacity); - } - - return removed; - } - - /** - * Initializes the receiver. - * - * @param initialCapacity the initial capacity of the receiver. - * @param minLoadFactor the minLoadFactor of the receiver. - * @param maxLoadFactor the maxLoadFactor of the receiver. - * @throws IllegalArgumentException if <tt>initialCapacity < 0 || (minLoadFactor < 0.0 || minLoadFactor >= 1.0) || - * (maxLoadFactor <= 0.0 || maxLoadFactor >= 1.0) || (minLoadFactor >= - * maxLoadFactor)</tt>. - */ - @Override - protected void setUp(int initialCapacity, double minLoadFactor, double maxLoadFactor) { - int capacity = initialCapacity; - super.setUp(capacity, minLoadFactor, maxLoadFactor); - capacity = nextPrime(capacity); - if (capacity == 0) { - capacity = 1; - } // open addressing needs at least one FREE slot at any time. - - this.table = new Object[capacity]; - this.values = new Object[capacity]; - this.state = new byte[capacity]; - - // memory will be exhausted long before this pathological case happens, anyway. - this.minLoadFactor = minLoadFactor; - if (capacity == PrimeFinder.LARGEST_PRIME) { - this.maxLoadFactor = 1.0; - } else { - this.maxLoadFactor = maxLoadFactor; - } - - this.distinct = 0; - this.freeEntries = capacity; // delta - - // lowWaterMark will be established upon first expansion. - // establishing it now (upon instance construction) would immediately make the table shrink upon first put(...). - // After all the idea of an "initialCapacity" implies violating lowWaterMarks when an object is young. - // See ensureCapacity(...) - this.lowWaterMark = 0; - this.highWaterMark = chooseHighWaterMark(capacity, this.maxLoadFactor); - } - - /** - * Trims the capacity of the receiver to be the receiver's current size. Releases any superfluous internal memory. An - * application can use this operation to minimize the storage of the receiver. - */ - @Override - public void trimToSize() { - // * 1.2 because open addressing's performance exponentially degrades beyond that point - // so that even rehashing the table can take very long - int newCapacity = nextPrime((int) (1 + 1.2 * size())); - if (table.length > newCapacity) { - rehash(newCapacity); - } - } - - /** - * Access for unit tests. - * @param capacity - * @param minLoadFactor - * @param maxLoadFactor - */ - void getInternalFactors(int[] capacity, - double[] minLoadFactor, - double[] maxLoadFactor) { - capacity[0] = table.length; - minLoadFactor[0] = this.minLoadFactor; - maxLoadFactor[0] = this.maxLoadFactor; - } - - private class MapEntry implements Entry<K,V> { - private final K key; - private final V value; - - MapEntry(K key, V value) { - this.key = key; - this.value = value; - } - - @Override - public K getKey() { - return key; - } - - @Override - public V getValue() { - return value; - } - - @Override - public V setValue(V value) { - throw new UnsupportedOperationException("Map.Entry.setValue not supported for OpenHashMap"); - } - - } - - /** - * Allocate a set to contain Map.Entry objects for the pairs and return it. - */ - @Override - public Set<Entry<K,V>> entrySet() { - final Set<Entry<K, V>> entries = new OpenHashSet<>(); - forEachPair(new ObjectObjectProcedure<K,V>() { - @Override - public boolean apply(K key, V value) { - entries.add(new MapEntry(key, value)); - return true; - } - }); - return entries; - } - - /** - * Allocate a set to contain keys and return it. - * This violates the 'backing' provisions of the map interface. - */ - @Override - public Set<K> keySet() { - final Set<K> keys = new OpenHashSet<>(); - forEachKey(new ObjectProcedure<K>() { - @Override - public boolean apply(K element) { - keys.add(element); - return true; - } - }); - return keys; - } - - @Override - public void putAll(Map<? extends K,? extends V> m) { - for (Entry<? extends K, ? extends V> e : m.entrySet()) { - put(e.getKey(), e.getValue()); - } - } - - /** - * Allocate a list to contain the values and return it. - * This violates the 'backing' provision of the Map interface. - */ - @Override - public Collection<V> values() { - final List<V> valueList = new ArrayList<>(); - forEachPair(new ObjectObjectProcedure<K,V>() { - @Override - public boolean apply(K key, V value) { - valueList.add(value); - return true; - } - }); - return valueList; - } - - @SuppressWarnings("unchecked") - @Override - public boolean equals(Object obj) { - if (!(obj instanceof OpenHashMap)) { - return false; - } - final OpenHashMap<K,V> o = (OpenHashMap<K,V>) obj; - if (o.size() != size()) { - return false; - } - final boolean[] equal = new boolean[1]; - equal[0] = true; - forEachPair(new ObjectObjectProcedure<K,V>() { - @Override - public boolean apply(K key, V value) { - Object ov = o.get(key); - if (!value.equals(ov)) { - equal[0] = false; - return false; - } - return true; - } - }); - return equal[0]; - } - - @Override - public String toString() { - final StringBuilder sb = new StringBuilder(); - sb.append('{'); - forEachPair(new ObjectObjectProcedure<K,V>() { - @Override - public boolean apply(K key, V value) { - sb.append('['); - sb.append(key); - sb.append(" -> "); - sb.append(value); - sb.append("] "); - return true; - } - }); - sb.append('}'); - return sb.toString(); - } -} http://git-wip-us.apache.org/repos/asf/mahout/blob/49ad8cb4/core/src/main/java/org/apache/mahout/math/map/PrimeFinder.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/map/PrimeFinder.java b/core/src/main/java/org/apache/mahout/math/map/PrimeFinder.java deleted file mode 100644 index 8f32e86..0000000 --- a/core/src/main/java/org/apache/mahout/math/map/PrimeFinder.java +++ /dev/null @@ -1,145 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -package org.apache.mahout.math.map; - -import java.util.Arrays; - -/** - * Not of interest for users; only for implementors of hashtables. - * Used to keep hash table capacities prime numbers. - * - * <p>Choosing prime numbers as hash table capacities is a good idea to keep them working fast, - * particularly under hash table expansions. - * - * <p>However, JDK 1.2, JGL 3.1 and many other toolkits do nothing to keep capacities prime. - * This class provides efficient means to choose prime capacities. - * - * <p>Choosing a prime is <tt>O(log 300)</tt> (binary search in a list of 300 int's). - * Memory requirements: 1 KB static memory. - * - */ -public final class PrimeFinder { - - /** The largest prime this class can generate; currently equal to <tt>Integer.MAX_VALUE</tt>. */ - public static final int LARGEST_PRIME = Integer.MAX_VALUE; //yes, it is prime. - - /** - * The prime number list consists of 11 chunks. Each chunk contains prime numbers. A chunk starts with a prime P1. The - * next element is a prime P2. P2 is the smallest prime for which holds: P2 >= 2*P1. The next element is P3, for which - * the same holds with respect to P2, and so on. - * - * Chunks are chosen such that for any desired capacity >= 1000 the list includes a prime number <= desired capacity * - * 1.11 (11%). For any desired capacity >= 200 the list includes a prime number <= desired capacity * 1.16 (16%). For - * any desired capacity >= 16 the list includes a prime number <= desired capacity * 1.21 (21%). - * - * Therefore, primes can be retrieved which are quite close to any desired capacity, which in turn avoids wasting - * memory. For example, the list includes 1039,1117,1201,1277,1361,1439,1523,1597,1759,1907,2081. So if you need a - * prime >= 1040, you will find a prime <= 1040*1.11=1154. - * - * Chunks are chosen such that they are optimized for a hashtable growthfactor of 2.0; If your hashtable has such a - * growthfactor then, after initially "rounding to a prime" upon hashtable construction, it will later expand to prime - * capacities such that there exist no better primes. - * - * In total these are about 32*10=320 numbers -> 1 KB of static memory needed. If you are stingy, then delete every - * second or fourth chunk. - */ - - private static final int[] PRIME_CAPACITIES = { - //chunk #0 - LARGEST_PRIME, - - //chunk #1 - 5, 11, 23, 47, 97, 197, 397, 797, 1597, 3203, 6421, 12853, 25717, 51437, 102877, 205759, - 411527, 823117, 1646237, 3292489, 6584983, 13169977, 26339969, 52679969, 105359939, - 210719881, 421439783, 842879579, 1685759167, - - //chunk #2 - 433, 877, 1759, 3527, 7057, 14143, 28289, 56591, 113189, 226379, 452759, 905551, 1811107, - 3622219, 7244441, 14488931, 28977863, 57955739, 115911563, 231823147, 463646329, 927292699, - 1854585413, - - //chunk #3 - 953, 1907, 3821, 7643, 15287, 30577, 61169, 122347, 244703, 489407, 978821, 1957651, 3915341, - 7830701, 15661423, 31322867, 62645741, 125291483, 250582987, 501165979, 1002331963, - 2004663929, - - //chunk #4 - 1039, 2081, 4177, 8363, 16729, 33461, 66923, 133853, 267713, 535481, 1070981, 2141977, 4283963, - 8567929, 17135863, 34271747, 68543509, 137087021, 274174111, 548348231, 1096696463, - - //chunk #5 - 31, 67, 137, 277, 557, 1117, 2237, 4481, 8963, 17929, 35863, 71741, 143483, 286973, 573953, - 1147921, 2295859, 4591721, 9183457, 18366923, 36733847, 73467739, 146935499, 293871013, - 587742049, 1175484103, - - //chunk #6 - 599, 1201, 2411, 4831, 9677, 19373, 38747, 77509, 155027, 310081, 620171, 1240361, 2480729, - 4961459, 9922933, 19845871, 39691759, 79383533, 158767069, 317534141, 635068283, 1270136683, - - //chunk #7 - 311, 631, 1277, 2557, 5119, 10243, 20507, 41017, 82037, 164089, 328213, 656429, 1312867, - 2625761, 5251529, 10503061, 21006137, 42012281, 84024581, 168049163, 336098327, 672196673, - 1344393353, - - //chunk #8 - 3, 7, 17, 37, 79, 163, 331, 673, 1361, 2729, 5471, 10949, 21911, 43853, 87719, 175447, 350899, - 701819, 1403641, 2807303, 5614657, 11229331, 22458671, 44917381, 89834777, 179669557, - 359339171, 718678369, 1437356741, - - //chunk #9 - 43, 89, 179, 359, 719, 1439, 2879, 5779, 11579, 23159, 46327, 92657, 185323, 370661, 741337, - 1482707, 2965421, 5930887, 11861791, 23723597, 47447201, 94894427, 189788857, 379577741, - 759155483, 1518310967, - - //chunk #10 - 379, 761, 1523, 3049, 6101, 12203, 24407, 48817, 97649, 195311, 390647, 781301, 1562611, - 3125257, 6250537, 12501169, 25002389, 50004791, 100009607, 200019221, 400038451, 800076929, - 1600153859 - }; - - - static { //initializer - // The above prime numbers are formatted for human readability. - // To find numbers fast, we sort them once and for all. - - Arrays.sort(PRIME_CAPACITIES); - } - - /** Makes this class non instantiable, but still let's others inherit from it. */ - private PrimeFinder() { - } - - /** - * Returns a prime number which is {@code <= desiredCapacity} and very close to {@code desiredCapacity} - * (within 11% if {@code desiredCapacity <= 1000}). - * - * @param desiredCapacity the capacity desired by the user. - * @return the capacity which should be used for a hashtable. - */ - public static int nextPrime(int desiredCapacity) { - int i = Arrays.binarySearch(PRIME_CAPACITIES, desiredCapacity); - if (i < 0) { - // desired capacity not found, choose next prime greater than desired capacity - i = -i - 1; // remember the semantics of binarySearch... - } - return PRIME_CAPACITIES[i]; - } - -} http://git-wip-us.apache.org/repos/asf/mahout/blob/49ad8cb4/core/src/main/java/org/apache/mahout/math/map/QuickOpenIntIntHashMap.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/map/QuickOpenIntIntHashMap.java b/core/src/main/java/org/apache/mahout/math/map/QuickOpenIntIntHashMap.java deleted file mode 100644 index a5a54af..0000000 --- a/core/src/main/java/org/apache/mahout/math/map/QuickOpenIntIntHashMap.java +++ /dev/null @@ -1,215 +0,0 @@ -///* -//Copyright � 1999 CERN - European Organization for Nuclear Research. -//Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose -//is hereby granted without fee, provided that the above copyright notice appear in all copies and -//that both that copyright notice and this permission notice appear in supporting documentation. -//CERN makes no representations about the suitability of this software for any purpose. -//It is provided "as is" without expressed or implied warranty. -//*/ -//package org.apache.mahout.math.map; -// -///** -// * Status: Experimental; Do not use for production yet. Hash map holding (key,value) associations of type -// * <tt>(int-->int)</tt>; Automatically grows and shrinks as needed; Implemented using open addressing with double -// * hashing. First see the <a href="package-summary.html">package summary</a> and javadoc <a -// * href="package-tree.html">tree view</a> to get the broad picture. -// * -// * Implements open addressing with double hashing, using "Brent's variation". Brent's variation slows insertions a bit -// * down (not much) but reduces probes (collisions) for successful searches, in particular for large load factors. (It -// * does not improve unsuccessful searches.) See D. Knuth, Searching and Sorting, 3rd ed., p.533-545 -// * -// * @author [email protected] -// * @version 1.0, 09/24/99 -// * @see java.util.HashMap -// */ -//class QuickOpenIntIntHashMap extends OpenIntIntHashMap { -// //public int totalProbesSaved = 0; // benchmark only -// -// /** Constructs an empty map with default capacity and default load factors. */ -// QuickOpenIntIntHashMap() { -// this(DEFAULT_CAPACITY); -// } -// -// /** -// * Constructs an empty map with the specified initial capacity and default load factors. -// * -// * @param initialCapacity the initial capacity of the map. -// * @throws IllegalArgumentException if the initial capacity is less than zero. -// */ -// QuickOpenIntIntHashMap(int initialCapacity) { -// this(initialCapacity, DEFAULT_MIN_LOAD_FACTOR, DEFAULT_MAX_LOAD_FACTOR); -// } -// -// /** -// * Constructs an empty map with the specified initial capacity and the specified minimum and maximum load factor. -// * -// * @param initialCapacity the initial capacity. -// * @param minLoadFactor the minimum load factor. -// * @param maxLoadFactor the maximum load factor. -// * @throws IllegalArgumentException if <tt>initialCapacity < 0 || (minLoadFactor < 0.0 || minLoadFactor >= 1.0) || -// * (maxLoadFactor <= 0.0 || maxLoadFactor >= 1.0) || (minLoadFactor >= -// * maxLoadFactor)</tt>. -// */ -// QuickOpenIntIntHashMap(int initialCapacity, double minLoadFactor, double maxLoadFactor) { -// setUp(initialCapacity, minLoadFactor, maxLoadFactor); -// } -// -// /** -// * Associates the given key with the given value. Replaces any old <tt>(key,someOtherValue)</tt> association, if -// * existing. -// * -// * @param key the key the value shall be associated with. -// * @param value the value to be associated. -// * @return <tt>true</tt> if the receiver did not already contain such a key; <tt>false</tt> if the receiver did -// * already contain such a key - the new value has now replaced the formerly associated value. -// */ -// @Override -// public boolean put(int key, int value) { -// /* -// This is open addressing with double hashing, using "Brent's variation". -// Brent's variation slows insertions a bit down (not much) but reduces probes (collisions) for successful searches, -// in particular for large load factors. -// (It does not improve unsuccessful searches.) -// See D. Knuth, Searching and Sorting, 3rd ed., p.533-545 -// -// h1(key) = hash % M -// h2(key) = decrement = Max(1, hash/M % M) -// M is prime = capacity = table.length -// probing positions are table[(h1-j*h2) % M] for j=0,1,... -// (M and h2 could also be chosen differently, but h2 is required to be relative prime to M.) -// */ -// -// int[] tab = table; -// byte[] stat = state; -// int length = tab.length; -// -// int hash = HashFunctions.hash(key) & 0x7FFFFFFF; -// int i = hash % length; -// int decrement = (hash / length) % length; -// if (decrement == 0) { -// decrement = 1; -// } -// -// // stop if we find a removed or free slot, or if we find the key itself -// // do NOT skip over removed slots (yes, open addressing is like that...) -// //int comp = comparisons; -// int t = 0; // the number of probes -// int p0 = i; // the first position to probe -// while (stat[i] == FULL && tab[i] != key) { -// t++; -// i -= decrement; -// //hashCollisions++; -// if (i < 0) { -// i += length; -// } -// } -// if (stat[i] == FULL) { -// // key already contained at slot i. -// this.values[i] = value; -// return false; -// } -// // not already contained, should be inserted at slot i. -// -// if (this.distinct > this.highWaterMark) { -// int newCapacity = chooseGrowCapacity(this.distinct + 1, this.minLoadFactor, this.maxLoadFactor); -// rehash(newCapacity); -// return put(key, value); -// } -// -// /* -// Brent's variation does a local reorganization to reduce probes. It essentially means: -// We test whether it is possible to move the association we probed first (table[p0]) out of the way. -// If this is possible, it will reduce probes for the key to be inserted, since it takes its place; -// it gets hit earlier. -// However, future probes for the key that we move out of the way will increase. -// Thus we only move it out of the way, if we have a net gain, that is, if we save more probes than we loose. -// For the first probe we safe more than we loose if the number of probes we needed was >=2 (t>=2). -// If the first probe cannot be moved out of the way, we try the next probe (p1). -// Now we safe more than we loose if t>=3. -// We repeat this until we find that we cannot gain or that we can indeed move p(x) out of the way. -// -// Note: Under the great majority of insertions t<=1, so the loop is entered very infrequently. -// */ -// while (t > 1) { -// int key0 = tab[p0]; -// hash = HashFunctions.hash(key0) & 0x7FFFFFFF; -// decrement = (hash / length) % length; -// if (decrement == 0) { -// decrement = 1; -// } -// int pc = p0 - decrement; // pc = (p0-j*decrement) % M, j=1,2,.. -// if (pc < 0) { -// pc += length; -// } -// -// if (stat[pc] != FREE) { // not a free slot, continue searching for free slot to move to, or break. -// p0 = pc; -// t--; -// } else { // free or removed slot found, now move... -// tab[pc] = key0; -// stat[pc] = FULL; -// values[pc] = values[p0]; -// i = p0; // prepare to insert: table[p0]=key -// t = 0; // break loop -// } -// } -// -// this.table[i] = key; -// this.values[i] = value; -// if (this.state[i] == FREE) { -// this.freeEntries--; -// } -// this.state[i] = FULL; -// this.distinct++; -// -// if (this.freeEntries < 1) { //delta -// int newCapacity = chooseGrowCapacity(this.distinct + 1, this.minLoadFactor, this.maxLoadFactor); -// rehash(newCapacity); -// } -// -// return true; -// } -// -// /** -// * Rehashes the contents of the receiver into a new table with a smaller or larger capacity. This method is called -// * automatically when the number of keys in the receiver exceeds the high water mark or falls below the low water -// * mark. -// */ -// @Override -// protected void rehash(int newCapacity) { -// int oldCapacity = table.length; -// //if (oldCapacity == newCapacity) return; -// -// int[] oldTable = table; -// int[] oldValues = values; -// byte[] oldState = state; -// -// int[] newTable = new int[newCapacity]; -// int[] newValues = new int[newCapacity]; -// byte[] newState = new byte[newCapacity]; -// -// this.lowWaterMark = chooseLowWaterMark(newCapacity, this.minLoadFactor); -// this.highWaterMark = chooseHighWaterMark(newCapacity, this.maxLoadFactor); -// -// this.table = newTable; -// this.values = newValues; -// this.state = newState; -// this.freeEntries = newCapacity - this.distinct; // delta -// -// int tmp = this.distinct; -// this.distinct = Integer.MIN_VALUE; // switch of watermarks -// for (int i = oldCapacity; i-- > 0;) { -// if (oldState[i] == FULL) { -// put(oldTable[i], oldValues[i]); -// /* -// int element = oldTable[i]; -// int index = indexOfInsertion(element); -// newTable[index]=element; -// newValues[index]=oldValues[i]; -// newState[index]=FULL; -// */ -// } -// } -// this.distinct = tmp; -// } -//} http://git-wip-us.apache.org/repos/asf/mahout/blob/49ad8cb4/core/src/main/java/org/apache/mahout/math/map/package-info.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/map/package-info.java b/core/src/main/java/org/apache/mahout/math/map/package-info.java deleted file mode 100644 index 9356f45..0000000 --- a/core/src/main/java/org/apache/mahout/math/map/package-info.java +++ /dev/null @@ -1,250 +0,0 @@ -/** - * <HTML> - * <BODY> - * Automatically growing and shrinking maps holding objects or primitive - * data types such as <tt>int</tt>, <tt>double</tt>, etc. Currently all maps are - * based upon hashing. - * <h2><a name="Overview"></a>1. Overview</h2> - * <p>The map package offers flexible object oriented abstractions modelling automatically - * resizing maps. It is designed to be scalable in terms of performance and memory - * requirements.</p> - * <p>Features include: </p> - * <p></p> - * <ul> - * <li>Maps operating on objects as well as all primitive data types such as <code>int</code>, - * <code>double</code>, etc. - * </li> - * <li>Compact representations</li> - * <li>Support for quick access to associations</li> - * <li>A number of general purpose map operations</li> - * </ul> - * <p>File-based I/O can be achieved through the standard Java built-in serialization - * mechanism. All classes implement the {@link java.io.Serializable} interface. - * However, the toolkit is entirely decoupled from advanced I/O. It provides data - * structures and algorithms only. - * <p> This toolkit borrows some terminology from the Javasoft <a - * href="http://www.javasoft.com/products/jdk/1.2/docs/guide/collections/index.html"> - * Collections framework</a> written by Josh Bloch and introduced in JDK 1.2. - * <h2>2. Introduction</h2> - * <p>A map is an associative container that manages a set of (key,value) pairs. - * It is useful for implementing a collection of one-to-one mappings. A (key,value) - * pair is called an <i>association</i>. A value can be looked up up via its key. - * Associations can quickly be set, removed and retrieved. They are stored in a - * hashing structure based on the hash code of their keys, which is obtained by - * using a hash function. </p> - * <p> A map can, for example, contain <tt>Name-->Location</tt> associations like - * <tt>{("Pete", "Geneva"), ("Steve", "Paris"), ("Robert", "New York")}</tt> used - * in address books or <tt>Index-->Value</tt> mappings like <tt>{(0, 100), (3, - * 1000), (100000, 70)}</tt> representing sparse lists or matrices. For example - * this could mean at index 0 we have a value of 100, at index 3 we have a value - * of 1000, at index 1000000 we have a value of 70, and at all other indexes we - * have a value of, say, zero. Another example is a map of IP addresses to domain - * names (DNS). Maps can also be useful to represent<i> multi sets</i>, that is, - * sets where elements can occur more than once. For multi sets one would have - * <tt>Value-->Frequency</tt> mappings like <tt>{(100, 1), (50, 1000), (101, 3))}</tt> - * meaning element 100 occurs 1 time, element 50 occurs 1000 times, element 101 - * occurs 3 times. Further, maps can also manage <tt>ObjectIdentifier-->Object</tt> - * mappings like <tt>{(12, obj1), (7, obj2), (10000, obj3), (9, obj4)}</tt> used - * in Object Databases. - * <p> A map cannot contain two or more <i>equal</i> keys; a key can map to at most - * one value. However, more than one key can map to identical values. For primitive - * data types "equality" of keys is defined as identity (operator <tt>==</tt>). - * For maps using <tt>Object</tt> keys, the meaning of "equality" can be specified - * by the user upon instance construction. It can either be defined to be identity - * (operator <tt>==</tt>) or to be given by the method {@link java.lang.Object#equals(Object)}. - * Associations of kind <tt>(AnyType,Object)</tt> can be of the form <tt>(AnyKey,null) - * </tt>, i.e. values can be <tt>null</tt>. - * <p> The classes of this package make no guarantees as to the order of the elements - * returned by iterators; in particular, they do not guarantee that the order will - * remain constant over time. - * <h2></h2> - * <h4>Copying</h4> - * <p> - * <p>Any map can be copied. A copy is <i>equal</i> to the original but entirely - * independent of the original. So changes in the copy are not reflected in the - * original, and vice-versa. - * <h2>3. Package organization</h2> - * <p>For most primitive data types and for objects there exists a separate map version. - * All versions are just the same, except that they operate on different data types. - * Colt includes two kinds of implementations for maps: The two different implementations - * are tagged <b>Chained</b> and <b>Open</b>. - * Note: Chained is no more included. Wherever it is mentioned it is of historic interest only.</p> - * <ul> - * <li><b>Chained</b> uses extendible separate chaining with chains holding unsorted - * dynamically linked collision lists. - * <li><b>Open</b> uses extendible open addressing with double hashing. - * </ul> - * <p>Class naming follows the schema <tt><Implementation><KeyType><ValueType>HashMap</tt>. - * For example, a {@link org.apache.mahout.math.map.OpenIntDoubleHashMap} holds <tt>(int-->double)</tt> - * associations and is implemented with open addressing. A {@link org.apache.mahout.math.map.OpenIntObjectHashMap} - * holds <tt>(int-->Object)</tt> associations and is implemented with open addressing. - * </p> - * <p>The classes for maps of a given (key,value) type are derived from a common - * abstract base class tagged <tt>Abstract<KeyType><ValueType></tt><tt>Map</tt>. - * For example, all maps operating on <tt>(int-->double)</tt> associations are - * derived from {@link org.apache.mahout.math.map.AbstractIntDoubleMap}, which in turn is derived - * from an abstract base class tying together all maps regardless of assocation - * type, {@link org.apache.mahout.math.set.AbstractSet}. The abstract base classes provide skeleton - * implementations for all but few methods. Experimental layouts (such as chaining, - * open addressing, extensible hashing, red-black-trees, etc.) can easily be implemented - * and inherit a rich set of functionality. Have a look at the javadoc <a href="package-tree.html">tree - * view</a> to get the broad picture.</p> - * <h2>4. Example usage</h2> - * <TABLE> - * <TD CLASS="PRE"> - * <PRE> - * int[] keys = {0 , 3 , 100000, 9 }; - * double[] values = {100.0, 1000.0, 70.0 , 71.0}; - * AbstractIntDoubleMap map = new OpenIntDoubleHashMap(); - * // add several associations - * for (int i=0; i < keys.length; i++) map.put(keys[i], values[i]); - * log.info("map="+map); - * log.info("size="+map.size()); - * log.info(map.containsKey(3)); - * log.info("get(3)="+map.get(3)); - * log.info(map.containsKey(4)); - * log.info("get(4)="+map.get(4)); - * log.info(map.containsValue(71.0)); - * log.info("keyOf(71.0)="+map.keyOf(71.0)); - * // remove one association - * map.removeKey(3); - * log.info("\nmap="+map); - * log.info(map.containsKey(3)); - * log.info("get(3)="+map.get(3)); - * log.info(map.containsValue(1000.0)); - * log.info("keyOf(1000.0)="+map.keyOf(1000.0)); - * // clear - * map.clear(); - * log.info("\nmap="+map); - * log.info("size="+map.size()); - * </PRE> - * </TD> - * </TABLE> - * yields the following output - * <TABLE> - * <TD CLASS="PRE"> - * <PRE> - * map=[0->100.0, 3->1000.0, 9->71.0, 100000->70.0] - * size=4 - * true - * get(3)=1000.0 - * false - * get(4)=0.0 - * true - * keyOf(71.0)=9 - * map=[0->100.0, 9->71.0, 100000->70.0] - * false - * get(3)=0.0 - * false - * keyOf(1000.0)=-2147483648 - * map=[] - * size=0 - * </PRE> - * </TD> - * </TABLE> - * <h2> 5. Notes </h2> - * <p> - * Note that implementations are not synchronized. - * <p> - * Choosing efficient parameters for hash maps is not always easy. - * However, since parameters determine efficiency and memory requirements, here is a quick guide how to choose them. - * If your use case does not heavily operate on hash maps but uses them just because they provide - * convenient functionality, you can safely skip this section. - * For those of you who care, read on. - * <p> - * There are three parameters that can be customized upon map construction: <tt>initialCapacity</tt>, - * <tt>minLoadFactor</tt> and <tt>maxLoadFactor</tt>. - * The more memory one can afford, the faster a hash map. - * The hash map's capacity is the maximum number of associations that can be added without needing to allocate new - * internal memory. - * A larger capacity means faster adding, searching and removing. - * The <tt>initialCapacity</tt> corresponds to the capacity used upon instance construction. - * <p> - * The <tt>loadFactor</tt> of a hash map measures the degree of "fullness". - * It is given by the number of assocations (<tt>size()</tt>) - * divided by the hash map capacity <tt>(0.0 <= loadFactor <= 1.0)</tt>. - * The more associations are added, the larger the loadFactor and the more hash map performance degrades. - * Therefore, when the loadFactor exceeds a customizable threshold (<tt>maxLoadFactor</tt>), the hash map is - * automatically grown. - * In such a way performance degradation can be avoided. - * Similarly, when the loadFactor falls below a customizable threshold (<tt>minLoadFactor</tt>), the hash map is - * automatically shrinked. - * In such a way excessive memory consumption can be avoided. - * Automatic resizing (both growing and shrinking) obeys the following invariant: - * <p> - * <tt>capacity * minLoadFactor <= size() <= capacity * maxLoadFactor</tt> - * <p> The term <tt>capacity * minLoadFactor</tt> is called the <i>low water mark</i>, - * <tt>capacity * maxLoadFactor</tt> is called the <i>high water mark</i>. In other - * words, the number of associations may vary within the water mark constraints. - * When it goes out of range, the map is automatically resized and memory consumption - * changes proportionally. - * <ul> - * <li>To tune for memory at the expense of performance, both increase <tt>minLoadFactor</tt> and - * <tt>maxLoadFactor</tt>. - * <li>To tune for performance at the expense of memory, both decrease <tt>minLoadFactor</tt> and - * <tt>maxLoadFactor</tt>. - * As as special case set <tt>minLoadFactor=0</tt> to avoid any automatic shrinking. - * </ul> - * Resizing large hash maps can be time consuming, <tt>O(size())</tt>, and should be avoided if possible (maintaining - * primes is not the reason). - * Unnecessary growing operations can be avoided if the number of associations is known before they are added, or can be - * estimated.<p> - * In such a case good parameters are as follows: - * <p> - * <i>For chaining:</i> - * <br>Set the <tt>initialCapacity = 1.4*expectedSize</tt> or greater. - * <br>Set the <tt>maxLoadFactor = 0.8</tt> or greater. - * <p> - * <i>For open addressing:</i> - * <br>Set the <tt>initialCapacity = 2*expectedSize</tt> or greater. Alternatively call <tt>ensureCapacity(...)</tt>. - * <br>Set the <tt>maxLoadFactor = 0.5</tt>. - * <br>Never set <tt>maxLoadFactor > 0.55</tt>; open addressing exponentially slows down beyond that point. - * <p> - * In this way the hash map will never need to grow and still stay fast. - * It is never a good idea to set <tt>maxLoadFactor < 0.1</tt>, - * because the hash map would grow too often. - * If it is entirelly unknown how many associations the application will use, - * the default constructor should be used. The map will grow and shrink as needed. - * <p> - * <b>Comparision of chaining and open addressing</b> - * <p> Chaining is faster than open addressing, when assuming unconstrained memory - * consumption. Open addressing is more space efficient than chaining, because - * it does not create entry objects but uses primitive arrays which are considerably - * smaller. Entry objects consume significant amounts of memory compared to the - * information they actually hold. Open addressing also poses no problems to the - * garbage collector. In contrast, chaining can create millions of entry objects - * which are linked; a nightmare for any garbage collector. In addition, entry - * object creation is a bit slow. <br> - * Therefore, with the same amount of memory, or even less memory, hash maps with - * larger capacity can be maintained under open addressing, which yields smaller - * loadFactors, which in turn keeps performance competitive with chaining. In our - * benchmarks, using significantly less memory, open addressing usually is not - * more than 1.2-1.5 times slower than chaining. - * <p><b>Further readings</b>: - * <br>Knuth D., The Art of Computer Programming: Searching and Sorting, 3rd ed. - * <br>Griswold W., Townsend G., The Design and Implementation of Dynamic Hashing for Sets and Tables in Icon, - * Software - Practice and Experience, Vol. 23(4), 351-367 (April 1993). - * <br>Larson P., Dynamic hash tables, Comm. of the ACM, 31, (4), 1988. - * <p> - * <b>Performance:</b> - * <p> - * Time complexity: - * <br>The classes offer <i>expected</i> time complexity <tt>O(1)</tt> (i.e. constant time) for the basic operations - * <tt>put</tt>, <tt>get</tt>, <tt>removeKey</tt>, <tt>containsKey</tt> and <tt>size</tt>, - * assuming the hash function disperses the elements properly among the buckets. - * Otherwise, pathological cases, although highly improbable, can occur, degrading performance to <tt>O(N)</tt> in the - * worst case. - * Operations <tt>containsValue</tt> and <tt>keyOf</tt> are <tt>O(N)</tt>. - * <p> - * Memory requirements for <i>open addressing</i>: - * <br>worst case: <tt>memory [bytes] = (1/minLoadFactor) * size() * (1 + sizeOf(key) + sizeOf(value))</tt>. - * <br>best case: <tt>memory [bytes] = (1/maxLoadFactor) * size() * (1 + sizeOf(key) + sizeOf(value))</tt>. - * Where <tt>sizeOf(int) = 4</tt>, <tt>sizeOf(double) = 8</tt>, <tt>sizeOf(Object) = 4</tt>, etc. - * Thus, an <tt>OpenIntIntHashMap</tt> with minLoadFactor=0.25 and maxLoadFactor=0.5 and 1000000 associations uses - * between 17 MB and 34 MB. - * The same map with 1000 associations uses between 17 and 34 KB. - * <p> - * </BODY> - * </HTML> - */ -package org.apache.mahout.math.map; http://git-wip-us.apache.org/repos/asf/mahout/blob/49ad8cb4/core/src/main/java/org/apache/mahout/math/package-info.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/package-info.java b/core/src/main/java/org/apache/mahout/math/package-info.java deleted file mode 100644 index de664f0..0000000 --- a/core/src/main/java/org/apache/mahout/math/package-info.java +++ /dev/null @@ -1,4 +0,0 @@ -/** - * Core base classes; Operations on primitive arrays such as sorting, partitioning and permuting. - */ -package org.apache.mahout.math; http://git-wip-us.apache.org/repos/asf/mahout/blob/49ad8cb4/core/src/main/java/org/apache/mahout/math/set/AbstractSet.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/set/AbstractSet.java b/core/src/main/java/org/apache/mahout/math/set/AbstractSet.java deleted file mode 100644 index 7691420..0000000 --- a/core/src/main/java/org/apache/mahout/math/set/AbstractSet.java +++ /dev/null @@ -1,188 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -/* -Copyright 1999 CERN - European Organization for Nuclear Research. -Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose -is hereby granted without fee, provided that the above copyright notice appear in all copies and -that both that copyright notice and this permission notice appear in supporting documentation. -CERN makes no representations about the suitability of this software for any purpose. -It is provided "as is" without expressed or implied warranty. -*/ -package org.apache.mahout.math.set; - -import org.apache.mahout.math.PersistentObject; -import org.apache.mahout.math.map.PrimeFinder; - -public abstract class AbstractSet extends PersistentObject { - //public static boolean debug = false; // debug only - - /** The number of distinct associations in the map; its "size()". */ - protected int distinct; - - /** - * The table capacity c=table.length always satisfies the invariant <tt>c * minLoadFactor <= s <= c * - * maxLoadFactor</tt>, where s=size() is the number of associations currently contained. The term "c * minLoadFactor" - * is called the "lowWaterMark", "c * maxLoadFactor" is called the "highWaterMark". In other words, the table capacity - * (and proportionally the memory used by this class) oscillates within these constraints. The terms are precomputed - * and cached to avoid recalculating them each time put(..) or removeKey(...) is called. - */ - protected int lowWaterMark; - protected int highWaterMark; - - /** The minimum load factor for the hashtable. */ - protected double minLoadFactor; - - /** The maximum load factor for the hashtable. */ - protected double maxLoadFactor; - - // these are public access for unit tests. - public static final int DEFAULT_CAPACITY = 277; - public static final double DEFAULT_MIN_LOAD_FACTOR = 0.2; - public static final double DEFAULT_MAX_LOAD_FACTOR = 0.5; - - /** - * Chooses a new prime table capacity optimized for growing that (approximately) satisfies the invariant <tt>c * - * minLoadFactor <= size <= c * maxLoadFactor</tt> and has at least one FREE slot for the given size. - */ - protected int chooseGrowCapacity(int size, double minLoad, double maxLoad) { - return nextPrime(Math.max(size + 1, (int) ((4 * size / (3 * minLoad + maxLoad))))); - } - - /** - * Returns new high water mark threshold based on current capacity and maxLoadFactor. - * - * @return int the new threshold. - */ - protected int chooseHighWaterMark(int capacity, double maxLoad) { - return Math.min(capacity - 2, (int) (capacity * maxLoad)); //makes sure there is always at least one FREE slot - } - - /** - * Returns new low water mark threshold based on current capacity and minLoadFactor. - * - * @return int the new threshold. - */ - protected int chooseLowWaterMark(int capacity, double minLoad) { - return (int) (capacity * minLoad); - } - - /** - * Chooses a new prime table capacity neither favoring shrinking nor growing, that (approximately) satisfies the - * invariant <tt>c * minLoadFactor <= size <= c * maxLoadFactor</tt> and has at least one FREE slot for the given - * size. - */ - protected int chooseMeanCapacity(int size, double minLoad, double maxLoad) { - return nextPrime(Math.max(size + 1, (int) ((2 * size / (minLoad + maxLoad))))); - } - - /** - * Chooses a new prime table capacity optimized for shrinking that (approximately) satisfies the invariant <tt>c * - * minLoadFactor <= size <= c * maxLoadFactor</tt> and has at least one FREE slot for the given size. - */ - protected int chooseShrinkCapacity(int size, double minLoad, double maxLoad) { - return nextPrime(Math.max(size + 1, (int) ((4 * size / (minLoad + 3 * maxLoad))))); - } - - /** Removes all (key,value) associations from the receiver. */ - public abstract void clear(); - - /** - * Ensures that the receiver can hold at least the specified number of elements without needing to allocate new - * internal memory. If necessary, allocates new internal memory and increases the capacity of the receiver. <p> This - * method never need be called; it is for performance tuning only. Calling this method before <tt>put()</tt>ing a - * large number of associations boosts performance, because the receiver will grow only once instead of potentially - * many times. <p> <b>This default implementation does nothing.</b> Override this method if necessary. - * - * @param minCapacity the desired minimum capacity. - */ - public void ensureCapacity(int minCapacity) { - } - - /** - * Returns <tt>true</tt> if the receiver contains no (key,value) associations. - * - * @return <tt>true</tt> if the receiver contains no (key,value) associations. - */ - public boolean isEmpty() { - return distinct == 0; - } - - /** - * Returns a prime number which is <code>>= desiredCapacity</code> and very close to <code>desiredCapacity</code> - * (within 11% if <code>desiredCapacity >= 1000</code>). - * - * @param desiredCapacity the capacity desired by the user. - * @return the capacity which should be used for a hashtable. - */ - protected int nextPrime(int desiredCapacity) { - return PrimeFinder.nextPrime(desiredCapacity); - } - - /** - * Initializes the receiver. You will almost certainly need to override this method in subclasses to initialize the - * hash table. - * - * @param initialCapacity the initial capacity of the receiver. - * @param minLoadFactor the minLoadFactor of the receiver. - * @param maxLoadFactor the maxLoadFactor of the receiver. - * @throws IllegalArgumentException if <tt>initialCapacity < 0 || (minLoadFactor < 0.0 || minLoadFactor >= 1.0) || - * (maxLoadFactor <= 0.0 || maxLoadFactor >= 1.0) || (minLoadFactor >= - * maxLoadFactor)</tt>. - */ - protected void setUp(int initialCapacity, double minLoadFactor, double maxLoadFactor) { - if (initialCapacity < 0) { - throw new IllegalArgumentException("Initial Capacity must not be less than zero: " + initialCapacity); - } - if (minLoadFactor < 0.0 || minLoadFactor >= 1.0) { - throw new IllegalArgumentException("Illegal minLoadFactor: " + minLoadFactor); - } - if (maxLoadFactor <= 0.0 || maxLoadFactor >= 1.0) { - throw new IllegalArgumentException("Illegal maxLoadFactor: " + maxLoadFactor); - } - if (minLoadFactor >= maxLoadFactor) { - throw new IllegalArgumentException( - "Illegal minLoadFactor: " + minLoadFactor + " and maxLoadFactor: " + maxLoadFactor); - } - } - - /** - * Returns the number of (key,value) associations currently contained. - * - * @return the number of (key,value) associations currently contained. - */ - public int size() { - return distinct; - } - - /** - * Trims the capacity of the receiver to be the receiver's current size. Releases any superfluous internal memory. An - * application can use this operation to minimize the storage of the receiver. <p> This default implementation does - * nothing. Override this method if necessary. - */ - public void trimToSize() { - } - - protected static boolean equalsMindTheNull(Object a, Object b) { - if (a == null && b == null) { - return true; - } - if (a == null || b == null) { - return false; - } - return a.equals(b); - } -} http://git-wip-us.apache.org/repos/asf/mahout/blob/49ad8cb4/core/src/main/java/org/apache/mahout/math/set/HashUtils.java ---------------------------------------------------------------------- diff --git a/core/src/main/java/org/apache/mahout/math/set/HashUtils.java b/core/src/main/java/org/apache/mahout/math/set/HashUtils.java deleted file mode 100644 index f5dfeb0..0000000 --- a/core/src/main/java/org/apache/mahout/math/set/HashUtils.java +++ /dev/null @@ -1,56 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.mahout.math.set; - -/** - * Computes hashes of primitive values. Providing these as statics allows the templated code - * to compute hashes of sets. - */ -public final class HashUtils { - - private HashUtils() { - } - - public static int hash(byte x) { - return x; - } - - public static int hash(short x) { - return x; - } - - public static int hash(char x) { - return x; - } - - public static int hash(int x) { - return x; - } - - public static int hash(float x) { - return Float.floatToIntBits(x) >>> 3 + Float.floatToIntBits((float) (Math.PI * x)); - } - - public static int hash(double x) { - return hash(17 * Double.doubleToLongBits(x)); - } - - public static int hash(long x) { - return (int) ((x * 11) >>> 32 ^ x); - } -}
