Hi guys, I see Benson working really hard on converting Colt primitive collections to Mahout -- this is great effort, really, since no such library currently exists with an Apache or BSD license.
I wanted to ask you if compatibility with Java Collections is something you consider crucial for a set of collection classes. There are pros and cons of this compatibility. 1) compatibility with the Java Collections API gives you tons of other libraries (Google Collections, for example) which you can use out of the box with primitives, 2) compatibility with the Java Collections API means boxing/unboxing conversions on standard API calls and awkward other methods, should you wish to avoid such conversions, 3) collection classes have many strong contracts, including fast-failing iterators, etc. These are largerly unnecessary for computational code. 4) you may be fond of certain idioms you might have grown accustomed to (subList().clear()). 5) resigning from certain API elements can yield much faster code (resigning from bounds check, exposing the internal implementation of a given type for custom processing). I'm asking because we at Carrot Search have been working on a similar library for managing basic collection types for primitives (and generics), namely: - hash maps (open addressing), - sets (open addressing), - efficient bit set and bit operations (we imported stuff from Lucene), - stacks, dequeues, arrays. Our line of thinking eventually led us to create a library that is MOSTLY API-compatible, but NOT interface compatible. That is, for example, you get put(byte, int) methods on a hash map specialized for byte keys and int values, but this hash map does not implement Map<Byte, Integer>. It is therefore not a general purpose Java Collections replacement, but for computational code we found our implementation very efficient and straightforward. I have the code ready, tested and we'd be willing to contribute this entirely to the Apache foundation. The upside is that it's royalty-free (white box reimplementation of basic data structures). Some of the code was borrowed from Lucene (BitSet) and the method of open addressing is quadratic rather than double-hashing, which was inspired by the work done on Google sparse hashes. I hope Benson won't feel offended -- he's done a great job working on Colt's code, but if you think the above assumptions are fine (the primary one being breaking the compatibility with Java Collections), then perhaps we should apply for a commons sub-project (we currently call this library "high performance primitive collections") and join the efforts under a single umbrella for everyone's benefit? Dawid