Hi guys,

I see Benson working really hard on converting Colt primitive
collections to Mahout -- this is great effort, really, since no such
library currently exists with an Apache or BSD license.

I wanted to ask you if compatibility with Java Collections is
something you consider crucial for a set of collection classes. There
are pros and cons of this compatibility.

1) compatibility with the Java Collections API gives you tons of other
libraries (Google Collections, for example) which you can use out of
the box with primitives,

2) compatibility with the Java Collections API means boxing/unboxing
conversions on standard API calls and awkward other methods, should
you wish to avoid such conversions,

3) collection classes have many strong contracts, including
fast-failing iterators, etc. These are largerly unnecessary for
computational code.

4) you may be fond of certain idioms you might have grown accustomed
to (subList().clear()).

5) resigning from certain API elements can yield much faster code
(resigning from bounds check, exposing the internal implementation of
a given type for custom processing).

I'm asking because we at Carrot Search have been working on a similar
library for managing basic collection types for primitives (and
generics), namely:

- hash maps (open addressing),
- sets (open addressing),
- efficient bit set and bit operations (we imported stuff from Lucene),
- stacks, dequeues, arrays.

Our line of thinking eventually led us to create a library that is
MOSTLY API-compatible, but NOT interface compatible. That is, for
example, you get put(byte, int) methods on a hash map specialized for
byte keys and int values, but this hash map does not implement
Map<Byte, Integer>. It is therefore not a general purpose Java
Collections replacement, but for computational code we found our
implementation very efficient and straightforward.

I have the code ready, tested and we'd be willing to contribute this
entirely to the Apache foundation. The upside is that it's
royalty-free (white box reimplementation of basic data structures).
Some of the code was borrowed from Lucene (BitSet) and the method of
open addressing is quadratic rather than double-hashing, which was
inspired by the work done on Google sparse hashes.

 I hope Benson won't feel offended -- he's done a great job working on
Colt's code, but if you think the above assumptions are fine (the
primary one being breaking the compatibility with Java Collections),
then perhaps we should apply for a commons sub-project (we currently
call this library "high performance primitive collections") and join
the efforts under a single umbrella for everyone's benefit?

Dawid

Reply via email to