Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Ted Dunning Sat, 03 Oct 2009 16:09:11 -0700

On Sat, Oct 3, 2009 at 1:49 PM, Jake Mannix <[email protected]> wrote:

> On Sat, Oct 3, 2009 at 12:30 PM, Ted Dunning <[email protected]> wrote:
>
> > Labels are the only thing that scares me.  It may be that we really need
> > to figure out a good answer to that in any case so that labels as an idea
> > can be separated from matrices.
> >
> > The real problem is that matrix operations should be by label rather than
> > index.  If we can somehow make the indexes universal, then we should be
> > OK.  One way to do that is to broaden the idea of conformability in
> > matrices to require that they were created using a common label
> > dictionary for the conformable indexes.
> >
>
> What do you mean by both "labels as an idea can be separated from matrices"
> and "matrix operations should be by label rather than by index"?  These
> sound like contradictory statements to me - the latter means that matrices
> are inherently tied to labels.
>

Yeah... I think I wrote that poorly.

Let me try again.

If I have two vectors that I think of as word counts, {"a": 100, "b": 20}
and {"b": 2, "c": 10}, then I absolutely want to have the dot product be 40
and not 400.  That is, I want the product of the values for "b".
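
To make the 40 versus 400 concrete, here is a minimal sketch.  The class and
method names (LabelDotDemo, dotByLabel, dotByIndex) are made up for
illustration and are not part of any existing library.  It assumes each
vector, when indexed by position, is packed densely in its own label order.

```java
import java.util.HashMap;
import java.util.Map;

class LabelDotDemo {
    // Dot product over labels: only labels present in both maps contribute.
    static double dotByLabel(Map<String, Double> x, Map<String, Double> y) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : x.entrySet()) {
            Double other = y.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    // Dot product by position, ignoring what each slot means.
    // Assumes both arrays have the same length.
    static double dotByIndex(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Double> x = new HashMap<>();
        x.put("a", 100.0);
        x.put("b", 20.0);
        Map<String, Double> y = new HashMap<>();
        y.put("b", 2.0);
        y.put("c", 10.0);

        // Only "b" is shared: 20 * 2
        System.out.println(dotByLabel(x, y));  // 40.0

        // Packed as [a, b] and [b, c]: 100 * 2 + 20 * 10
        System.out.println(dotByIndex(new double[] {100, 20}, new double[] {2, 10}));  // 400.0
    }
}
```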

The simplest way to do this is to index using strings.

Another way to do this is to always build sparse vectors using coherent
integer codes.  Thus, if the count for "b" gets put into location 23 in one
vector, it will get put into 23 in all other vectors, or else they will be
non-conformable due to a domain exception.  Conversely, no two labels should
be put into the same location unless they are identical.
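
A rough sketch of what coherent integer codes could look like, assuming a
single shared code table.  LabelCodes and CoherentCodesDemo are hypothetical
names, and the sparse vectors are modelled as plain index-to-value maps
rather than any real vector class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: hands out one stable integer code per distinct label.
class LabelCodes {
    private final Map<String, Integer> codes = new HashMap<>();

    // Returns the existing code for a label, or assigns the next unused one.
    int codeFor(String label) {
        Integer code = codes.get(label);
        if (code == null) {
            code = codes.size();
            codes.put(label, code);
        }
        return code;
    }
}

class CoherentCodesDemo {
    // Builds a sparse vector (index -> value) using the shared codes.
    static Map<Integer, Double> encode(LabelCodes dict, String[] labels, double[] counts) {
        Map<Integer, Double> v = new HashMap<>();
        for (int i = 0; i < labels.length; i++) {
            v.put(dict.codeFor(labels[i]), counts[i]);
        }
        return v;
    }

    // Index-only dot product; it never sees a label.
    static double dot(Map<Integer, Double> x, Map<Integer, Double> y) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : x.entrySet()) {
            Double other = y.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        LabelCodes dict = new LabelCodes();
        Map<Integer, Double> x = encode(dict, new String[] {"a", "b"}, new double[] {100, 20});
        Map<Integer, Double> y = encode(dict, new String[] {"b", "c"}, new double[] {2, 10});
        // "b" landed in the same location in both vectors, so only it contributes.
        System.out.println(dot(x, y));  // 40.0
    }
}
```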

If our actual implementation of vectors doesn't know about strings, then we
can build a string dictionary class and a vector wrapper class with a
reference to a string dictionary.  All operations on the vectors (except
get/put) would be delegated to the underlying vector operations after a
check to verify that the string dictionaries for the two wrappers are
identical.  If the dictionaries are identical then we know that the
encodings are the same and we don't have to worry about the internals. Get
and put are special since we have to add string based versions that look up
the string and use the resulting index.

In this scheme, the vector implementation itself knows nothing about labels
and yet all operations proceed as if it did.
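
Roughly, the wrapper could look like the sketch below.  Everything here is
hypothetical: PlainSparseVector stands in for whatever index-only vector
implementation is underneath, StringDictionary and LabeledVector are invented
names, and IllegalArgumentException stands in for the domain exception.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for an index-only sparse vector; it has no notion of labels.
class PlainSparseVector {
    private final Map<Integer, Double> values = new HashMap<>();

    double get(int index) {
        Double v = values.get(index);
        return v == null ? 0.0 : v;
    }

    void put(int index, double value) {
        values.put(index, value);
    }

    double dot(PlainSparseVector other) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : values.entrySet()) {
            sum += e.getValue() * other.get(e.getKey());
        }
        return sum;
    }
}

// Hypothetical shared dictionary mapping labels to indexes.
class StringDictionary {
    private final Map<String, Integer> index = new HashMap<>();

    // Note: looking up an unseen label registers it; a real design might
    // want a separate read-only lookup.
    int lookup(String label) {
        Integer i = index.get(label);
        if (i == null) {
            i = index.size();
            index.put(label, i);
        }
        return i;
    }
}

// Wrapper: string-based get/put, everything else delegated by index.
class LabeledVector {
    private final StringDictionary dictionary;
    private final PlainSparseVector delegate = new PlainSparseVector();

    LabeledVector(StringDictionary dictionary) {
        this.dictionary = dictionary;
    }

    double get(String label) {
        return delegate.get(dictionary.lookup(label));
    }

    void put(String label, double value) {
        delegate.put(dictionary.lookup(label), value);
    }

    double dot(LabeledVector other) {
        // Reference comparison: the same dictionary object means the same encoding.
        if (dictionary != other.dictionary) {
            throw new IllegalArgumentException("vectors use different label dictionaries");
        }
        return delegate.dot(other.delegate);
    }
}
```

With one shared StringDictionary, putting {"a": 100, "b": 20} and
{"b": 2, "c": 10} into two LabeledVector instances and calling dot gives 40,
while mixing vectors built on different dictionaries fails fast.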

> I worry about the performance of the current api if we encouraged people to
> always address values in a Vector via get(String label) (which seems to be
> what you're implying if we encourage always using labels not indices).
> What could be a method call and an array access (getQuick(index)) is
> instead a method call, a HashMap get(String), another method call, a
> bounds-check, and then an array lookup.  Maybe the JIT is smart enough to
> handle most of this, but I'd be surprised if there wasn't a difference here.
>

Frankly, the difference will be bigger than that because the string
dictionary needs to be shared and thus concurrency safe.  Because the object
is shared, it wouldn't even be easy to make the dictionary immutable.
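
For illustration only, one way a shared dictionary might be made safe for
concurrent use in current Java is sketched below.  ConcurrentStringDictionary
is a hypothetical name, and this is just one possible approach; the point is
that every label lookup now goes through a concurrent map rather than a plain
array access.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical dictionary that several vectors (and threads) can share.
class ConcurrentStringDictionary {
    private final ConcurrentHashMap<String, Integer> index = new ConcurrentHashMap<>();
    private final AtomicInteger next = new AtomicInteger();

    // computeIfAbsent is atomic per key, so each label gets exactly one code
    // even when several threads ask for it at the same time.
    int lookup(String label) {
        return index.computeIfAbsent(label, k -> next.getAndIncrement());
    }

    // Number of labels registered so far.
    int size() {
        return index.size();
    }
}
```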

I would actually still recommend using get/put based on labels rather than
integers, and would also recommend that people use higher level operations
for the most part.

> > Another issue is that some matrices are essentially unbounded (or we do
> > not know the bounds). ...
>
> I'm totally down with you on this one - the current setup where Matrix and
> Vector impls are required to know their final dimensionality at
> construction I certainly find pretty constraining: it requires that I make
> one full pass through my data just to measure how big everything is.
>

I wonder how much would break if a matrix only knew how large it was so
far.  Or, in the case of labeled vectors, if it knew how many elements were
in the shared dictionary so far.

Seems to me like it might just work well.
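
As a sketch of the "size so far" idea, assuming a dense backing array that
grows on demand (GrowableVector is a hypothetical name, not an existing
class):

```java
import java.util.Arrays;

// Hypothetical vector that only knows how large it is so far.
class GrowableVector {
    private double[] values = new double[4];
    private int sizeSoFar = 0;

    void put(int index, double value) {
        if (index < 0) {
            throw new IndexOutOfBoundsException("negative index: " + index);
        }
        if (index >= values.length) {
            // Grow to fit the new index, at least doubling the capacity.
            values = Arrays.copyOf(values, Math.max(index + 1, values.length * 2));
        }
        values[index] = value;
        sizeSoFar = Math.max(sizeSoFar, index + 1);
    }

    // Anything beyond the current size reads as zero instead of failing.
    double get(int index) {
        if (index < 0) {
            throw new IndexOutOfBoundsException("negative index: " + index);
        }
        return index < sizeSoFar ? values[index] : 0.0;
    }

    int size() {
        return sizeSoFar;
    }

    // Vectors of different current sizes are still conformable; the missing
    // tail of the shorter one is treated as zeros.
    double dot(GrowableVector other) {
        int n = Math.min(sizeSoFar, other.sizeSoFar);
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += values[i] * other.values[i];
        }
        return sum;
    }
}
```

The design choice here is that size mismatches stop being an error at all,
which is where I would expect existing code that checks cardinality to bind.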

> Defining DomainException instead of CardinalityException, to be thrown when
> the label sets are different, would be a lot better, as long as we're only
> requiring, say, that you carry around the *name* of the label set, not the
> full set, if you are working at the lower level "by index only" apis.
>

I think a reference should be sufficient.

> > > What are people's inclinations on this?
> > >
> >
> > Try an experiment?
> >
>
> What kind of experiment?  There are a lot of ideas thrown around -
> relationship between labels and matrices, using CommonsMath underlying apis
> and implementations, separating Writable from Vector/Matrix, unbinding
> cardinalities from instantiation...
>

Experiment 1:

Build a simple label wrapper for commons math or jplasma.

Build a few sample apps to find where it binds (say an in-memory word
counter, cooccurrence computer and simple SVD implementation).

Success would be had if this could be done in a few hours without changing
the underlying matrix implementation.

Experiment 2:

Modify the sparse matrix implementation used in (1) to be an extensible
matrix and make the wrapper query the dictionary for size questions.
