On Sat, Oct 3, 2009 at 4:08 PM, Ted Dunning <[email protected]> wrote:

> > What do you mean by both "labels as an idea can be separated from
> matrices"
> > and "matrix operations should be by label rather than by index"?  These
> > sound
> > like contradictory statements to me - the latter means that matrices are
> > inherently
> > tied to labels.
> >
>
> Yeah... I think I wrote that poorly.
>
> Let me try again.
>
> If I have two vectors that I think of as word counts, {"a": 100, "b": 20}
> and {"b": 2, "c": 10}, then I absolutely want to have the dot product be 40
> and not 400.  That is, I want the product of the values for "b".
>
> The simplest way to do this is to index using strings.
>

This right here presupposes that your vectors were created in a way that
makes them a) sparse (as you mentioned), b) intrinsically tied to their
labels, and c) labeled with Strings.  A common case, but is everything
in the world of machine learning like that?
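For the common case Ted describes, a minimal sketch (assuming labels are plain Strings and a sparse vector is just a map) would look something like this - class and method names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: sparse vectors keyed by String labels, so the dot
// product only multiplies values whose labels match ("b" here).
class LabeledVectors {
  static double dot(Map<String, Double> a, Map<String, Double> b) {
    // Iterate over the smaller map for efficiency.
    Map<String, Double> small = a.size() <= b.size() ? a : b;
    Map<String, Double> big = (small == a) ? b : a;
    double sum = 0.0;
    for (Map.Entry<String, Double> e : small.entrySet()) {
      Double v = big.get(e.getKey());
      if (v != null) {
        sum += e.getValue() * v;
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    Map<String, Double> x = new HashMap<String, Double>();
    x.put("a", 100.0);
    x.put("b", 20.0);
    Map<String, Double> y = new HashMap<String, Double>();
    y.put("b", 2.0);
    y.put("c", 10.0);
    System.out.println(LabeledVectors.dot(x, y)); // 20 * 2 = 40.0
  }
}
```

With Ted's example vectors this gives 40, not 400, because only the "b" entries line up.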


> Another way to do this is to always build sparse vectors using coherent
> integer codes.  Thus, if the count for "b" gets put into location 23 in
> one vector, it will get put into 23 in all other vectors, or else they
> will be non-conformable (signaled by a domain exception).  Conversely, no
> two labels should be put into the same location without being identical.
>

I certainly agree on the first point (the labels -> index mapping should be
single-valued), but the latter is, again, common but not required for all
applications: in high enough dimensions, allowing some collisions with
low probability can be fine - pLSI is a good example of how this can work,
and kernelized decompositions are another, I would imagine.  But your
point is taken: typically a 1:1 map is fine, and admittedly, when the map
is simply "toString()" <-> "valueOf()", no Stringy label is needed at all.
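The "allow rare collisions" alternative is essentially the hashing trick: hash each label straight into a fixed, high-dimensional index space instead of maintaining a 1:1 dictionary. A rough sketch (names are illustrative; with short labels like "a"/"b"/"c" and a large dimension there happen to be no collisions, but in general distinct labels may share a slot):

```java
// Sketch of hashed feature vectors: no dictionary at all, just a
// deterministic label -> index hash.  Occasional collisions perturb
// dot products slightly, which can be acceptable in high dimensions.
class HashedVector {
  private final double[] values;

  HashedVector(int dimension) {
    this.values = new double[dimension];
  }

  private int indexOf(String label) {
    // Non-negative hash mod dimension; distinct labels may collide.
    return (label.hashCode() & 0x7fffffff) % values.length;
  }

  void add(String label, double value) {
    values[indexOf(label)] += value;
  }

  double dot(HashedVector other) {
    if (other.values.length != values.length) {
      throw new IllegalArgumentException("non-conformable dimensions");
    }
    double sum = 0.0;
    for (int i = 0; i < values.length; i++) {
      sum += values[i] * other.values[i];
    }
    return sum;
  }
}
```

The trade-off is that the map is no longer invertible: you can't recover the label from the index, so this only suits applications that never need to read labels back out.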


> If our actual implementation of vectors doesn't know about strings, then we
> can build a string dictionary class and a vector wrapper class with a
> reference to a string dictionary.  All operations on the vectors (except
> get/put) would be delegated to the underlying vector operations after a
> check to verify that the string dictionaries for the two wrappers are
> identical.  If the dictionaries are identical then we know that the
> encodings are the same and we don't have to worry about the internals. Get
> and put are special since we have to add string based versions that look up
> the string and use the resulting index.
>
> In this scheme, the vector implementation itself knows nothing about labels
> and yet all operations proceed as if it did.
>

Ok, this is reasonable - but going this way we shouldn't tie the labels to
be Strings - something like a generic

interface LabelDictionary<T> {
  T getLabel(int index);
  int getIndex(T label);
}

allows LabelDictionary<String> as one choice, but leaves room for
flexibility (such as labels which, if they're strings or token n-grams,
keep track of their IDF or number of tokens, or of their underlying type
[i.e. you're doing regression on some model with a lot of numeric
parameters, but pre-normalization they all carried different units]).
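A straightforward implementation of that interface might be a bidirectional map that assigns dense integer codes in order of first appearance - a sketch (the `intern` helper and class name are mine, not anything agreed on):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// The interface from the discussion above.
interface LabelDictionary<T> {
  T getLabel(int index);
  int getIndex(T label);
}

// One possible implementation: a bidirectional map assigning dense
// integer codes in order of first appearance, so the label -> index
// mapping is single-valued and invertible.
class MapLabelDictionary<T> implements LabelDictionary<T> {
  private final Map<T, Integer> toIndex = new HashMap<T, Integer>();
  private final List<T> toLabel = new ArrayList<T>();

  public T getLabel(int index) {
    return toLabel.get(index);
  }

  public int getIndex(T label) {
    Integer index = toIndex.get(label);
    return index == null ? -1 : index;
  }

  // Assign the next free code if the label is new; otherwise return
  // the code it already has.  Convenient while a corpus is being read.
  public int intern(T label) {
    Integer index = toIndex.get(label);
    if (index == null) {
      index = Integer.valueOf(toLabel.size());
      toIndex.put(label, index);
      toLabel.add(label);
    }
    return index.intValue();
  }

  public int size() {
    return toLabel.size();
  }
}
```

A wrapper vector would then hold a reference to one of these and delegate everything but get/put, as Ted describes.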


> Frankly, the difference will be bigger than that because the string
> dictionary needs to be shared and thus concurrency safe.  Because the
> object
> is shared, it wouldn't even be easy to make the dictionary immutable.
>

Well, once it's immutable it's certainly thread-safe, so why exactly would
it be hard to make the dictionary immutable?  You make it once, then seal
it up... I guess you're referring to the fact that in the text-corpus case
you might be building up the dictionary in-process, and thus logically
can't lock it down?
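The "make it once, then seal it up" idea could look something like the following sketch (names are illustrative): interning is synchronized while the dictionary is still being built, and freezing hands back an immutable snapshot that readers can share without any locking.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch of build-then-seal: mutation is synchronized during
// construction; freeze() returns an immutable snapshot that is
// thread-safe by virtue of never changing.
class SealableDictionary {
  private final Map<String, Integer> building =
      new HashMap<String, Integer>();
  private boolean sealed = false;

  synchronized int intern(String label) {
    if (sealed) {
      throw new IllegalStateException("dictionary is sealed");
    }
    Integer index = building.get(label);
    if (index == null) {
      index = Integer.valueOf(building.size());
      building.put(label, index);
    }
    return index.intValue();
  }

  synchronized Map<String, Integer> freeze() {
    sealed = true;
    // Defensive copy wrapped as unmodifiable: readers after this
    // point need no synchronization at all.
    return Collections.unmodifiableMap(
        new HashMap<String, Integer>(building));
  }
}
```

This doesn't solve the logical problem, of course - if the corpus is still streaming in, there's no point at which you can freeze.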

> I wonder how much would break if a matrix only knew how large it was so
> far.  Or, in the case of labeled vectors, if it knew how many elements
> were in the shared dictionary so far.
>
> Seems to me like it might just work well.
>

How much breaks if the matrix doesn't even know anything about its
dimensionality?  I've always put checks like that in just to catch
programmer error: you have lots of vectors flying around, living in
different spaces, and accidentally doing
twentyDSparseVector.dot(tenDSparseVector) gives back normal-looking
floating point answers which lead to hard-to-track-down bugs.  (Which is
why I've sometimes coded up my vector classes as being tied to marker
interfaces - Vector<T extends VectorDomain> - so that my IDE can use
type safety to keep me from being too stupid.  But dealing with
collections of these, and subinterfaces of them, and Matrices of them...
can lead to pretty ugly Java generics code, whose readability is poor
enough to overwhelm the benefits, IMO.)
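For the record, the marker-interface trick is roughly this (class names are made up for the example): the domain is a phantom type parameter, so dotting vectors from different spaces fails to compile instead of returning a silently wrong number.

```java
// Sketch of domain-tagged vectors: the type parameter D carries no
// data, it only exists so the compiler rejects cross-domain dots.
interface VectorDomain {}
final class WordSpace implements VectorDomain {}
final class DocSpace implements VectorDomain {}

class TypedVector<D extends VectorDomain> {
  private final double[] values;

  TypedVector(double... values) {
    this.values = values;
  }

  // Only accepts another vector from the same domain D.
  double dot(TypedVector<D> other) {
    if (other.values.length != values.length) {
      throw new IllegalArgumentException("non-conformable dimensions");
    }
    double sum = 0.0;
    for (int i = 0; i < values.length; i++) {
      sum += values[i] * other.values[i];
    }
    return sum;
  }
}

// TypedVector<WordSpace> w = new TypedVector<WordSpace>(1.0, 2.0);
// TypedVector<DocSpace> d = new TypedVector<DocSpace>(3.0, 4.0);
// w.dot(d);  // does not compile: incompatible type arguments
```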


> > Defining DomainException instead of CardinalityException, to be thrown
> > when the label sets are different, would be a lot better, as long as
> > we're only requiring, say, that you carry around the *name* of the
> > label set, not the full set, if you are working at the lower-level
> > "by index only" APIs.
> >
>
> I think a reference should be sufficient.
>

It seems like you're assuming single-JVM operations here - I mention
name (or any other unique identifier) to take care of the Hadoop case
where multiple types of vectors can live on the same machines as well
as different ones.
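Concretely, what I mean by identity-by-name is something like this sketch (names are illustrative): each dictionary carries a unique id - a name, UUID, or file path - and conformability is checked by comparing ids rather than object references, which survives serialization across JVMs.

```java
// Sketch of identity-by-name: reference equality fails after
// deserialization in another JVM, but a unique id still compares
// correctly, so it works in the Hadoop case.
class NamedDictionary {
  private final String id;

  NamedDictionary(String id) {
    this.id = id;
  }

  void checkConformable(NamedDictionary other) {
    if (!this.id.equals(other.id)) {
      // Stand-in for the DomainException discussed above.
      throw new IllegalArgumentException(
          "domain mismatch: " + this.id + " vs " + other.id);
    }
  }
}
```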


> Experiment 1:
>
> build a simple label wrapper for commons math or jplasma.
>
> Build a few sample apps to find where it binds (say an in-memory word
> counter, cooccurrence computer and simple SVD implementation).
>
> Success would be had if this could be done in a few hours without changing
> the underlying matrix implementation.
>
> Experiment 2:
>
> Modify the sparse matrix implementation used in (1) to be an extensible
> matrix and make the wrapper query the dictionary for size questions.
>

Ok - effectively test it in stages: first labels, then extensible
cardinalities.  Sounds like a reasonable experiment...

  -jake
