On Mon, Jun 21, 2010 at 8:35 PM, Robin Anil <[email protected]> wrote:

> See how this sounds (listing down requirements)
>
> A model can be class with a list of matrices, a list of vectors. Each
> algorithm takes care of naming these matrices/vectors and reading and
> writing values to it (similar to Datastore)
>

I think that this is too restrictive.  I would prefer that models are
essentially opaque blobs in wire or disk format but whatever model you have
can be instantiated using a standard factory.
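
To make that concrete, here is a rough sketch of what I mean by a standard factory over opaque blobs. All of the names here (Model, ModelFactory, LinearModel, the "linear" type tag) are made up for illustration and are not an existing API. The comma-separated blob format is just the simplest thing that round-trips.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch only: none of these types exist in any library.
final class ModelFactorySketch {

    /** Callers see only this; the internals are up to each algorithm. */
    interface Model {
        String type();     // tag used to pick the right loader
        byte[] toBytes();  // opaque wire or disk form
    }

    /** Registry that maps a type tag to a loader that understands that blob. */
    static final class ModelFactory {
        private final Map<String, Function<byte[], Model>> loaders = new HashMap<>();

        void register(String type, Function<byte[], Model> loader) {
            loaders.put(type, loader);
        }

        Model load(String type, byte[] blob) {
            Function<byte[], Model> loader = loaders.get(type);
            if (loader == null) {
                throw new IllegalArgumentException("No loader registered for model type: " + type);
            }
            return loader.apply(blob);
        }
    }

    /** Toy model whose blob is a comma-separated list of weights (assumed format). */
    static final class LinearModel implements Model {
        final double[] weights;

        LinearModel(double[] weights) {
            this.weights = weights.clone();
        }

        static LinearModel fromBytes(byte[] blob) {
            String[] parts = new String(blob, StandardCharsets.UTF_8).split(",");
            double[] w = new double[parts.length];
            for (int i = 0; i < parts.length; i++) {
                w[i] = Double.parseDouble(parts[i]);
            }
            return new LinearModel(w);
        }

        @Override
        public String type() {
            return "linear";
        }

        @Override
        public byte[] toBytes() {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < weights.length; i++) {
                if (i > 0) sb.append(',');
                sb.append(weights[i]);
            }
            return sb.toString().getBytes(StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        ModelFactory factory = new ModelFactory();
        factory.register("linear", LinearModel::fromBytes);

        // Round trip: the factory never needs to know what is inside the blob.
        Model original = new LinearModel(new double[] {0.5, -1.25, 2.0});
        Model restored = factory.load(original.type(), original.toBytes());
        System.out.println(restored.type());
    }
}
```

The framework only handles the type tag and the bytes, so an algorithm is free to store matrices, vectors, trees or anything else.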


> All Classifiers will work with vectors
> All Trainers will work with vectors
>

Yes.

There should also be a standard framework that allows conversion to vectors
without moving vectors across IPC links.


> Multiple techniques to vectorize data.
> - Dictionary based
> - Random hashing based
>

Yes.

Random hashing especially needs to handle interaction variables well.

We need fielded data as well and support for continuous variables in
addition to text-like data.
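
As a minimal sketch of what hashing fielded data could look like, assuming a fixed vector size: the class and method names below are invented, and String.hashCode is only a stand-in for whatever hash function we would really use. Categorical values are hashed as "field=value", continuous values are added at the slot for the field name, and an interaction is hashed as one combined feature.

```java
// Hypothetical sketch: names and the choice of hash are assumptions.
final class HashedVectorizerSketch {
    private final int dim;

    HashedVectorizerSketch(int dim) {
        if (dim <= 0) {
            throw new IllegalArgumentException("dim must be positive");
        }
        this.dim = dim;
    }

    /** Map a feature name to a slot in [0, dim). floorMod keeps negative hashes in range. */
    int slot(String name) {
        return Math.floorMod(name.hashCode(), dim);
    }

    /** Categorical or word-like value in a named field: adds 1 at the slot for "field=value". */
    void addCategorical(double[] v, String field, String value) {
        v[slot(field + "=" + value)] += 1.0;
    }

    /** Continuous variable: adds the value itself at the slot for the field name. */
    void addContinuous(double[] v, String field, double value) {
        v[slot(field)] += value;
    }

    /** Interaction of two categorical values, hashed as a single combined feature. */
    void addInteraction(double[] v, String fieldA, String valueA, String fieldB, String valueB) {
        v[slot(fieldA + "=" + valueA + "&" + fieldB + "=" + valueB)] += 1.0;
    }

    public static void main(String[] args) {
        HashedVectorizerSketch h = new HashedVectorizerSketch(16);
        double[] v = new double[16];
        h.addCategorical(v, "color", "red");
        h.addContinuous(v, "age", 2.5);
        h.addInteraction(v, "color", "red", "size", "L");
        System.out.println(java.util.Arrays.toString(v));
    }
}
```

Note that the interaction needs no dictionary entry of its own, which is the main reason hashing suits interaction variables.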


> A Classifier Training Job will take a Trainer, and a Vector location and
> produce a Model
>

No.  Well, not exclusively, anyway.  We can't be limited to reading vectors
due to the fairly substantial (3x) performance hit that will entail.

I would recommend that a training job will take a Trainer, a Vectorizer, an
InputSource and produce a Model.
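
Roughly, the shape I have in mind is below. The interface names follow the roles above, but the method signatures, the CSV vectorizer and the counting trainer are all invented just to make the sketch run. The point is that each record is vectorized in-process as it is read, so no intermediate vector file is ever written or re-read.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: signatures are assumptions, not an existing API.
final class TrainingJobSketch {

    /** Anything that can hand out raw records (lines, rows, documents). */
    interface InputSource<R> extends Iterable<R> {}

    /** Turns one raw record into a target and a feature vector. */
    interface Vectorizer<R> {
        int target(R record);
        double[] features(R record);
    }

    /** Consumes training examples one at a time and finally yields a model of type M. */
    interface Trainer<M> {
        void train(int target, double[] features);
        M close();
    }

    /** The training job: vectorize on the fly and feed the trainer. */
    static <R, M> M run(Trainer<M> trainer, Vectorizer<R> vectorizer, InputSource<R> input) {
        for (R record : input) {
            trainer.train(vectorizer.target(record), vectorizer.features(record));
        }
        return trainer.close();
    }

    /** Toy vectorizer for lines of the form "target,x1,x2,...". */
    static final class CsvVectorizer implements Vectorizer<String> {
        @Override
        public int target(String line) {
            return Integer.parseInt(line.split(",")[0].trim());
        }

        @Override
        public double[] features(String line) {
            String[] parts = line.split(",");
            double[] v = new double[parts.length - 1];
            for (int i = 1; i < parts.length; i++) {
                v[i - 1] = Double.parseDouble(parts[i].trim());
            }
            return v;
        }
    }

    /** Toy trainer whose "model" is just the number of examples seen per target. */
    static final class CountingTrainer implements Trainer<Map<Integer, Integer>> {
        private final Map<Integer, Integer> counts = new TreeMap<>();

        @Override
        public void train(int target, double[] features) {
            counts.merge(target, 1, Integer::sum);
        }

        @Override
        public Map<Integer, Integer> close() {
            return counts;
        }
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1,0.5,2.0", "0,1.5,0.0", "1,0.1,0.2");
        InputSource<String> input = lines::iterator;
        Map<Integer, Integer> model = run(new CountingTrainer(), new CsvVectorizer(), input);
        System.out.println(model);
    }
}
```

A job that reads pre-built vectors is then just the special case where the Vectorizer is the identity.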

> A Classifier Testing Job will take a Classifier, a Model and a Test Vector
> location and produce statistics
>

Again, need a vectorizer.


> A Classifier Job will take a Classifier, a Model and a vector location and
> label the vectors with probability or likelihood values and return 1 or top
> N labels
>

Again, need a vectorizer.

I think that we should designate a list of preserved fields, which may be no
more than the id, and attach the output to those.  Possible output forms are:

- top k labels (with or without probs)
- all probabilities
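
A small sketch of the top k form, with the id as the only preserved field. The class name, the tab-separated layout and the "label:prob" notation are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: output layout and names are assumptions.
final class ScoredOutputSketch {

    /** One label together with its probability. */
    static final class LabelScore {
        final String label;
        final double prob;

        LabelScore(String label, double prob) {
            this.label = label;
            this.prob = prob;
        }
    }

    /** Return the k most probable labels, highest first. Passing k >= labels.length gives all of them. */
    static List<LabelScore> topK(String[] labels, double[] probs, int k) {
        if (labels.length != probs.length) {
            throw new IllegalArgumentException("labels and probs must have the same length");
        }
        List<LabelScore> all = new ArrayList<>();
        for (int i = 0; i < labels.length; i++) {
            all.add(new LabelScore(labels[i], probs[i]));
        }
        all.sort(Comparator.comparingDouble((LabelScore s) -> s.prob).reversed());
        int n = Math.min(Math.max(k, 0), all.size());
        return new ArrayList<>(all.subList(0, n));
    }

    /** Attach the output to the preserved id field as one tab-separated line. */
    static String format(String id, List<LabelScore> scores, boolean withProbs) {
        StringBuilder sb = new StringBuilder(id);
        for (LabelScore s : scores) {
            sb.append('\t').append(s.label);
            if (withProbs) {
                sb.append(':').append(s.prob);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] labels = {"a", "b", "c"};
        double[] probs = {0.2, 0.7, 0.1};
        System.out.println(format("doc-1", topK(labels, probs, 2), false));
    }
}
```

The "all probabilities" form is the same call with k set to the number of labels and withProbs turned on.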


> Model Storage
> Datastore has a list of matrices and a list of vectors. It can be serialized
> to disk. Or stored on Hbase or any other Hashtable implementation (memcached)
>

I prefer that a model is a blob, preferably somewhat inspectable, as with a
JSON format.
