On Mon, Jun 21, 2010 at 8:35 PM, Robin Anil <[email protected]> wrote:
> See how this sound(listing down requirements)
>
> A model can be class with a list of matrices, a list of vectors. Each
> algorithm takes care of naming these matrices/vectors and reading and
> writing values to it (similar to Datastore)
>

I think that this is too restrictive. I would prefer that models be
essentially opaque blobs in wire or disk format, with any model
instantiable through a standard factory.

> All Classifiers will work with vectors
> All Trainers will work with vectors
>

Yes. There should also be a standard framework that allows conversion to
vectors without moving vectors across IPC links.

> Multiple techniques to vectorize data.
> - Dictionary based
> - Random hashing based
>

Yes. Random hashing especially needs to handle interaction variables well.
We need fielded data as well, and support for continuous variables in
addition to text-like data.

> A Classifier Training Job will take a Trainer, and a Vector location and
> produce a Model
>

No. Well, not exclusively, anyway. We can't be limited to reading vectors
due to the fairly substantial (3x) performance hit that will entail. I
would recommend that a training job take a Trainer, a Vectorizer, and an
InputSource, and produce a Model.

> A Classifier Testing Job will take a Classifier, a Model and a Test Vector
> location and produce statistics
>

Again, need a vectorizer.

> A Classifier Job will take a Classifier, a Model and a vector location and
> label the vectors with probability or likelihood values and return 1 or top
> N labels
>

Again, need a vectorizer. I think that we should designate a list of
preserved fields, which may be no more than the id, and the output should
be attached to those fields. Possible forms are:

- top k labels (with or without probs)
- all probabilities

> Model Storage
> Datastore has a list of matrices and a list of vectors. It can be
> serialized
> to disk. Or stored on Hbase or any other Hashtable
> implementation(memcached)
>

I prefer that a model be a blob, preferably somewhat inspectable, such as
with JSON formats.
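
To make the proposed training-job shape concrete (a Trainer, a Vectorizer, and an InputSource producing a Model that is an opaque, inspectable blob), here is a minimal sketch. Every name in it (Vectorizer, InputSource, Labeled, Trainer, Model, HashingVectorizer, SumTrainer, runTrainingJob) is hypothetical and chosen for illustration; none of it is existing Mahout API, and the toy trainer is only a placeholder for a real learning algorithm.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class TrainingJobSketch {
    /** Hypothetical: converts a raw record to a feature vector in-process. */
    interface Vectorizer<R> { double[] vectorize(R record); }

    /** Hypothetical: a raw record paired with its (binary) label. */
    static final class Labeled<R> {
        final R record;
        final int label;
        Labeled(R record, int label) { this.record = record; this.label = label; }
    }

    /** Hypothetical: a source of raw, not yet vectorized, labeled records. */
    interface InputSource<R> extends Iterable<Labeled<R>> {}

    /** Hypothetical: a model is an opaque blob in wire or disk format. */
    interface Model { byte[] toBytes(); }

    /** Hypothetical: consumes vectors one at a time, then emits a model. */
    interface Trainer {
        void train(double[] features, int label);
        Model finish();
    }

    /** Random-hashing vectorizer for whitespace-separated text (illustrative). */
    static final class HashingVectorizer implements Vectorizer<String> {
        private final int size;
        HashingVectorizer(int size) { this.size = size; }

        @Override
        public double[] vectorize(String text) {
            double[] v = new double[size];
            for (String token : text.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    // floorMod keeps the index non-negative for negative hash codes
                    v[Math.floorMod(token.hashCode(), size)] += 1.0;
                }
            }
            return v;
        }
    }

    /** Toy trainer: adds features for label 1, subtracts them otherwise. */
    static final class SumTrainer implements Trainer {
        private final double[] weights;
        SumTrainer(int size) { this.weights = new double[size]; }

        @Override
        public void train(double[] features, int label) {
            double sign = (label == 1) ? 1.0 : -1.0;
            for (int i = 0; i < weights.length; i++) {
                weights[i] += sign * features[i];
            }
        }

        @Override
        public Model finish() {
            // Serialize as JSON text so the blob stays human-inspectable.
            String json = "{\"weights\":" + Arrays.toString(weights) + "}";
            return () -> json.getBytes(StandardCharsets.UTF_8);
        }
    }

    /** The job vectorizes each record as it is read; no vector files are needed. */
    static <R> Model runTrainingJob(Trainer trainer, Vectorizer<R> vectorizer,
                                    InputSource<R> input) {
        for (Labeled<R> example : input) {
            trainer.train(vectorizer.vectorize(example.record), example.label);
        }
        return trainer.finish();
    }

    public static void main(String[] args) {
        List<Labeled<String>> data = Arrays.asList(
            new Labeled<>("good great fine", 1),
            new Labeled<>("bad awful", 0));
        InputSource<String> input = data::iterator;
        Model model = runTrainingJob(new SumTrainer(16), new HashingVectorizer(16), input);
        System.out.println(new String(model.toBytes(), StandardCharsets.UTF_8));
    }
}
```

Because the job receives the Vectorizer rather than a vector location, vectorization happens in the same process that reads the input, and callers only ever see the model as bytes.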
