Actually, I think I would keep train and test out of the fast server.
Training, testing, and training-set management can all be done off-line.

In my book example, I suggested that ZK be used to trigger model loading
and that Thrift (or REST or Avro or whatever) be used just for doing
classification.  A more general structure would allow multiple models to be
loaded at one time and allow multiple examples to be classified in one
request.
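A minimal sketch of what that more general structure might look like, purely for illustration (the names `ModelRegistry`, `register`, and `classifyBatch` are made up here, not any actual Mahout or Thrift API; in the architecture above, ZK watches would drive `register`, and `classifyBatch` would sit behind the RPC layer):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch: hold several models at once and classify a
// whole batch of examples in a single request.
class ModelRegistry {
  private final Map<String, Function<double[], Integer>> models = new HashMap<>();

  // In the proposed setup, a ZK watch firing would call this to load a model.
  public void register(String name, Function<double[], Integer> model) {
    models.put(name, model);
  }

  // One request carries many examples; each is scored by the named model.
  public List<Integer> classifyBatch(String modelName, List<double[]> examples) {
    Function<double[], Integer> model = models.get(modelName);
    if (model == null) {
      throw new IllegalArgumentException("unknown model: " + modelName);
    }
    List<Integer> labels = new ArrayList<>();
    for (double[] x : examples) {
      labels.add(model.apply(x));
    }
    return labels;
  }
}
```

The point of the shape is just that model name and example list are both request parameters, so one running service can serve many models.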

If you add a general enough data structure for a "document" then this might
be a pretty useful service.  I don't think that REST is a good choice at
that point, but Avro or protobufs/netty would be.  Thrift might work well
enough as well, but polymorphism would be nice to have.
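By a "general enough" document I mean roughly this kind of thing, a hypothetical sketch (the `Document` class is invented for illustration): a bag of named fields whose values can vary in type, which is essentially what an Avro generic record gives you, and which is where Thrift's lack of polymorphism pinches.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative "document": named fields with heterogeneous value types
// (text, numbers, nested documents).  Avro generic records model this
// naturally; a fixed Thrift struct would have to enumerate every field
// and type up front.
class Document {
  private final Map<String, Object> fields = new HashMap<>();

  public Document set(String name, Object value) {
    fields.put(name, value);
    return this;  // fluent style for easy construction
  }

  public Object get(String name) {
    return fields.get(name);
  }
}
```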

Regardless of the on-line nature of the SGD models, I don't think that
training in the real-time service is such a great thing.  I could be wrong.



On Mon, Apr 25, 2011 at 3:45 AM, Grant Ingersoll <[email protected]> wrote:

> For classifiers/clustering, it seems like one should be able to start
> simple:
>
> 1. Train
> 2. Test (including cross-validation)
> 3. Classify/Cluster (for each algorithm) both sequentially and on M/R
> (again, could submit files to external resources like Amazon)
> 4. Add/Delete/Update examples (for training and testing)
>
> I realize this is non-trivial and there are a lot of details to work out,
> particularly on the spawning of M/R jobs and the "single data point against
> multiple models" approach, but the rest isn't as bad, I don't think.
>
>