Actually, I think that I would keep train and test out of the fast server. Training, testing, and training-set management can all be done off-line.
In my book example, I suggested that ZK be used to cause models to be loaded and that Thrift (or REST or Avro or whatever) be used just for doing classification. A more general structure would allow multiple models to be loaded at one time and allow multiple examples to be classified in one request. If you add a general enough data structure for a "document" then this might be a pretty useful service. I don't think that REST is a good choice at that point, but Avro or protobufs/netty would be. Thrift might work well enough as well, but polymorphism would be nice to have.

Regardless of the on-line nature of the SGD models, I don't think that training in the real-time service is such a great thing. I could be wrong.

On Mon, Apr 25, 2011 at 3:45 AM, Grant Ingersoll <[email protected]> wrote:

> For classifiers/clustering, it seems like one should be able to start
> simple:
>
> 1. Train
> 2. Test (including cross-validation)
> 3. Classify/Cluster (for each algorithm) both sequentially and on M/R
>    (again, could submit files to external resources like Amazon)
> 4. Add/Delete/Update examples (for training and testing)
>
> I realize this is non-trivial and there are a lot of details to work out,
> particularly on the spawning of M/R jobs and the "single data point against
> multiple models" approach, but the rest isn't as bad, I don't think.
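To make the "multiple models, multiple examples per request" idea concrete, here is a minimal Java sketch of what the service surface could look like. All names here (ClassifierService, ModelRegistry, the model ids) are hypothetical illustrations, not Mahout APIs; in a real deployment the registry's load() would be driven by a ZooKeeper watch rather than called directly, and the classify() call would sit behind Thrift/Avro rather than be invoked in-process.

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

/** A loaded model. Real models would come from SGD training output;
 *  here a model is just a function from document text to a label. */
interface Model {
    String classify(String document);
}

/** Holds the currently loaded models, keyed by id. In production a
 *  ZooKeeper watch callback (not shown) would invoke load() when a
 *  new model is published. */
class ModelRegistry {
    private final Map<String, Model> models = new ConcurrentHashMap<>();

    void load(String modelId, Model model) {
        models.put(modelId, model);
    }

    Model get(String modelId) {
        Model m = models.get(modelId);
        if (m == null) throw new IllegalArgumentException("no such model: " + modelId);
        return m;
    }
}

/** The classification-only RPC surface: several models applied to
 *  several documents in a single request. */
class ClassifierService {
    private final ModelRegistry registry;

    ClassifierService(ModelRegistry registry) { this.registry = registry; }

    /** Returns labels indexed as result.get(modelIndex).get(docIndex). */
    List<List<String>> classify(List<String> modelIds, List<String> documents) {
        List<List<String>> results = new ArrayList<>();
        for (String id : modelIds) {
            Model m = registry.get(id);
            List<String> labels = new ArrayList<>();
            for (String doc : documents) {
                labels.add(m.classify(doc));
            }
            results.add(labels);
        }
        return results;
    }
}

public class Demo {
    public static void main(String[] args) {
        ModelRegistry registry = new ModelRegistry();
        // Stub models standing in for trained classifiers.
        registry.load("spam-v1", doc -> doc.contains("viagra") ? "spam" : "ham");
        registry.load("lang-v1", doc -> "en");

        ClassifierService svc = new ClassifierService(registry);
        List<List<String>> out = svc.classify(
            Arrays.asList("spam-v1", "lang-v1"),
            Arrays.asList("buy viagra now", "hello world"));
        System.out.println(out);  // [[spam, ham], [en, en]]
    }
}
```

The point of returning a model-by-document matrix is that one wire round-trip can evaluate a single data point against several models (e.g. old and new versions side by side), which is exactly the batching the more general structure is meant to allow.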
