On Apr 25, 2011, at 12:13 PM, Ted Dunning wrote:

> Actually, I think that I would put train and test out of the fast server.
> The train/test/manage training set can all be done off-line.
>
> In my book example, I suggested that ZK be used to cause models to be loaded
> and that Thrift (or rest or Avro or whatever) be used just for doing
> classification.  A more general structure would allow multiple models to be
> loaded at one time and allow multiple examples to be classified in one
> request.
>
> If you add a general enough data structure for a "document" then this might
> be a pretty useful service.  I don't think that rest is a good choice at
> that point, but Avro or protobufs/netty would be.  Thrift might work well
> enough as well, but polymorphism would be nice to have.

I'm not sure we are talking about the same thing.  I'm mostly looking for an API front end that makes it easier to consume Mahout programmatically from anywhere and via any language (in other words, a server that does what ./bin/mahout does but is better organized and accessible).  I get the sense that you are talking more about the underlying implementation that manages everything needed to actually do the work the API front end asks Mahout to do, but perhaps I'm confused.

> Regardless of the on-line nature of the SGD models, I don't think that
> training in the real-time service is such a great thing.  I could be wrong.

No, I don't think you are wrong.  Again, I don't think any of this implies a real-time service; it's just about making it easier for people to use Mahout programmatically.

> On Mon, Apr 25, 2011 at 3:45 AM, Grant Ingersoll <[email protected]> wrote:
>
>> For classifiers/clustering, it seems like one should be able to start
>> simple:
>>
>> 1. Train
>> 2. Test (including cross-validation)
>> 3. Classify/Cluster (for each algorithm) both sequentially and on M/R
>> (again, could submit files to external resources like Amazon)
>> 4. Add/Delete/Update examples (for training and testing)
>>
>> I realize this is non-trivial and there are a lot of details to work out,
>> particularly on the spawning of M/R jobs and the "single data point against
>> multiple models" approach, but the rest isn't as bad, I don't think.
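
To make the four operations in the quoted list concrete, here is a minimal sketch of what such a front end might look like. It is not Mahout's API: the names `MahoutFrontEnd`, `Example`, `MajorityModel`, `put_example`, and so on are all hypothetical, and the "model" is a placeholder that always predicts the most common training label. A real server would presumably hand these calls off to Mahout jobs (sequential or M/R) and expose them over some wire protocol. Cross-validation and clustering are left out. The `classify` call accepts several model names and several examples at once, along the lines of the "multiple models, multiple examples in one request" idea from earlier in the thread.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Example:
    """One labeled example (hypothetical structure, not a Mahout type)."""
    example_id: str
    features: dict
    label: str


@dataclass
class MajorityModel:
    """Placeholder model: always predicts the most common training label."""
    label: str

    def classify(self, features: dict) -> str:
        return self.label


@dataclass
class MahoutFrontEnd:
    """Hypothetical facade; a real server would delegate to Mahout jobs."""
    examples: dict = field(default_factory=dict)
    models: dict = field(default_factory=dict)

    # 4. Add/Delete/Update examples
    def put_example(self, example: Example) -> None:
        # Adds a new example or overwrites one with the same id.
        self.examples[example.example_id] = example

    def delete_example(self, example_id: str) -> None:
        self.examples.pop(example_id, None)

    # 1. Train
    def train(self, model_name: str) -> None:
        if not self.examples:
            raise ValueError("no training examples")
        counts = Counter(e.label for e in self.examples.values())
        self.models[model_name] = MajorityModel(counts.most_common(1)[0][0])

    # 2. Test (plain held-out accuracy; no cross-validation in this sketch)
    def test(self, model_name: str, held_out: list) -> float:
        model = self.models[model_name]
        correct = sum(model.classify(e.features) == e.label for e in held_out)
        return correct / len(held_out)

    # 3. Classify: several models and several examples in one request
    def classify(self, model_names: list, batch: list) -> dict:
        return {
            name: [self.models[name].classify(f) for f in batch]
            for name in model_names
        }
```

A caller would add examples with `put_example`, call `train("some-model")`, and then send a batch such as `classify(["some-model"], [features_1, features_2])`, getting back one list of predictions per model name.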
