Finally coming back to this... On Apr 8, 2011, at 3:35 PM, Frank Scholten wrote:
> On Mon, Apr 4, 2011 at 4:43 PM, Grant Ingersoll <[email protected]> wrote:
>> I've always wondered why Collections doesn't show up in IntelliJ when I
>> point it at the top level POM.
>>
>> FWIW, you are active in the community in terms of discussions, etc.
>> Emeritus isn't solely about code.
>>
>> Longer term, I'd love us to have a REST based front end for Mahout.
>> Submit jobs over REST, run them in the cluster or on EMR, etc. Ted's
>> Thrift server in Mahout in Action might be the basis for one. That's
>> maybe more than "plumbing repair"
>
> REST interface would be really cool indeed. Maybe Restlet could be
> used for that? See www.restlet.org

Yeah, that's what I was thinking, and I've got the start of one thanks to a really long plane ride (mostly b/c I've been meaning to learn Restlet). The thing is, our cmd line interface is nice (although it's getting a bit cumbersome in terms of the number of items in it -- just try doing ./mahout --help to see what I mean), but most people want programmatic access, and the APIs are not exactly clear-cut or well documented.

<aside>
> This makes a great example because it shows all the kinds of plumbing
> needed for a classifier farm. It isn't easily generalized because the
> actual classification operation needed is not very general. In real life,
> people need to send 1000 classification requests at once. Or they need to
> run a single data point against 1000 models. Or ... stuff.

Ted, would it make sense to start w/ the first case, i.e. a high volume server for classifying? We don't have to solve the world's problems all at once. That one seems a little more realistic and easier to do: you have a model, it's loaded up, requests come in, you classify them, lather, rinse, repeat.
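To make that concrete, here's a minimal sketch of the "model loaded, requests come in, classify them, repeat" loop. It uses the JDK's built-in com.sun.net.httpserver rather than Restlet just to keep it self-contained, and the Model interface plus the toy spam rule are stand-ins for a real trained classifier -- none of this exists in Mahout today:

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Stand-in for a trained model already loaded in memory; in the real thing
// this would wrap something like a Mahout logistic regression or naive Bayes.
interface Model {
    String classify(String datapoint);
}

public class ClassifierServer {
    private final HttpServer server;

    public ClassifierServer(int port, Model model) throws IOException {
        server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/classify", (HttpExchange ex) -> {
            // Read the data point from the request body, classify it,
            // write the label back: lather, rinse, repeat.
            String body = new String(ex.getRequestBody().readAllBytes(),
                                     StandardCharsets.UTF_8);
            byte[] label = model.classify(body).getBytes(StandardCharsets.UTF_8);
            ex.sendResponseHeaders(200, label.length);
            try (OutputStream os = ex.getResponseBody()) {
                os.write(label);
            }
        });
    }

    public void start() { server.start(); }
    public void stop()  { server.stop(0); }
    public int port()   { return server.getAddress().getPort(); }

    public static void main(String[] args) throws IOException {
        // Toy model: "spam" if the text mentions viagra, else "ham".
        Model toy = d -> d.contains("viagra") ? "spam" : "ham";
        new ClassifierServer(8080, toy).start();
    }
}
```

Swapping in Restlet (or handling Ted's batch-of-1000 case) would change the plumbing, but the shape -- one long-lived process holding the model, stateless classify requests against it -- stays the same.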
</aside>

Back to Restlets: I started with the recommenders, but I think it can be extended to clustering and classification, and that it can account for the different approaches Ted described in the earlier email, though it will take a bit to get there.

For the recommenders, the start of what I have can (or will be able to), amongst other things, input ratings, get recommendations, delete ratings, etc. It still needs a configuration framework for storing settings (such as which recommender to use, how to persist the model, etc.). It would also be cool if it could replicate/distribute the model, perhaps using a distributed data store like Cassandra. I can see this also doing other things, like spawning the offline Hadoop jobs and then loading up the model when done (possibly even on Amazon, using the EC2/EMR SDK).

For classifiers/clustering, it seems like one should be able to start simple:

1. Train
2. Test (including cross-validation)
3. Classify/Cluster (for each algorithm), both sequentially and on M/R (again, this could submit files to external resources like Amazon)
4. Add/Delete/Update examples (for training and testing)

I realize this is non-trivial and there are a lot of details to work out, particularly around spawning the M/R jobs and the "single data point against multiple models" approach, but the rest isn't as bad, I don't think. I haven't thought about the other pieces yet, like frequent pattern set mining, but that should also work, especially once we have a framework in place for submitting jobs, etc.

-Grant
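For the recommender side, here's a toy in-memory sketch of the three operations the resource would expose (input a rating, delete a rating, get recommendations). The class name, the item-popularity "recommender," and the whole store are placeholders I made up for illustration -- the real version would delegate to a Taste Recommender backed by a persistent DataModel:

```java
import java.util.*;
import java.util.stream.*;

// Toy backing store for the three REST operations: put a rating, delete a
// rating, recommend items. Everything here is a placeholder; the real
// implementation would sit on top of Mahout's Taste APIs.
public class RatingStore {
    // user -> (item -> rating)
    private final Map<Long, Map<Long, Float>> ratings = new HashMap<>();

    public void putRating(long user, long item, float value) {
        ratings.computeIfAbsent(user, u -> new HashMap<>()).put(item, value);
    }

    public void deleteRating(long user, long item) {
        Map<Long, Float> byItem = ratings.get(user);
        if (byItem != null) byItem.remove(item);
    }

    // Placeholder "recommender": the most-rated items this user hasn't
    // rated yet, which is nothing like a real similarity-based recommender.
    public List<Long> recommend(long user, int howMany) {
        Set<Long> seen = ratings.getOrDefault(user, Map.of()).keySet();
        Map<Long, Long> counts = ratings.values().stream()
            .flatMap(m -> m.keySet().stream())
            .filter(item -> !seen.contains(item))
            .collect(Collectors.groupingBy(i -> i, Collectors.counting()));
        return counts.entrySet().stream()
            .sorted(Map.Entry.<Long, Long>comparingByValue().reversed())
            .limit(howMany)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

The point is just the surface area: three small, stateless operations that map cleanly onto PUT/DELETE/GET resources, with all the interesting configuration (which recommender, model persistence) living behind them.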
