This is what I believe to be a typical learning-to-rank workflow:
1. Create many weak rankers/scorers (a.k.a. feature engineering; in Solr these are queries/function queries).
2. Run those scorers against a ground-truth dataset, generating feature vectors for the top-n results annotated by humans.
3. Train an existing classifier/regressor (e.g. support vector ranking, GBDT, random forest) on those feature vectors to get a ranking model.
4. Export this ranking model back to Solr as a custom ensemble query (a BooleanQuery with custom boost factors for a linear model, or a CustomScoreQuery with a custom scoring function for a non-linear model), push it to the Solr server, register it with a QParser, and push it to production. End of story.
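To make steps 2 and 3 concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the `TRAINING` data, the three feature columns, and the choice of a pairwise perceptron (a simple stand-in for support vector ranking, not what Mahout or Solr provide). Each feature value is meant to be the score one weak ranker gave a document for a query.

```python
# Hypothetical ground truth: query id -> list of (human relevance label, feature vector).
# Each feature is the score of one weak ranker, e.g. [title match, body match, recency].
TRAINING = {
    "q1": [(2, [0.9, 0.2, 0.1]), (1, [0.4, 0.5, 0.3]), (0, [0.1, 0.3, 0.9])],
    "q2": [(1, [0.7, 0.1, 0.5]), (0, [0.2, 0.2, 0.6])],
}


def preference_pairs(data):
    """Yield (better, worse) feature-vector pairs, only within the same query."""
    for docs in data.values():
        for label_a, vec_a in docs:
            for label_b, vec_b in docs:
                if label_a > label_b:
                    yield vec_a, vec_b


def train_pairwise_perceptron(data, epochs=50, lr=0.1):
    """Learn linear weights so that better documents outscore worse ones."""
    pairs = list(preference_pairs(data))
    weights = [0.0] * len(pairs[0][0])
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - w for b, w in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(weights, diff))
            if margin <= 0:  # pair is mis-ordered (or tied): nudge the weights
                weights = [wi + lr * di for wi, di in zip(weights, diff)]
    return weights


def score(weights, vec):
    """Linear ranking score: weighted sum of the weak rankers' scores."""
    return sum(wi * xi for wi, xi in zip(weights, vec))


weights = train_pairwise_perceptron(TRAINING)
```

A linear model is used here because its weights map directly onto per-clause boost factors in step 4; a non-linear model such as GBDT would need a custom scoring function instead.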
But I didn't find this workflow easy to implement with the Mahout-Solr integration (is it discouraged for some reason?). Namely, there is no pipeline from the scorers' results to a Mahout-compatible vector form, and there is no pipeline from a ranking model back to an ensemble query. (I only found the lucene2seq class and the upcoming recommendation support, neither of which quite fits this scenario.) So what's the best practice for easily implementing a real-time learning-to-rank search engine in this case? I've worked at a number of startups, and this kind of application seems to be in high demand. (Remember the Solr-based collaborative filtering model proposed by Dr. Dunning? This is its content-based counterpart.)
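The two missing pieces could look roughly like the sketch below. Both helper functions are hypothetical and are not part of Mahout or Solr. The first writes feature vectors in the SVM-rank/LETOR text format (`label qid:N index:value ...`), which is an assumed interchange format rather than Mahout's own vector form. The second turns linear model weights into a boosted query string using the standard Lucene `^` boost syntax; since boosts are normally expected to be positive, this sketch simply drops non-positive weights, which is a simplification.

```python
def to_svmrank_line(label, qid, features):
    """Format one annotated result as an SVM-rank/LETOR style training line."""
    feats = " ".join(f"{i}:{v:g}" for i, v in enumerate(features, start=1))
    return f"{label} qid:{qid} {feats}"


def to_boosted_query(weights, clauses):
    """Combine weak-ranker clauses into one boosted query string.

    `clauses` are the query strings of the weak rankers, in the same order
    as the model weights. Clauses with non-positive weight are dropped
    (an assumption of this sketch, not a general recommendation).
    """
    parts = []
    for weight, clause in zip(weights, clauses):
        if weight > 0:
            parts.append(f"({clause})^{weight:g}")
    return " ".join(parts)


# Illustrative values only; field names and weights are made up.
line = to_svmrank_line(2, 1, [0.9, 0.2])
query = to_boosted_query([1.5, -0.2, 0.25], ["title:solr", "body:solr", "tags:solr"])
```

The resulting string could then be sent as an ordinary `q` parameter, which avoids writing a custom QParser for the linear case.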
I'd like to streamline this process to make my upcoming work easier. I think Mahout/Solr is the natural choice, given its scalability and the machine learning background of many of its top committers. Can we talk about it at some point?
Yours,
Peng
