Hi Doug and Joaquin,

This is a really interesting discussion. Joaquin, I'm looking forward to taking your code for a test drive. Thank you for making it publicly available.

Doug, I'm interested in your pyramid observation. I work on academic search, which has some of the problems you mention in your blog post: unique queries and information needs, and data sparsity. This article makes a similar argument: massive amounts of user data are so important to modern search engines that they are essentially a barrier to entry for new web search engines.

Usage Data in Web Search: Benefits and Limitations. Ricardo Baeza-Yates and Yoelle Maarek. In Proceedings of SSDBM'2012, Chania, Crete, June 2012.
http://www.springerlink.com/index/58255K40151U036N.pdf

Tom

> I noticed that information retrieval problems fall into a sort-of layered
> pyramid. At the topmost point is someone like Google, where the sheer
> amount of high quality user behavior data means that search truly is a
> machine learning problem, much as you propose. As you move down the pyramid
> the quality of user data diminishes.
>
> Eventually you get to a very thick layer of middle-class search
> applications that value relevance, but have very modest amounts or no user
> data. For most of them, even if they tracked their searches over a year,
> they *might* get good data over their top 50 searches. (I know cause they
> send me the spreadsheet and say fix it!) The best use they can make of
> analytics data is after-action troubleshooting. Actual user emails
> complaining about the search can be more useful than behavior data!
