I totally agree that it depends on the task at hand and the amount/quality of the data you can get hold of.
The problem of relevancy in a traditional document/semantic information retrieval (IR) task is so hard because in most cases there is little or no source of truth you could use as training data (unless you use something like TREC for a limited set of documents to evaluate). Additionally, the feedback data you get from users, if it exists at all, is very noisy. In this case prior knowledge, encoded as attribute weights, hand-crafted functions, and heuristics, is your best bet. You can, however, mine the content itself by leveraging clustering/topic modeling via LDA, which is an unsupervised learning algorithm, and use the result as input. Or perhaps use Labeled-LDA and Multi-Grain LDA, topic models for classification and sentiment analysis, which are supervised algorithms; in that case you can still use the approach I suggested.

However, for search tasks that involve e-commerce, advertisements, recommendations, etc., there tends to be much more data that can be captured from users' interactions with the system/site and used as signals, and users' actions (adding things to wish lists, clicks for more info, conversions, etc.) are much more telling about the intention/value the user assigns to what is presented to them. In that setting, viewing search as a machine learning/multi-objective optimization problem makes sense.

My point is that search engines nowadays are used for all of these use cases, so it is worth exploring all the avenues exposed in this thread.

Cheers,

-- Joaquin

On Mon, May 4, 2015 at 2:31 PM, Tom Burton-West <[email protected]> wrote:

> Hi Doug and Joaquin,
>
> This is a really interesting discussion. Joaquin, I'm looking forward to
> taking your code for a test drive. Thank you for making it publicly
> available.
>
> Doug, I'm interested in your pyramid observation. I work with academic
> search, which has some of the problems of unique queries/information needs
> and of data sparsity you mention in your blog post.
> This article makes a similar argument: massive amounts of user data are so
> important for modern search engines that they are essentially a barrier to
> entry for new web search engines.
>
> Usage Data in Web Search: Benefits and Limitations. Ricardo Baeza-Yates and
> Yoelle Maarek. In Proceedings of SSDBM'2012, Chania, Crete, June 2012.
> http://www.springerlink.com/index/58255K40151U036N.pdf
>
> Tom
>
>> I noticed that information retrieval problems fall into a sort of layered
>> pyramid. At the topmost point is someone like Google, with such a sheer
>> amount of high-quality user behavior data that search truly is a machine
>> learning problem, much as you propose. As you move down the pyramid, the
>> quality of user data diminishes.
>>
>> Eventually you get to a very thick layer of middle-class search
>> applications that value relevance but have only modest amounts of user
>> data, or none at all. For most of them, even if they tracked their
>> searches over a year, they *might* get good data on their top 50 searches.
>> (I know because they send me the spreadsheet and say "fix it!") The best
>> use they can make of analytics data is after-action troubleshooting.
>> Actual user emails complaining about the search can be more useful than
>> behavior data!
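P.S. For anyone wanting to try the unsupervised route I mentioned above, here is a minimal sketch of mining content with LDA and using the topic mixtures as ranking features. The toy corpus and parameter choices are made up for illustration; it uses scikit-learn rather than any particular search stack.

```python
# Hypothetical sketch: derive unsupervised LDA topic features from raw
# document text, to feed into a hand-crafted scoring function or ranker.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for the documents in your index.
docs = [
    "solr lucene search relevance ranking boost",
    "machine learning model training data features",
    "user clicks conversions wish lists e-commerce signals",
    "topic modeling lda clustering unsupervised corpus",
]

# Bag-of-words counts, then a small LDA fit over them.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Each row is a per-document topic-mixture vector (rows sum to 1);
# these can be stored alongside the document and combined with other
# attribute weights at query time.
topic_features = lda.fit_transform(counts)
print(topic_features.shape)
```

The same `lda.transform(...)` call can then map an incoming query's term counts into the topic space, so query and document can be compared by topic similarity in addition to term overlap.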
