BTW, as I mentioned, the machine learning

On Monday, May 4, 2015, J. Delgado <[email protected]> wrote:
> I totally agree that it depends on the task at hand and the amount/quality
> of the data that you can get hold of.
>
> The problem of relevancy in the traditional document/semantic information
> retrieval (IR) task is such a hard thing because in most cases there is
> little or no source of truth you could use as training data (unless you
> use something like TREC for a limited set of documents to evaluate).
> Additionally, the feedback data you get from users, if it exists, is very
> noisy. In this case prior knowledge, encoded as attribute weights, crafted
> functions, and heuristics, is your best bet. You can, however, mine the
> content itself by leveraging clustering/topic modeling via LDA, which is
> an unsupervised learning algorithm, and use that as input (a rough sketch
> of this idea appears below the quoted thread). Or perhaps Labeled-LDA and
> Multi-Grain LDA, another topic model for classification and sentiment
> analysis, which are supervised algorithms, in which case you can still
> use the approach I suggested.
>
> However, for search tasks that involve e-commerce, advertisements,
> recommendations, etc., there seems to be more data that can be captured
> from users' interactions with the system/site and used as signals, and
> users' actions (adding things to wish lists, clicks for more info,
> conversions, etc.) are much more telling about the intention/value the
> user gives to what is presented to them. Then viewing search as a machine
> learning/multi-objective optimization problem makes sense (also sketched
> below).
>
> My point is that search engines are nowadays used for all these use
> cases, so it is worth exploring all the avenues exposed in this thread.
>
> Cheers,
>
> -- Joaquin
>
> On Mon, May 4, 2015 at 2:31 PM, Tom Burton-West <[email protected]> wrote:
>
>> Hi Doug and Joaquin,
>>
>> This is a really interesting discussion. Joaquin, I'm looking forward to
>> taking your code for a test drive. Thank you for making it publicly
>> available.
>>
>> Doug, I'm interested in your pyramid observation. I work with academic
>> search, which has some of the problems of unique queries/information
>> needs and of data sparsity that you mention in your blog post.
>>
>> This article makes a similar argument: massive amounts of user data are
>> so important for modern search engines that they are essentially a
>> barrier to entry for new web search engines.
>> Usage Data in Web Search: Benefits and Limitations. Ricardo Baeza-Yates
>> and Yoelle Maarek. In Proceedings of SSDBM'2012, Chania, Crete, June
>> 2012. http://www.springerlink.com/index/58255K40151U036N.pdf
>>
>> Tom
>>
>>> I noticed that information retrieval problems fall into a sort of
>>> layered pyramid. At the topmost point is someone like Google, where the
>>> sheer amount of high-quality user behavior data means that search truly
>>> is a machine learning problem, much as you propose. As you move down
>>> the pyramid, the quality of user data diminishes.
>>>
>>> Eventually you get to a very thick layer of middle-class search
>>> applications that value relevance but have very modest amounts of user
>>> data, or none at all. For most of them, even if they tracked their
>>> searches over a year, they *might* get good data on their top 50
>>> searches. (I know because they send me the spreadsheet and say "fix
>>> it!") The best use they can make of analytics data is after-action
>>> troubleshooting. Actual user emails complaining about the search can be
>>> more useful than behavior data!
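
To make the LDA suggestion in Joaquin's message concrete, here is a minimal
sketch, assuming the gensim library; the toy corpus, topic count, and all
variable names are illustrative, not anything from the thread.

    # Minimal sketch (assumes gensim): turn documents into LDA topic
    # distributions that can serve as unsupervised input features.
    from gensim import corpora, models

    # Toy tokenized corpus -- purely illustrative.
    docs = [
        ["search", "relevance", "ranking", "query"],
        ["topic", "model", "lda", "inference"],
        ["click", "conversion", "user", "signal"],
        ["query", "ranking", "click", "user"],
    ]
    dictionary = corpora.Dictionary(docs)
    bows = [dictionary.doc2bow(d) for d in docs]

    lda = models.LdaModel(bows, num_topics=2, id2word=dictionary, passes=20)

    # Each document's topic mixture is a dense vector that can be used
    # as a ranking feature alongside hand-crafted attribute weights.
    for bow in bows:
        print(lda.get_document_topics(bow, minimum_probability=0.0))

The resulting topic vectors can be stored alongside the documents and
combined with hand-tuned weights in the relevance function, which is the
"use that as input" step Joaquin describes.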

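And for the e-commerce/interaction-signal point, a minimal pairwise
learning-to-rank sketch, assuming scikit-learn; the feature names
(text score, price match, popularity) and the click labels are fabricated
for illustration, and pairwise logistic regression is just one simple way
to cast ranking as a learning problem, not a method anyone in the thread
prescribes.

    # Minimal pairwise learning-to-rank sketch (assumes scikit-learn).
    # Feature vectors and click labels are made up for illustration.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # One query's results: columns are [text_score, price_match, popularity].
    X = np.array([
        [0.9, 1.0, 0.2],
        [0.7, 0.0, 0.9],
        [0.4, 1.0, 0.5],
        [0.2, 0.0, 0.1],
    ])
    clicked = np.array([1, 0, 1, 0])  # which results the user engaged with

    # Pairwise transform: learn from (clicked, skipped) result pairs.
    pairs, labels = [], []
    for i in range(len(X)):
        for j in range(len(X)):
            if clicked[i] != clicked[j]:
                pairs.append(X[i] - X[j])
                labels.append(int(clicked[i] > clicked[j]))

    model = LogisticRegression().fit(np.array(pairs), labels)

    # The learned coefficients act like the attribute weights Joaquin
    # mentions, but estimated from behavior instead of hand-tuned.
    print(model.coef_)

Sorting a result set by model.decision_function(X) then ranks it with the
learned weights; with only a handful of tracked queries, though, this is
exactly the regime Doug's pyramid warns about, where such estimates are
too noisy to trust.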