Sorry, as I was saying, the machine learning approach is NOT limited to settings with lots of user action data. In fact, having little or no user action data is commonly referred to as the "cold start problem" in recommender systems. In that case, it is useful to exploit content-based similarities as well as context (such as location, time of day, day of week, site section, device type, etc.) to make predictions/score results. This can still be combined with the usual IR-based scoring to keep semantics as the driving force.
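To make that concrete, here is a minimal sketch of such a blend. Everything in it is hypothetical: the feature names, the linear weights, and the ir_score input (e.g. a BM25 score from the engine) are assumptions for illustration, not a real system.

# Hypothetical sketch: blend IR relevance with content similarity and
# context features to score items for a cold-start user.
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def blended_score(item, query_vec, context, ir_score, w=(1.0, 0.5, 0.25)):
    """Linear blend: the IR score stays the dominant (semantic) signal;
    content similarity and context matches act as tie-breakers."""
    content_sim = cosine(item["terms"], query_vec)
    # Context match: fraction of context attributes (device, site
    # section, time of day) under which this item has done well.
    ctx_keys = ("device", "section", "daypart")
    ctx_match = sum(
        context.get(k) in item["popular_contexts"].get(k, ()) for k in ctx_keys
    ) / len(ctx_keys)
    return w[0] * ir_score + w[1] * content_sim + w[2] * ctx_match

item = {"terms": {"jazz": 0.8, "vinyl": 0.4},
        "popular_contexts": {"device": {"mobile"}, "section": {"music"}}}
ctx = {"device": "mobile", "section": "music", "daypart": "evening"}
print(blended_score(item, {"jazz": 1.0}, ctx, ir_score=2.3))

Weighting the IR score highest keeps semantics as the driving force; a learned model could replace the hand-set weights once action data accumulates.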
-J

On Monday, May 4, 2015, J. Delgado <[email protected]> wrote:

> BTW, as I mentioned, the machine learning
>
> On Monday, May 4, 2015, J. Delgado <[email protected]> wrote:
>
>> I totally agree that it depends on the task at hand and the
>> amount/quality of the data you can get hold of.
>>
>> The problem of relevance in traditional document/semantic information
>> retrieval (IR) tasks is so hard because in most cases there is little
>> or no source of truth you could use as training data (unless you use
>> something like TREC for a limited set of documents to evaluate).
>> Additionally, the feedback data you get from users, if it exists, is
>> very noisy. In this case prior knowledge, encoded as attribute weights,
>> crafted functions, and heuristics, is your best bet. You can, however,
>> mine the content itself by leveraging clustering/topic modeling via
>> LDA, which is an unsupervised learning algorithm, and use that as
>> input. Or perhaps Labeled-LDA and Multi-Grain LDA, topic models for
>> classification and sentiment analysis, which are supervised
>> algorithms, in which case you can still use the approach I suggested.
>>
>> However, for search tasks that involve e-commerce, advertisements,
>> recommendations, etc., there tends to be more data that can be
>> captured from users' interactions with the system/site and used as
>> signals, and users' actions (adding things to wish lists, clicking for
>> more info, conversions, etc.) are much more telling about the
>> intention/value users assign to what is presented to them. Then
>> viewing search as a machine learning/multi-objective optimization
>> problem makes sense.
>>
>> My point is that search engines nowadays are used for all these use
>> cases, so it is worth exploring all the avenues raised in this thread.
>>
>> Cheers,
>>
>> -- Joaquin
>>
>> On Mon, May 4, 2015 at 2:31 PM, Tom Burton-West <[email protected]>
>> wrote:
>>
>>> Hi Doug and Joaquin,
>>>
>>> This is a really interesting discussion. Joaquin, I'm looking forward
>>> to taking your code for a test drive. Thank you for making it
>>> publicly available.
>>>
>>> Doug, I'm interested in your pyramid observation. I work with
>>> academic search, which has some of the problems of unique
>>> queries/information needs and data sparsity you mention in your blog
>>> post.
>>>
>>> This article makes a similar argument: that massive amounts of user
>>> data are so important for modern search engines that they are
>>> essentially a barrier to entry for new web search engines.
>>> Usage Data in Web Search: Benefits and Limitations. Ricardo
>>> Baeza-Yates and Yoelle Maarek. In Proceedings of SSDBM'2012, Chania,
>>> Crete, June 2012.
>>> http://www.springerlink.com/index/58255K40151U036N.pdf
>>>
>>> Tom
>>>
>>>> I noticed that information retrieval problems fall into a sort-of
>>>> layered pyramid. At the topmost point is someone like Google, where
>>>> the sheer amount of high-quality user behavior data means that
>>>> search truly is a machine learning problem, much as you propose. As
>>>> you move down the pyramid, the quality of user data diminishes.
>>>>
>>>> Eventually you get to a very thick layer of middle-class search
>>>> applications that value relevance, but have very modest amounts of
>>>> user data or none at all. For most of them, even if they tracked
>>>> their searches over a year, they *might* get good data on their top
>>>> 50 searches. (I know because they send me the spreadsheet and say
>>>> fix it!)
>>>> The best use they can make of analytics data is after-action
>>>> troubleshooting. Actual user emails complaining about the search can
>>>> be more useful than behavior data!
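P.S. For the LDA-as-input idea above, a rough sketch of what I have in mind, using gensim as one possible library (my choice, not implied by the thread); the corpus, topic count, and parameters are made up for illustration:

# Hypothetical sketch: learn topics from the content itself (no labels
# or user feedback needed), then use each document's topic distribution
# as extra input features for scoring/ranking.
from gensim import corpora, models

docs = [
    "solr lucene search relevance ranking".split(),
    "machine learning model training features".split(),
    "user clicks conversions wish list signals".split(),
]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

# Unsupervised LDA over the content; topic count is arbitrary here.
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)

def topic_features(tokens, num_topics=2):
    """Dense topic-distribution vector for one document, usable as
    input features alongside IR scores in a ranking function."""
    dist = lda.get_document_topics(dictionary.doc2bow(tokens),
                                   minimum_probability=0.0)
    vec = [0.0] * num_topics
    for topic_id, prob in dist:
        vec[topic_id] = prob
    return vec

print(topic_features("search ranking features".split()))

The resulting topic vector can then be fed into a blended scorer alongside the usual IR score, as in the sketch at the top of this message.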
