I totally agree that it depends on the task at hand and the amount/quality
of the data you can get hold of.

The problem of relevance in traditional document/semantic information
retrieval (IR) tasks is hard because in most cases there is little or no
source of truth you could use as training data (unless you use something
like TREC to evaluate against a limited set of documents).
Additionally, the feedback data you get from users, if it exists, is very
noisy. In this case prior knowledge, encoded as attribute weights, crafted
functions, and heuristics, is your best bet. You can, however, mine the
content itself by leveraging clustering/topic modeling via LDA, which is
an unsupervised learning algorithm, and use that as input. Or perhaps
Labeled-LDA and Multi-Grain LDA, topic models for classification and
sentiment analysis respectively, which are supervised algorithms; in that
case you can still use the approach I suggested.
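To make the LDA idea concrete, here is a minimal sketch (assuming
scikit-learn is available; the tiny corpus and the choice of two topics are
purely illustrative) of turning raw document text into per-document topic
weights that could then feed a ranking function as extra signals:

```python
# Sketch: unsupervised topic features via LDA, usable as ranking signals.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "solr search relevance ranking",
    "lda topic model inference",
    "click logs user behavior signals",
    "search engine query ranking",
]

# Bag-of-words counts, the standard input representation for LDA.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Fit a 2-topic model; each document gets a distribution over topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_weights = lda.fit_transform(counts)

# Each row sums to ~1.0 and can serve as a feature vector for that doc.
print(topic_weights.shape)  # (4, 2)
```

In a search setting you would index these topic weights alongside the
documents and let the scoring function (or learned model) consume them as
additional features.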

However, for search tasks that involve e-commerce, advertisements,
recommendations, etc., there is usually much more data that can be captured
from users' interactions with the system/site and used as signals, and
users' actions (adding things to wish lists, clicking for more info,
conversions, etc.) are much more telling about the intent/value users
assign to what is presented to them. In that setting, viewing search as a
machine learning/multi-objective optimization problem makes sense.
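As a toy illustration of folding those interaction signals into a ranking
score (the signal names, weights, and numbers below are assumptions for the
example, not a production formula; in practice the weights themselves would
be learned from the data):

```python
# Sketch: combining user-interaction signals into one relevance score.
# Fixed hand-picked weights stand in for what a learned model would fit.
SIGNAL_WEIGHTS = {
    "text_match": 1.0,       # baseline query/document similarity
    "click_rate": 2.0,       # clicks for more info
    "wishlist_rate": 3.0,    # adding things to wish lists
    "conversion_rate": 5.0,  # strongest statement of user intent
}

def score(doc_signals):
    """Weighted sum of whatever signals are available for a document."""
    return sum(SIGNAL_WEIGHTS[name] * value
               for name, value in doc_signals.items())

docs = {
    "doc_a": {"text_match": 0.9, "click_rate": 0.05},
    "doc_b": {"text_match": 0.6, "click_rate": 0.20,
              "conversion_rate": 0.02},
}

# doc_b wins despite a weaker text match, because conversions and clicks
# carry more weight than raw textual similarity.
ranked = sorted(docs, key=lambda d: score(docs[d]), reverse=True)
print(ranked)  # ['doc_b', 'doc_a']
```

The multi-objective part enters when you trade such a relevance score off
against other objectives (revenue, diversity, freshness) rather than
optimizing one weighted sum.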

My point is that search engines nowadays are used for all these use cases,
so it is worth exploring all the avenues exposed in this thread.

Cheers,

-- Joaquin

On Mon, May 4, 2015 at 2:31 PM, Tom Burton-West <[email protected]> wrote:

> Hi Doug and Joaquin,
>
> This is a really interesting discussion.  Joaquin, I'm looking forward to
> taking your code for a test drive.  Thank you for making it publicly
> available.
>
> Doug, I'm interested in your pyramid observation. I work with academic
> search, which has some of the problems of unique queries/information needs
> and data sparsity that you mention in your blog post.
>
> This article makes a similar argument that massive amounts of user data
> are so important for modern search engines that it is essentially a barrier
> to entry for new web search engines.
> Usage Data in Web Search: Benefits and Limitations. Ricardo Baeza-Yates and
> Yoelle Maarek.  In Proceedings of SSDBM'2012, Chania, Crete, June 2012.
> http://www.springerlink.com/index/58255K40151U036N.pdf
>
>  Tom
>
>
>> I noticed that information retrieval problems fall into a sort-of layered
>> pyramid. At the topmost point is someone like Google, where the sheer
>> amount of high-quality user behavior data means that search truly is a
>> machine learning problem, much as you propose. As you move down the
>> pyramid, the quality of user data diminishes.
>>
>> Eventually you get to a very thick layer of middle-class search
>> applications that value relevance, but have very modest amounts or no user
>> data. For most of them, even if they tracked their searches over a year,
>> they *might* get good data over their top 50 searches. (I know cause they
>> send me the spreadsheet and say fix it!). The best use they can make of
>> analytics data is after-action troubleshooting. Actual user emails
>> complaining about the search can be more useful than behavior data!
>>
>>
>>
