Doug,

Thanks for your insights. We actually started by trying to build on
features and boosting weights combined with the built-in relevance scoring
<http://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html>.
We also played around with replacing and/or combining the default score
with other computations using the function_score query
<http://www.elastic.co/guide/en/elasticsearch/guide/current/function-score-query.html>,
but as you mentioned in your article, the crux of the problem is *how to
figure out the weights that control each feature's influence*:

"*Once important features are placed in the search engine the final problem
becomes balancing and regulating their influence. Should text-based factors
matter more than sales based factors? Should exact text matches matter more
than synonym-based matches? What about metadata we glean from machine
learning – how much weight should this play*?"
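
For concreteness, the kind of query we were hand-tuning looks roughly like
the sketch below (a minimal example using the official Python elasticsearch
client; the index name, field names and weight values are made up for
illustration, and those weights are exactly what we struggled to set):

    from elasticsearch import Elasticsearch  # official Python client

    es = Elasticsearch()

    # Blend the default text relevance score with hand-tuned feature functions.
    # The factor/scale values below are the weights that are hard to get right.
    query = {
        "query": {
            "function_score": {
                "query": {"match": {"title": "running shoes"}},
                "functions": [
                    # boost by a popularity-style numeric feature
                    {"field_value_factor": {"field": "sales_rank",
                                            "modifier": "log1p",
                                            "factor": 0.3}},
                    # prefer documents near a target price point
                    {"gauss": {"price": {"origin": 50, "scale": 20}}}
                ],
                "score_mode": "sum",      # how the functions combine with each other
                "boost_mode": "multiply"  # how they combine with the text score
            }
        }
    }

    results = es.search(index="products", body=query)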

Furthermore, this only covers cases where the scoring can be represented as
a function of such weights! We felt that this approach was short-sighted, as
some of the problems we are dealing with (e.g. product recommendations,
response prediction, real-time bidding for advertising, etc.) have a very
large feature space, sometimes requiring *dimensionality reduction* (e.g.
matrix factorization techniques) or learning from past actions/feedback
(e.g. clickthrough data, bidding win rates, remaining budget, etc.). All of
this seemed well suited for supervised machine learning tasks such as
prediction based on past training data (classification or regression).
These algorithms usually have an offline model-building phase and an online
evaluation phase that uses the trained model to perform the
prediction/scoring during query evaluation. Additionally, some of the best
algorithms in machine learning (Random Forests, Support Vector Machines,
Deep Learning/Neural Networks, etc.) are not linear combinations of feature
weights and require additional data structures (e.g. trees, support
vectors) to support the computation.
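
To make that offline/online split concrete, here is a minimal sketch (using
scikit-learn as a stand-in for Weka or MLlib; the features, labels and file
path are made up for illustration):

    import pickle
    from sklearn.ensemble import RandomForestClassifier

    # --- Offline phase: build a model from past training data ---
    # Each row is a feature vector (e.g. clicks, bid price, position);
    # each label is the observed outcome (e.g. converted or not).
    X = [[120, 0.40, 1], [3, 0.90, 0], [45, 0.10, 1], [7, 0.70, 0]]
    y = [1, 0, 1, 0]
    model = RandomForestClassifier(n_estimators=50).fit(X, y)
    with open("model.pkl", "wb") as f:
        pickle.dump(model, f)        # hand the trained model to the online side

    # --- Online phase: the evaluator loads the model and scores at query time ---
    with open("model.pkl", "rb") as f:
        evaluator = pickle.load(f)
    doc_features = [80, 0.50, 1]     # features extracted from a candidate document
    score = evaluator.predict_proba([doc_features])[0][1]  # P(positive) as the score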

Since there is no one-size-fits-all predictive algorithm, we architected the
solution so that any algorithm implementing our interface can be used. We
tried this out with algorithms available in Weka
<http://www.cs.waikato.ac.nz/ml/weka/> and Spark MLlib
<https://spark.apache.org/docs/1.2.1/mllib-guide.html> (only linear models
for now) and it worked! In any case, nothing prevents us from leveraging the
text-based analysis of features and the default scoring available within
the plugin and combining them with the results of the prediction.
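
I won't reproduce the actual plugin interface here, but the idea is roughly
the following (a hypothetical Python sketch with made-up names, just to
illustrate the contract):

    from abc import ABC, abstractmethod

    class PredictiveScorer(ABC):
        """Anything that maps a document's feature vector to a score
        can be plugged into the search-time ranking."""

        @abstractmethod
        def score(self, features):
            """Return a relevance score for a single feature vector."""

    class SklearnScorer(PredictiveScorer):
        """Example adapter for any scikit-learn model with predict_proba."""

        def __init__(self, model):
            self.model = model

        def score(self, features):
            return float(self.model.predict_proba([features])[0][1])

    # At query time the plugin would call scorer.score(doc_features) for each
    # candidate document and can blend that with the default text-based score.

A Weka- or MLlib-backed model would just need its own small adapter
implementing the same score() method.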

To demonstrate its general utility, we tested this with datasets available
at the UCI Machine Learning Repository <http://archive.ics.uci.edu/ml/>, but
I have also been using this approach for real-life response
prediction/bidding problems in advertising, and it's very powerful. Of
course, this is not a panacea, as there are still some issues with the
approach, especially on the operational side. Let's keep the conversation
going, as I think we are on to something useful.

-- Joaquin


On Thu, Apr 30, 2015 at 6:26 AM, Doug Turnbull <
[email protected]> wrote:

> Hi Joaquin
>
> Very neat, thanks for sharing,
>
> Viewing search relevance as something akin to a classification problem is
> actually a driving narrative in Taming Search
> <http://manning.com/turnbull>. We generalize the relevance problem as one
> of measuring the similarity between features of content (locations of
> restaurants, price of a product, the words in the body of articles,
> expanded synonyms in articles, etc) and features of a query (the search
> terms, user usage history, any location, etc). What makes search
> interesting is that unlike other classification systems, search has
> built-in similarity systems (largely TF*IDF).
>
> So we actually cut the other direction from your talk. It appears that you
> amend the search engine to change the underlying scoring to be based on
> machine learning constructs. In our book, we work the opposite way. We
> largely enable feature similarity classifications between document and
> query by massaging features into terms and using the built-in TF*IDF or
> another relevant similarity approach.
>
> We feel this plays to the advantages of a search engine. Search engines
> already have some basic text analysis built in. They've also been heavily
> optimized for most forms of text-based similarity. If you can massage text
> such that your TF*IDF similarity reflects a rough proportion of text-based
> features important to your users, this tends to reflect their intuitive
> notions of relevance. A lot of this work involves feature selection, or
> what we term in the book feature modeling: what features should you
> introduce to your documents that can be used to generate good signals at
> ranking time?
>
> You can read more about our thoughts here
> <http://java.dzone.com/articles/solr-and-elasticsearch>.
>
> That all being said, what makes your stuff interesting is when you have
> enough supervised training data over good-enough features. This can be hard
> to do for a broad swath of "middle tier" search applications, but
> increasingly useful as scale goes up. I'd be interested to hear your
> thoughts on this article
> <http://opensourceconnections.com/blog/2014/10/08/when-click-scoring-can-hurt-search-relevance-a-roadmap-to-better-signals-processing-in-search/>
> I wrote about collecting click tracking and other relevance feedback data:
>
> Good stuff! Again, thanks for sharing,
> -Doug
>
>
>
> On Wed, Apr 29, 2015 at 6:58 PM, J. Delgado <[email protected]>
> wrote:
>
>> Here is a presentation on the topic:
>>
>> http://www.slideshare.net/joaquindelgado1/where-search-meets-machine-learning04252015final
>>
>> Search can be viewed as a combination of a) A problem of constraint
>> satisfaction, which is the process of finding a solution to a set of
>> constraints (query) that impose conditions that the variables (fields) must
>> satisfy with a resulting object (document) being a solution in the feasible
>> region (result set), plus b) A scoring/ranking problem of assigning values
>> to different alternatives, according to some convenient scale. This
>> ultimately provides a mechanism to sort various alternatives in the result
>> set in order of importance, value or preference. In particular, scoring in
>> search has evolved from being a document centric calculation (e.g. TF-IDF)
>> proper from its information retrieval roots, to a function that is more
>> context sensitive (e.g. include geo-distance ranking) or user centric (e.g.
>> takes user parameters for personalization) as well as other factors that
>> depend on the domain and task at hand. However, most systems that
>> incorporate machine learning techniques to perform classification or
>> generate scores for these specialized tasks do so as a post retrieval
>> re-ranking function, outside of search! In this talk I show ways of
>> incorporating advanced scoring functions, based on supervised learning and
>> bid scaling models, into popular search engines such as Elasticsearch and
>> potentially Solr. I'll provide practical examples of how to construct such
>> "ML Scoring" plugins in search to generalize the application of a search
>> engine as a model evaluator for supervised learning tasks. This will
>> facilitate the building of systems that can do computational advertising,
>> recommendations and specialized search systems, applicable to many domains.
>>
>> Code to support it (only Elasticsearch for now):
>> https://github.com/sdhu/elasticsearch-prediction
>>
>> -- J
>>
>
>
> --
> *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections,
> LLC | 240.476.9983 | http://www.opensourceconnections.com
> Author: Taming Search <http://manning.com/turnbull> from Manning
> Publications
