[ 
https://issues.apache.org/jira/browse/SOLR-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated SOLR-12812:
-------------------------------
    Description: 
In this issue I would like to add support for a streaming MLT case in which the 
content of the request explicitly specifies the field:text pairs to be used for 
the MLT lookup.

Below is a longer explanation, based on a real use case, of why the current 
solutions do not work.

Let's say a Solr instance has multiple cores (collections of documents). We'd 
like to search for similar documents across these cores. Let's assume each 
collection of documents has three fields: title, summary and abstract.

At the moment Solr has two MLT handler options: the query-based (similarity to 
an indexed document) and the free-text based (similarity to an arbitrary text).

1) The first MLT pipeline in Solr looks for documents similar to the given one 
(I'll assume a single document as input, to keep things simple). This
pipeline reads the content of the document from the existing index and creates
a mapping between fields and actual values stored in that document.
Let's say the document looks like this:

title: foo bar
summary: baz bar
abstract: ping ping

The "interesting term" extraction routine in MoreLikeThisHelper will extract 
those terms and score them against each field's statistics, then take the 
top-N best scoring terms (and the fields they're assigned to) and create a 
Boolean query from them. It could go something like this:

title:foo^1.5 summary:bar^0.5

When this query is applied against the collection, it will *not* match "bar" in 
the title or abstract (because "bar" was not selected as a weighted "important" 
term for those fields). That's the way it should be.
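Scenario (1) can be sketched in a few lines of Python. This is only an 
illustration of per-field term selection, not the code in MoreLikeThisHelper: 
the TF-IDF-style score, the function names and the document frequencies in 
doc_freq are all assumptions made up for the example.

```python
import math
from collections import Counter


def interesting_terms(doc, doc_freq, num_docs, top_n):
    """Score every (field, term) pair of `doc` and keep the top_n best.

    Each term is scored only against the statistics of the field it
    actually occurs in (toy TF-IDF, an assumption for this sketch).
    """
    scored = []
    for field, text in doc.items():
        counts = Counter(text.lower().split())
        for term, tf in counts.items():
            df = doc_freq.get(field, {}).get(term, 0)
            idf = math.log((num_docs + 1) / (df + 1)) + 1.0
            scored.append((tf * idf, field, term))
    # Highest score first; field and term break ties deterministically.
    scored.sort(key=lambda s: (-s[0], s[1], s[2]))
    return scored[:top_n]


def to_query(terms):
    """Render the selected terms as a boosted field:term query string."""
    return " ".join(f"{field}:{term}^{score:.2f}" for score, field, term in terms)


doc = {"title": "foo bar", "summary": "baz bar", "abstract": "ping ping"}

# Hypothetical per-field document frequencies in a 100-document collection.
doc_freq = {
    "title": {"foo": 2, "bar": 90},
    "summary": {"baz": 40, "bar": 10},
    "abstract": {"ping": 95},
}

terms = interesting_terms(doc, doc_freq, num_docs=100, top_n=2)
query = to_query(terms)
print(query)
```

With these made-up statistics, "foo" is rare in titles and "bar" is rare in 
summaries, so the query keeps title:foo and summary:bar and never pairs "bar" 
with the title field.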

2) In the second pipeline, we give the full "text" for which we wish to obtain 
similar documents. If we were to emulate scenario (1), we'd have to cram the 
content of each field into a single blob of text, so it'd become something like:

foo bar, baz bar, ping ping

Solr takes this text and creates a pseudo-document that maps the
provided set of fields (mlt.fl) to this value. So effectively it
creates a pseudo-document like this:

title: foo bar, baz bar, ping ping
summary: foo bar, baz bar, ping ping
abstract: foo bar, baz bar, ping ping

What follows is identical to scenario (1), but note that this time the
set of terms for each field (and their scores) is much broader. This
means that the final query can look like this:

title:foo^1.5 summary:foo^0.5 title:bar^1 summary:bar^0.5

This severely skews MLT results: terms are matched against fields they never 
came from, and shorter fields, for example, will have drastically different 
term statistics.
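
The difference between the two inputs can be sketched as follows. Again this is 
a toy Python illustration rather than Solr code; pseudo_document and 
candidate_pairs are hypothetical helpers, and the explicit mapping stands in 
for the field:text pairs this issue proposes to accept.

```python
def pseudo_document(text, fields):
    """Free-text MLT as described above: every field gets the same blob."""
    return {field: text for field in fields}


def candidate_pairs(doc):
    """All (field, term) pairs that term extraction could select from `doc`."""
    return {
        (field, term.strip(",").lower())
        for field, text in doc.items()
        for term in text.split()
    }


fields = ["title", "summary", "abstract"]  # the mlt.fl fields

# Scenario (2): one blob of text mapped onto every field.
blob = pseudo_document("foo bar, baz bar, ping ping", fields)

# Explicit field:text pairs, as in the indexed document of scenario (1).
explicit = {"title": "foo bar", "summary": "baz bar", "abstract": "ping ping"}

blob_pairs = candidate_pairs(blob)
explicit_pairs = candidate_pairs(explicit)

# Pairs that exist only because the blob was copied into every field.
spurious = sorted(blob_pairs - explicit_pairs)
print(len(explicit_pairs), len(blob_pairs), spurious)
```

The blob yields every term in every field, so pairs such as summary:foo or 
title:ping become candidates even though no such field content was given.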



> Add support for arbitrary field:text pairs to streaming similarity 
> calculation in MoreLikeThisHandler
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-12812
>                 URL: https://issues.apache.org/jira/browse/SOLR-12812
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Dawid Weiss
>            Priority: Minor
>


