[jira] Updated: (LUCENE-1853) SubPhraseQuery for matching and scoring sub phrase matches.

Preetam Rao (JIRA) Sun, 30 Aug 2009 22:07:57 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Preetam Rao updated LUCENE-1853:
--------------------------------

    Description: 
The goal is to give more control via configuration when searching with 
user-entered queries against multiple fields where sub phrases have special 
significance.

For a query like "homes in new york with swimming pool", if a document's field 
matches only "new york", it should still be scored, and scored higher than two 
separate matches of "new" and "york". Also, a 3-word sub phrase match should be 
scored considerably higher than a 2-word sub phrase match (the boost factor 
should be configurable).

Using shingles for this use case means that each field of each document needs 
to be indexed as shingles of all (1..N)-grams, and the query needs to be 
shingled the same way. (Please correct me if I am wrong.)
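
To make the indexing cost of the shingle approach concrete, here is a minimal 
plain-Java sketch that expands a token list into all (1..N)-grams. It is not 
Lucene's shingle filter and not part of the attached patch; the class and 
method names (ShingleSketch, shingles) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Lucene class: expands tokens into 1..maxSize-grams.
final class ShingleSketch {

    /** Returns every contiguous n-gram of the tokens for n = 1..maxSize. */
    static List<String> shingles(List<String> tokens, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxSize; n++) {
            // A window of n tokens can start anywhere it still fits.
            for (int start = 0; start + n <= tokens.size(); start++) {
                out.add(String.join(" ", tokens.subList(start, start + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> field = Arrays.asList("homes", "in", "new", "york");
        // Prints the unigrams, bigrams and trigrams of the field text.
        System.out.println(shingles(field, 3));
    }
}
```

Every field value grows from T tokens to roughly N times T indexed terms, which 
is the overhead a dedicated sub phrase query is meant to avoid.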

The query could also support:
- ignoring idf and/or field norms (so that factors outside the document 
don't influence scoring)
- considering only the longest match (for example, a match on "new york" is 
scored, rather than separate matches on "new" furniture and "york" city)
- ignoring duplicates ("new york" appearing two or three times makes no 
difference)
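
The options above can be sketched in plain Java, without any Lucene classes. 
This is only an illustration under stated assumptions, not the scorer from the 
attached patch: the scoring formula (length times phraseBoost raised to 
length minus one) is invented here, and idf and field norms are simply left 
out, which corresponds to both "ignore" options being switched on.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of sub phrase scoring; names and formula are assumptions.
final class SubPhraseSketch {

    static final class Config {
        boolean matchOnlyLongest = true;
        boolean ignoreDuplicates = true;
        double phraseBoost = 2.0;
    }

    /** Assumed formula: longer runs grow by a factor of phraseBoost per extra word. */
    static double runScore(int len, double phraseBoost) {
        return len * Math.pow(phraseBoost, len - 1);
    }

    /** Scores the maximal contiguous runs of query terms found in the document tokens. */
    static double score(List<String> query, List<String> doc, Config conf) {
        double total = 0.0;
        int longest = 0;
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < query.size(); i++) {
            for (int j = 0; j < doc.size(); j++) {
                if (!query.get(i).equals(doc.get(j))) continue;
                // Skip positions that merely continue a run that started one word earlier.
                if (i > 0 && j > 0 && query.get(i - 1).equals(doc.get(j - 1))) continue;
                int len = 0;
                while (i + len < query.size() && j + len < doc.size()
                        && query.get(i + len).equals(doc.get(j + len))) {
                    len++;
                }
                String phrase = String.join(" ", query.subList(i, i + len));
                // With ignoreDuplicates, a repeated phrase is counted only once.
                if (conf.ignoreDuplicates && !seen.add(phrase)) continue;
                total += runScore(len, conf.phraseBoost);
                longest = Math.max(longest, len);
            }
        }
        if (conf.matchOnlyLongest) {
            return longest == 0 ? 0.0 : runScore(longest, conf.phraseBoost);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("homes", "in", "new", "york", "with", "swimming", "pool");
        Config conf = new Config();
        System.out.println(score(query, Arrays.asList("flats", "in", "new", "york"), conf));
        System.out.println(score(query, Arrays.asList("new", "furniture", "from", "york"), conf));
    }
}
```

With these assumptions, a field containing the phrase "new york" outscores a 
field that contains "new" and "york" separately, and a 3-word run outscores a 
2-word run by a margin controlled by phraseBoost.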

This kind of query could be combined with a DisMax query. For example, something 
like Solr's dismax request handler could be made to use this query, running the 
user query as is against all fields, with each field configured using the 
options above.
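
As a rough picture of the DisMax combination, the sketch below merges 
per-field scores the way a disjunction-max query does: the best field wins, 
and the other matching fields contribute only through a tie-breaker. The class 
and method names are hypothetical, and the per-field scores are assumed to 
come from running the sub phrase query against each field separately.

```java
// Hypothetical sketch of dismax-style combination of per-field scores.
final class DisMaxSketch {

    /** Returns max(fieldScores) + tieBreaker * (sum of the remaining field scores). */
    static double combine(double[] fieldScores, double tieBreaker) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : fieldScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }
}
```

For example, combine(new double[] {4.0, 1.0, 0.0}, 0.1) lets the field with the 
"new york" phrase match dominate while a weaker match in another field adds 
only a small increment.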

I have also attached a patch with comments and test cases, in case my 
description is not clear enough. I would appreciate alternatives or feedback.

Example Usage:

<code>
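    // sub phrase config
    SubPhraseQuery.SubPhraseConfig conf = new SubPhraseQuery.SubPhraseConfig();
    conf.ignoreIdf = true;          // do not let idf influence the score
    conf.ignoreFieldNorms = true;   // do not let field norms influence the score
    conf.matchOnlyLongest = true;   // score only the longest sub phrase match
    conf.ignoreDuplicates = true;   // repeated matches count once
    conf.phraseBoost = 2;           // boost factor for longer sub phrases
    // build the query like a phrase query: one add() per query term, in order
    SubPhraseQuery pq = new SubPhraseQuery();
    pq.add(new Term("f", "new"));   // e.g. the first term of the user query
    pq.add(new Term("f", "york"));  // e.g. the second term of the user query
    pq.setSubPhraseConf(conf);
    Hits hits = searcher.search(pq);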
</code>

  was:
For a query like "homes in new york with swimming pool", if a document's field 
matches only "new york", it should still be scored, and scored higher than two 
separate matches of "new" and "york". Also, a 3-word sub phrase match should be 
scored considerably higher than a 2-word sub phrase match (the boost factor 
should be configurable).

If a user query is taken as is without parsing and is searched against multiple 
fields, where each sub-phrase can match against a different field, this kind of 
query is useful. 

Using shingles for this use case means that each field of each document needs 
to be indexed as shingles of all (1..N)-grams, and the query needs to be 
shingled the same way. (Please correct me if I am wrong.)

The scorer could also support:
- ignoring idf and/or field norms (so that factors outside the document 
don't influence scoring)
- considering only the longest match (for example, a match on "new york" is 
scored, rather than separate matches on "new" furniture and "york" city)
- ignoring duplicates ("new york" appearing two or three times makes no 
difference)

This kind of query (PhraseQuery with SubPhraseScorer) could be combined with a 
DisMax query. For example, something like Solr's dismax request handler could 
be made to use this query, running the user query as is against all fields, 
with each field configured using the options above.

I have also attached a patch with comments and test cases, in case my 
description is not clear enough. I would appreciate alternatives or feedback. 
The goal is to give more control via configuration when searching with 
user-entered queries against multiple fields where sub phrases have special 
significance.

Example Usage:

<code>
    // sub phrase config
    PhraseQuery.SubPhraseConfig conf = new PhraseQuery.SubPhraseConfig();
    conf.ignoreIdf = true;
    conf.ignoreFieldNorms = true;
    conf.matchOnlyLongest = true;
    conf.ignoreDuplicates = true;
    conf.phraseBoost = 2;
    // phrase query as usual
    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("f", term));
    pq.add(new Term("f", term));
    pq.setSubPhraseConf(conf);
    Hits hits = searcher.search(pq);
</code>

        Summary: SubPhraseQuery for matching and scoring sub phrase matches.  
(was: PhraseQuery Scorer for scoring sub phrase matches)

Removed the dependency on PhraseQuery so that this can be reviewed and used 
independently. Made it a separate query with configurations specific to sub 
phrase matches. The new patch makes no changes to any existing files. 
Please let me know your thoughts.

> SubPhraseQuery for matching and scoring sub phrase matches.
> -----------------------------------------------------------
>
>                 Key: LUCENE-1853
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1853
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>         Environment: Lucene/Java
>            Reporter: Preetam Rao
>            Priority: Minor
>         Attachments: LUCENE-1853.patch, LUCENE-1853.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


