[ https://issues.apache.org/jira/browse/LUCENE-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Preetam Rao updated LUCENE-1853:
--------------------------------
    Attachment: LUCENE-1853.patch

Removed the dependency on PhraseQuery and created a new Query called "SubPhraseQuery".
Created a new patch with separate new source files, without any changes to existing files.

> PhraseQuery Scorer for scoring sub phrase matches
> -------------------------------------------------
>
>                 Key: LUCENE-1853
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1853
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>         Environment: Lucene/Java
>            Reporter: Preetam Rao
>            Priority: Minor
>         Attachments: LUCENE-1853.patch, LUCENE-1853.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> For a query like "homes in new york with swimming pool", if a document's
> field matches only "new york" it should get scored, and it should get scored
> higher than two separate matches "new" and "york". Also, a 3 word sub phrase
> match must get scored considerably higher than a 2 word sub phrase match.
> (The boost factor should be configurable.)
> If a user query is taken as is without parsing and is searched against
> multiple fields, where each sub-phrase can match against a different field,
> this kind of query is useful.
> Using shingles for this use case means each field of each document needs to
> be indexed as shingles of all (1..N)-grams, as well as the query. (Please
> correct me if I am wrong.)
> The scorer could also support
> - ignoring idf and/or field norms (so that factors outside the document
>   don't influence scoring)
> - considering only the longest match (for example, a match on "new york" is
>   scored and considered rather than "new" furniture and "york" city)
> - ignoring duplicates ("new york" appearing twice or thrice does not make any
>   difference)
> This kind of query (Phrase Query with SubPhraseScorer) could be combined with
> DisMax query.
> For example, something like Solr's dismax request handler can
> be made to use this query, where we run a user query as it is against all
> fields and configure each field with the above configurations.
> I have also attached a patch with comments and test cases, in case my
> description is not clear enough. I would appreciate alternatives or feedback.
> The goal is to give more control via configuration when searching using
> user-entered queries against multiple fields where sub phrases have special
> significance.
> Example Usage:
> <code>
> // sub phrase config
> PhraseQuery.SubPhraseConfig conf = new PhraseQuery.SubPhraseConfig();
> conf.ignoreIdf = true;
> conf.ignoreFieldNorms = true;
> conf.matchOnlyLongest = true;
> conf.ignoreDuplicates = true;
> conf.phraseBoost = 2;
> // phrase query as usual
> PhraseQuery pq = new PhraseQuery();
> pq.add(new Term("f", term));
> pq.add(new Term("f", term));
> pq.setSubPhraseConf(conf);
> Hits hits = searcher.search(pq);
> </code>
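
To make the scoring rules in the description concrete, here is a minimal standalone sketch in plain Java. It does not use Lucene and is not the code in the attached patch. The class name SubPhraseSketch, the methods maximalRuns and score, and the formula n * phraseBoost^(n-1) for a run of n words are all assumptions made for illustration; the issue only says that longer sub phrases should score considerably higher and that the boost should be configurable. Idf and field norms are left out entirely, which corresponds to ignoreIdf and ignoreFieldNorms both being true.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical illustration only: not the SubPhraseQuery from the patch.
class SubPhraseSketch {

    /** Collects every maximal run of consecutive query terms that also occurs in the field. */
    static List<String> maximalRuns(List<String> query, List<String> field) {
        List<String> runs = new ArrayList<>();
        for (int q = 0; q < query.size(); q++) {
            for (int f = 0; f < field.size(); f++) {
                if (!query.get(q).equals(field.get(f))) continue;
                // Skip positions that merely continue a run started one term earlier.
                if (q > 0 && f > 0 && query.get(q - 1).equals(field.get(f - 1))) continue;
                int len = 0;
                while (q + len < query.size() && f + len < field.size()
                        && query.get(q + len).equals(field.get(f + len))) {
                    len++;
                }
                runs.add(String.join(" ", query.subList(q, q + len)));
            }
        }
        return runs;
    }

    /** Assumed formula: a run of n words contributes n * phraseBoost^(n-1). */
    static double score(List<String> query, List<String> field, double phraseBoost,
                        boolean matchOnlyLongest, boolean ignoreDuplicates) {
        List<String> runs = maximalRuns(query, field);
        // With ignoreDuplicates, a repeated sub phrase is counted once.
        Collection<String> counted = ignoreDuplicates ? new LinkedHashSet<>(runs) : runs;
        int longest = 0;
        for (String run : counted) {
            longest = Math.max(longest, run.split(" ").length);
        }
        double total = 0.0;
        for (String run : counted) {
            int n = run.split(" ").length;
            // With matchOnlyLongest, shorter runs are dropped from the score.
            if (matchOnlyLongest && n < longest) continue;
            total += n * Math.pow(phraseBoost, n - 1);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("homes in new york with swimming pool".split(" "));
        // "new york" as a phrase: one run of 2 words
        System.out.println(score(query, Arrays.asList("new", "york", "apartment"), 2.0, true, true));
        // "new" and "york" apart: two runs of 1 word each
        System.out.println(score(query, Arrays.asList("new", "furniture", "york", "city"), 2.0, true, true));
        // "homes in new": one run of 3 words
        System.out.println(score(query, Arrays.asList("homes", "in", "new"), 2.0, true, true));
    }
}
```

With phraseBoost = 2, the contiguous match "new york" scores 4.0 while the separate matches "new" and "york" score 2.0 in total, and a 3 word run scores 12.0, which matches the ordering asked for in the description. A real Scorer would work on term positions from the index rather than on token lists.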