[ 
https://issues.apache.org/jira/browse/LUCENE-6339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Areek Zillur updated LUCENE-6339:
---------------------------------
    Attachment: LUCENE-6339.patch

Thanks [~mikemccand] for the review!
{quote}
In TopSuggestDocsCollector:
In collect, we seem to assume the suggest searcher will never call
collect more than num times? How is that? If so, can you add that to
the javadocs, and maybe add an assert upto < num in collect?
Can we just allocate scoreDocs up front instead of lazily?
In the javadocs, instead of "one hit can be..." maybe "one doc can
be..."? Hit is a tricky word in this context since it could be a doc
or a suggestion...
{quote}

I have rewritten {{TopSuggestDocsCollector}} to keep a priority queue at the
top level, somewhat like {{TopDocsCollector}}.
Completions across segments are now collected in the same pq, which allows
early termination for suggesters at the segment level: when a collected
completion overflows the pq, we can disregard the rest of the completions for
that segment, because completions are collected in order of their scores.
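
To make the early-termination argument concrete, here is a minimal, self-contained sketch of the idea. It is not the code from the patch: {{Completion}}, {{SharedQueueCollector}}, {{collectSegment}} and {{topDocs}} are hypothetical names, and the sketch assumes each segment hands over its completions in descending score order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for a collected completion; not a Lucene class. */
final class Completion {
    final int doc;
    final long score;

    Completion(int doc, long score) {
        this.doc = doc;
        this.score = score;
    }
}

/** Sketch of a collector that shares one bounded queue across all segments. */
final class SharedQueueCollector {
    private final int num;
    // Min-heap on score: the head is the weakest completion kept so far.
    private final PriorityQueue<Completion> pq =
        new PriorityQueue<>(Comparator.comparingLong((Completion c) -> c.score));

    SharedQueueCollector(int num) {
        this.num = num;
    }

    /** Returns false when the completion did not make it into the full queue. */
    boolean collect(Completion c) {
        if (pq.size() < num) {
            pq.add(c);
            return true;
        }
        if (c.score > pq.peek().score) {
            pq.poll(); // evict the current weakest entry
            pq.add(c);
            return true;
        }
        return false;
    }

    /**
     * Feeds one segment's completions (assumed sorted by descending score)
     * and returns how many of them were examined.
     */
    int collectSegment(List<Completion> sortedByScoreDesc) {
        int visited = 0;
        for (Completion c : sortedByScoreDesc) {
            visited++;
            if (!collect(c)) {
                // The rest of this segment scores no higher, so stop early.
                break;
            }
        }
        return visited;
    }

    /** Returns the kept completions, best score first. */
    List<Completion> topDocs() {
        List<Completion> out = new ArrayList<>(pq);
        out.sort(Comparator.comparingLong((Completion c) -> c.score).reversed());
        return out;
    }
}
```

With {{num}} = 3, a first segment scoring 10, 8, 6, 4 fills the queue and stops at 4. A second segment scoring 9, 7, 5, 3 replaces 6 with 9 and then stops at 7, so only two of its four completions are examined.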

{quote}
In SuggestIndexSearcher, does it really ever make sense to take a
generic Collector/LeafCollector? Can we instead just strongly type
the params to all the methods to be TopSuggestDocsCollector?
{quote}
Thanks for the suggestion! The generic Collector/LeafCollector has been removed.
Current API:
{code:java}
public void suggest(String field, CharSequence key, int num, Filter filter, 
TopSuggestDocsCollector collector) 
{code}

{quote}
"In case a filter has to be applied, the queue size is doubled" is not
quite correct? Maybe change the logic there so the int queueSize is
first computed, and then if filter is enabled, it's doubled?
{quote}
The queueSize is now increased by half the number of live docs in the segment
instead. If a filter is applied, the queue size should be increased with
respect to the number of documents.
If the applied filter filters out <= half of the top-scoring documents for a
query prefix, then the search is admissible.
If a filter is too restrictive, then the search is inadmissible. A workaround
would be to multiply {{num}} by some factor; in this case early termination
might help (if {{TopSuggestDocsCollector}} is initialized with the original
{{num}}). Thoughts?
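
The admissibility condition can be illustrated with a deliberately simplified model. This is not the patch's code: {{FilteredSuggestSketch}} and its methods are hypothetical, and treating the queue size as a cap on how many top-scoring documents get examined is an assumption made only for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Simplified model of filtered suggestion over one segment; names are hypothetical. */
final class FilteredSuggestSketch {

    /** Queue size as described above: num plus half the live docs when a filter is applied. */
    static int queueSize(int num, int liveDocs, boolean hasFilter) {
        return hasFilter ? num + liveDocs / 2 : num;
    }

    /**
     * Walks doc ids sorted by descending score, examines at most queueSize of them
     * (an assumption of this model), and keeps up to num that the filter accepts.
     */
    static List<Integer> suggest(int[] docsByScoreDesc, int num, int queueSize, IntPredicate filter) {
        List<Integer> hits = new ArrayList<>();
        int limit = Math.min(queueSize, docsByScoreDesc.length);
        for (int i = 0; i < limit && hits.size() < num; i++) {
            if (filter.test(docsByScoreDesc[i])) {
                hits.add(docsByScoreDesc[i]);
            }
        }
        return hits;
    }
}
```

In this model, with 8 live docs and {{num}} = 2 the queue size is 6. A filter that rejects the top 4 docs (half of them) still yields 2 suggestions, while a filter that rejects the top 6 yields none even though 2 matching docs exist, which is the inadmissible case.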

Updated Patch:
 - SuggestIndexSearcher cleanup
 - TopSuggestDocsCollector re-write
 - remove WeightProcessor from NRTSuggester
 - added more tests (including boundary cases for deleted/filtered-out documents)

> [suggest] Near real time Document Suggester
> -------------------------------------------
>
>                 Key: LUCENE-6339
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6339
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/search
>    Affects Versions: 5.0
>            Reporter: Areek Zillur
>            Assignee: Areek Zillur
>             Fix For: 5.0
>
>         Attachments: LUCENE-6339.patch, LUCENE-6339.patch, LUCENE-6339.patch, 
> LUCENE-6339.patch
>
>
> The idea is to index documents with one or more *SuggestField*(s) and be able 
> to suggest documents with a *SuggestField* value that matches a given key.
> A SuggestField can be assigned a numeric weight to be used to score the 
> suggestion at query time.
> Document suggestion can be done on an indexed *SuggestField*. The document 
> suggester can filter out deleted documents in near real-time. The suggester 
> can filter out documents based on a Filter (note: may change to a non-scoring 
> query?) at query time.
> A custom postings format (CompletionPostingsFormat) is used to index 
> SuggestField(s) and perform document suggestions.
> h4. Usage
> {code:java}
>   // hook up custom postings format
>   // indexAnalyzer for SuggestField
>   Analyzer analyzer = ...
>   IndexWriterConfig config = new IndexWriterConfig(analyzer);
>   Codec codec = new Lucene50Codec() {
>     PostingsFormat completionPostingsFormat = new Completion50PostingsFormat();
>     @Override
>     public PostingsFormat getPostingsFormatForField(String field) {
>       if (isSuggestField(field)) {
>         return completionPostingsFormat;
>       }
>       return super.getPostingsFormatForField(field);
>     }
>   };
>   config.setCodec(codec);
>   IndexWriter writer = new IndexWriter(dir, config);
>   // index some documents with suggestions
>   Document doc = new Document();
>   doc.add(new SuggestField("suggest_title", "title1", 2));
>   doc.add(new SuggestField("suggest_name", "name1", 3));
>   writer.addDocument(doc);
>   ...
>   // open an nrt reader for the directory
>   DirectoryReader reader = DirectoryReader.open(writer, false);
>   // SuggestIndexSearcher is a thin wrapper over IndexSearcher
>   // queryAnalyzer will be used to analyze the query string
>   SuggestIndexSearcher indexSearcher = new SuggestIndexSearcher(reader, queryAnalyzer);
>   
>   // suggest 10 documents for "titl" on "suggest_title" field
>   TopSuggestDocs suggest = indexSearcher.suggest("suggest_title", "titl", 10);
> {code}
> h4. Indexing
> The index analyzer is set through *IndexWriterConfig*.
> {code:java}
> SuggestField(String name, String value, long weight) 
> {code}
> h4. Query
> The query analyzer is set through *SuggestIndexSearcher*.
> Hits are collected in descending order of the suggestion's weight.
> {code:java}
> // full options for TopSuggestDocs (TopDocs)
> TopSuggestDocs suggest(String field, CharSequence key, int num, Filter filter)
> // full options for Collector
> // note: only collects, does not score
> void suggest(String field, CharSequence key, int maxNumPerLeaf, Filter filter, Collector collector)
> {code}
> h4. Analyzer
> *CompletionAnalyzer* can instead be used to wrap another analyzer in order
> to tune suggest-field-only parameters.
> {code:java}
> CompletionAnalyzer(Analyzer analyzer, boolean preserveSep, boolean 
> preservePositionIncrements, int maxGraphExpansions)
> {code}



