[jira] [Commented] (SOLR-5683) Documentation of Suggester V2

Varun Thacker (JIRA) Wed, 20 Aug 2014 00:37:15 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-5683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103562#comment-14103562
 ]


Varun Thacker commented on SOLR-5683:
-------------------------------------

First draft at documenting the suggesters. This covers all the suggester 
documentation listed under "Issues from CHANGES.txt that were never doc'ed as 
part of their release:" on the 
https://cwiki.apache.org/confluence/display/solr/Internal+-+TODO+List page.

bq. DocumentDictionaryFactory – user can specify suggestion field along with 
optional weight and payload fields from their search index.

Looking at the code of DocumentDictionaryFactory the weight field is not 
optional.
--------------------------------------------------------------------------------------------------------------------------------------------------------
field - The field from which the suggester's dictionary will be populated.
weightField - The field from which each suggestion's weight will be populated. 
This should be a numeric field. Suggestions are sorted by this value, as it is 
the sole criterion for relevance.
payloadField - Accompanying payload for each suggestion that gets built.
suggestAnalyzerFieldType - Specify the analyzer to be used for the suggester. 
The "index" analyzer of this fieldType will be used to build the suggest 
dictionary and the "query" analyzer will be used during querying.

Config (index time) options:
name - Name of the suggester. This is optional if you have only one suggester 
defined.
sourceLocation - External file location, for file-based suggesters only.
lookupImpl - Type of lookup to use. The default is JaspellLookupFactory. The 
available lookup implementations are listed below.
dictionaryImpl - The type of dictionary to use when building the suggester. 
The default is FileDictionaryFactory for a file-based suggester (that is, when 
sourceLocation is set) and HighFrequencyDictionaryFactory otherwise.
storeDir - Location to store the dictionary on disk.
buildOnCommit - If true, the suggester is rebuilt automatically after every 
commit. Useful if you want to keep the suggester in sync with your latest data.
buildOnOptimize - If true, the suggester is rebuilt automatically after every 
optimize. Useful if you want to keep the suggester in sync with your latest 
data.

Query time options:
suggest.dictionary - name of suggester to use
suggest.count - number of suggestions to return
suggest.q - query to use for lookup
suggest.build - command to build the suggester
suggest.reload - command to reload the suggester
suggest.buildAll - command to build all suggesters in the component
suggest.reloadAll - command to reload all suggesters in the component
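
To make the query-time options concrete, here is a minimal sketch that 
assembles a request URL from them. The helper name suggest_url is hypothetical, 
and the host, handler path and suggester name are taken from the example query 
in the issue description; adjust them to your own setup.

```python
from urllib.parse import urlencode


def suggest_url(base, dictionary, query, count=5, build=False):
    """Assemble a suggest request URL from the query-time options.

    `suggest_url` is an illustrative helper, not part of Solr or any client.
    """
    params = [
        ("suggest", "true"),
        ("suggest.dictionary", dictionary),  # name of the suggester to use
        ("suggest.q", query),                # text to look up
        ("suggest.count", str(count)),       # number of suggestions to return
    ]
    if build:
        # Only ask for a rebuild when explicitly requested.
        params.append(("suggest.build", "true"))
    return base + "?" + urlencode(params)


url = suggest_url("http://localhost:8983/solr/suggest", "suggester1", "elec", count=3)
print(url)
```

Building on every request is usually not what you want, which is why the 
sketch leaves suggest.build off unless asked for.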

--------------------------------------------------------------------------------------------------------------------------------------------------------

Lookup Implementation Options - 
- AnalyzingLookupFactory: Suggester that first analyzes the incoming text and 
adds the analyzed form to a weighted FST, and then does the same thing at 
lookup time.
        suggestAnalyzerFieldType - The analyzer used at "query-time" and 
"build-time" to analyze suggestions.
        exactMatchFirst - If true, exact suggestions are returned first, even 
if they are prefixes of other strings in the FST that have larger weights. 
Default is true.
        preserveSep - If true, a separator between tokens is preserved. This 
means that suggestions are sensitive to tokenization (e.g. baseball is 
different from base ball). Default is true.
        preservePositionIncrements - Whether the suggester should preserve 
position increments. This means that when token filters leave gaps (for 
example when StopFilter matches a stopword), the positions are respected when 
building the suggester. The default is false.

- FuzzyLookupFactory: An extension of the AnalyzingSuggester that is fuzzy in 
nature. Similarity is measured by the Levenshtein algorithm.
        exactMatchFirst - If true, exact suggestions are returned first, even 
if they are prefixes of other strings in the FST that have larger weights. 
Default is true.
        preserveSep - If true, a separator between tokens is preserved. This 
means that suggestions are sensitive to tokenization (e.g. baseball is 
different from base ball). Default is true.
        maxSurfaceFormsPerAnalyzedForm - Maximum number of surface forms to 
keep for a single analyzed form. When there are too many surface forms, the 
lowest weighted ones are discarded.
        maxGraphExpansions - When building the FST ("index-time"), each path 
through the tokenstream graph is added as an individual entry. This places an 
upper bound on how many expansions will be added for a single suggestion. The 
default is -1, which means there is no limit.
        preservePositionIncrements - Whether the suggester should preserve 
position increments. This means that when token filters leave gaps (for 
example when StopFilter matches a stopword), the positions are respected when 
building the suggester. The default is false.
        maxEdits - Maximum number of string edits allowed. The system's hard 
limit is 2. The default is 1.
        transpositions - If true, transpositions are treated as a primitive 
edit operation. The default is true.
        nonFuzzyPrefix - The length of the common non-fuzzy prefix that must 
match a suggestion exactly. The default is 1.
        minFuzzyLength - The minimum query length before any string edits are 
allowed. The default is 3.
        unicodeAware - Measure the maxEdits, minFuzzyLength, transpositions and 
nonFuzzyPrefix parameters in Unicode code points (actual letters) instead of 
bytes. The default is false.
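
To illustrate what maxEdits, transpositions and unicodeAware mean in practice, 
here is a small sketch of an edit-distance function. This is only an 
illustration of the metric; it is not how the Lucene suggester computes 
matches internally, and the function name edit_distance is made up for this 
example.

```python
def edit_distance(a, b, transpositions=True):
    """Edit distance between a and b.

    With transpositions=True, swapping two adjacent characters counts as a
    single edit (optimal string alignment); otherwise it costs two edits.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = number of edits needed to turn a[:i] into b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


with_swap = edit_distance("solr", "sorl", transpositions=True)
without_swap = edit_distance("solr", "sorl", transpositions=False)
print(with_swap, without_swap)

# Code points versus bytes: the same word has different lengths.
word = "caf\u00e9"
print(len(word), len(word.encode("utf-8")))
```

With maxEdits=1, a query typed as "sorl" could only reach "solr" when the swap 
counts as a single edit, which is the case the transpositions option covers.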

- AnalyzingInfixLookupFactory: Analyzes the input text and then suggests 
matches based on prefix matches to any tokens in the indexed text. This uses a 
Lucene index for its dictionary.
        indexPath - When using AnalyzingInfixSuggester you can provide your own 
path where the index will get built. The default is 
analyzingInfixSuggesterIndexDir and will be created in your collection's data 
directory.
        minPrefixChars - Minimum number of leading characters before 
PrefixQuery is used (default 4). Prefixes shorter than this are indexed as 
character ngrams (increasing index size but making lookups faster).

- BlendedInfixLookupFactory: An extension of the AnalyzingInfixSuggester that 
additionally lets prefix matches across the matched documents be weighted. You 
can tell it to score a hit higher if it is closer to the start of the 
suggestion, or vice versa.
        blenderType - Used to calculate the weight coefficient from the 
position of the first matching word.
                linear: weightFieldValue*(1 - 0.10*position)  - Matches to the 
start will be given a higher score (Default)
                reciprocal: weightFieldValue/(1+position)  - Matches to the end 
will be given a higher score.
        numFactor - Multiplier applied to the number of requested suggestions 
to determine how many elements are searched before results are pruned. Default 
is 10.
        indexPath - When using BlendedInfixSuggester you can provide your own 
path where the index will get built. The default directory name is 
blendedInfixSuggesterIndexDir and will be created in your collection's data 
directory.
        minPrefixChars - Minimum number of leading characters before 
PrefixQuery is used (default 4). Prefixes shorter than this are indexed as 
character ngrams (increasing index size but making lookups faster).
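
As a worked example of the two blenderType formulas, the sketch below 
evaluates them exactly as written above for a few positions of the first 
matching word. The function name blended_weight and the sample weight of 100 
are assumptions made for illustration only.

```python
def blended_weight(weight_field_value, position, blender_type="linear"):
    """Apply the blenderType formulas from the draft to a raw weight."""
    if blender_type == "linear":
        return weight_field_value * (1 - 0.10 * position)
    if blender_type == "reciprocal":
        return weight_field_value / (1 + position)
    raise ValueError("unknown blender type: " + blender_type)


# Compare both formulas for a match at word positions 0 through 3.
table = [
    (pos, blended_weight(100, pos, "linear"), blended_weight(100, pos, "reciprocal"))
    for pos in range(4)
]
for pos, linear, reciprocal in table:
    print(pos, round(linear, 2), round(reciprocal, 2))
```

Running it for your own weights and typical match positions is a quick way to 
see how strongly each formula reorders suggestions.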

- FreeTextLookupFactory: Looks at the last tokens plus the prefix of whatever 
final token the user is typing, if present, to predict the most likely next 
token. The number of previous tokens to consider can also be specified. This 
suggester would only be used as a fallback, when the primary suggester fails 
to find any suggestions.
        ngrams - The maximum number of tokens from which shingles are built to 
form the dictionary. The default value is 2. Increase this if you want more 
than the previous 2 tokens to be taken into consideration when making 
suggestions.
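
To show what the ngrams setting controls, here is a rough sketch that lists 
the token groups (shingles) of up to a given size for a short phrase. It only 
illustrates the idea; the actual suggester builds its model inside Lucene, and 
the function name shingles is hypothetical.

```python
def shingles(tokens, max_size=2):
    """Return every run of 1..max_size consecutive tokens, joined by a space."""
    out = []
    for size in range(1, max_size + 1):
        for start in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[start:start + size]))
    return out


grams = shingles(["new", "york", "city"], max_size=2)
print(grams)
```

Raising max_size to 3 would add "new york city" to the list, which mirrors 
how a larger ngrams value takes more preceding tokens into account.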

- FSTLookupFactory: An FST based suggester.
        exactMatchFirst - If true, exact suggestions are returned first, even 
if they are prefixes of other strings in the FST that have larger weights. 
Default is true.
        weightBuckets - The number of separate buckets for weights which the 
suggester will use while building its dictionary.

- TSTLookupFactory: A simple compact ternary trie based lookup.

- WFSTLookupFactory: Weighted automaton representation; an alternative to 
FSTLookup for more fine-grained ranking. WFSTLookup does not use buckets, but 
instead a shortest path algorithm. Note that it expects weights to be whole 
numbers.

- JaspellLookupFactory: A more complex lookup based on a ternary trie from the 
JaSpell (http://jaspell.sourceforge.net/) project.

--------------------------------------------------------------------------------------------------------------------------------------------------------

Dictionary pluggability - You can choose the dictionary implementation that 
supplies input to your suggesters.

Input from search index
DocumentDictionaryFactory – You need to specify the suggestion field ('field') 
along with weight ('weightField') and payload ('payloadField') fields from 
your search index.
DocumentExpressionDictionaryFactory – Same as DocumentDictionaryFactory but 
allows users to specify an arbitrary expression in the 'weightExpression' tag.
        weightExpression - An arbitrary expression used for scoring the 
suggestions. The fields it refers to need to be numeric fields.
HighFrequencyDictionaryFactory – You can specify a suggestion field and a 
threshold to prune out less frequent terms.
        threshold - A value between zero and one representing the minimum 
fraction of the total documents where a term should appear in order to be added 
to the lookup dictionary.
Input from external files
FileDictionaryFactory – You can specify a file which contains suggest entries, 
along with weights and payloads. One entry is allowed per line.
        fieldDelimiter - The delimiter used to separate the entries, weights 
and payloads. The default is tab.
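
For the file-based dictionary, a line could be produced along the lines of the 
sketch below. It assumes the field order entry, weight, payload separated by 
fieldDelimiter, and that a payload is only written when a weight is present; 
check both assumptions against FileDictionaryFactory before relying on them. 
The helper name dictionary_line is made up for this example.

```python
def dictionary_line(term, weight=None, payload=None, delimiter="\t"):
    """Format one suggest entry as a single line for a dictionary file.

    Assumed layout: term<delimiter>weight<delimiter>payload.
    """
    fields = [term]
    if weight is not None:
        fields.append(str(weight))
        if payload is not None:
            fields.append(payload)
    return delimiter.join(fields)


lines = [
    dictionary_line("electronics", 100, "cat1"),
    dictionary_line("electric guitar", 40),
    dictionary_line("elephant"),
]
print("\n".join(lines))
```

Each returned string is one line of the file named by sourceLocation, one 
entry per line.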
--------------------------------------------------------------------------------------------------------------------------------------------------------
Using Multiple Suggesters -
You can request suggestions from multiple suggesters for the same query by 
repeating the suggest.dictionary parameter.
Example Syntax - 
localhost:8983/solr/suggest?suggest=true&suggest.dictionary=suggest1&suggest.dictionary=suggest2&suggest.q=python
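
As a sketch of how a client might use this, the code below builds the repeated 
suggest.dictionary parameters and then collects the suggested terms per 
suggester. The sample response is hand-written to follow the response 
structure given in the issue description quoted below; it is not real output, 
and terms_by_suggester is a hypothetical helper.

```python
from urllib.parse import urlencode

# A list value is expanded into one parameter per element when doseq=True.
params = {
    "suggest": "true",
    "suggest.dictionary": ["suggest1", "suggest2"],
    "suggest.q": "python",
}
query = urlencode(params, doseq=True)
print(query)


def terms_by_suggester(response):
    """Map each suggester name to the list of terms it returned."""
    result = {}
    for name, per_query in response.get("suggest", {}).items():
        result[name] = [
            suggestion["term"]
            for entry in per_query.values()
            for suggestion in entry["suggestions"]
        ]
    return result


# Hand-written sample shaped like the documented response format.
sample = {
    "suggest": {
        "suggest1": {"python": {"numFound": 1, "suggestions": [
            {"term": "python cookbook", "weight": 100, "payload": ""}]}},
        "suggest2": {"python": {"numFound": 0, "suggestions": []}},
    }
}
terms = terms_by_suggester(sample)
print(terms)
```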

> Documentation of Suggester V2
> -----------------------------
>
>                 Key: SOLR-5683
>                 URL: https://issues.apache.org/jira/browse/SOLR-5683
>             Project: Solr
>          Issue Type: Task
>          Components: SearchComponents - other
>            Reporter: Areek Zillur
>            Assignee: Areek Zillur
>             Fix For: 4.9, 5.0
>
>
> Place holder for documentation that will eventually end up in the Solr Ref 
> guide.
> ====
> The new Suggester Component allows Solr to fully utilize the Lucene 
> suggesters. 
> The main features are:
> - lookup pluggability (TODO: add description):
>   -- AnalyzingInfixLookupFactory
>   -- AnalyzingLookupFactory
>   -- FuzzyLookupFactory
>   -- FreeTextLookupFactory
>   -- FSTLookupFactory
>   -- WFSTLookupFactory
>   -- TSTLookupFactory
>   --  JaspellLookupFactory
>    - Dictionary pluggability (give users the option to choose the dictionary 
> implementation to use for their suggesters to consume)
>    -- Input from search index
>       --- DocumentDictionaryFactory – user can specify suggestion field along 
> with optional weight and payload fields from their search index.
>       --- DocumentExpressionFactory – same as DocumentDictionaryFactory but 
> allows users to specify arbitrary expression using existing numeric fields.
>      --- HighFrequencyDictionaryFactory – user can specify a suggestion field 
> and specify a threshold to prune out less frequent terms.       
>    -- Input from external files
>      --- FileDictionaryFactory – user can specify a file which contains 
> suggest entries, along with optional weights and payloads.
> Config (index time) options:
>   - name - name of suggester
>   - sourceLocation - external file location (for file-based suggesters)
>   - lookupImpl - type of lookup to use [default JaspellLookupFactory]
>   - dictionaryImpl - type of dictionary to use (lookup input) [default
>     (sourceLocation == null ? HighFrequencyDictionaryFactory : 
> FileDictionaryFactory)]
>   - storeDir - location to store in-memory data structure in disk
>   - buildOnCommit - command to build suggester for every commit
>   - buildOnOptimize - command to build suggester for every optimize
> Query time options:
>   - suggest.dictionary - name of suggester to use (can occur multiple times 
> for batching suggester requests)
>   - suggest.count - number of suggestions to return
>   - suggest.q - query to use for lookup
>   - suggest.build - command to build the suggester
>   - suggest.reload - command to reload the suggester
>   - buildAll – command to build all suggesters in the component
>   - reloadAll – command to reload all suggesters in the component
> Example query:
> {code}
> http://localhost:8983/solr/suggest?suggest.dictionary=suggester1&suggest=true&suggest.build=true&suggest.q=elec
> {code}
> Distributed query:
> {code}
> http://localhost:7574/solr/suggest?suggest.dictionary=suggester2&suggest=true&suggest.build=true&suggest.q=elec&shards=localhost:8983/solr,localhost:7574/solr&shards.qt=/suggest
> {code}        
> Response Format:
> The response format can be either XML or JSON. The typical response structure 
> is as follows:
>  {code}
> {
>   suggest: {
>     suggester_name: {
>       suggest_query: { numFound: .., suggestions: [ {term: .., weight: .., 
> payload: ..}, .. ] }
>     }
>   }
> }
> {code}
>   
> Example Response:
> {code}
> {
>     responseHeader: {
>         status: 0,
>         QTime: 3
>     },
>     suggest: {
>         suggester1: {
>             e: {
>                 numFound: 1,
>                 suggestions: [
>                     {
>                         term: "electronics and computer1",
>                         weight: 100,
>                         payload: ""
>                     }
>                 ]
>             }
>         },
>         suggester2: {
>             e: {
>                 numFound: 1,
>                 suggestions: [
>                     {
>                         term: "electronics and computer1",
>                         weight: 10,
>                         payload: ""
>                     }
>                 ]
>             }
>         }
>     }
> }
> {code}
> Example solrconfig snippet with multiple suggester configuration:
> {code}  
>   <searchComponent name="suggest" class="solr.SuggestComponent">
>     <lst name="suggester">
>       <str name="name">suggester1</str>
>       <str name="lookupImpl">FuzzyLookupFactory</str>      
>       <str name="dictionaryImpl">DocumentDictionaryFactory</str>      
>       <str name="field">cat</str>
>       <str name="weightField">price</str>
>       <str name="suggestAnalyzerFieldType">string</str>
>     </lst>
>    <lst name="suggester">
>         <str name="name">suggester2</str>
>         <str name="dictionaryImpl">DocumentExpressionDictionaryFactory</str>
>         <str name="lookupImpl">FuzzyLookupFactory</str>
>         <str name="field">product_name</str>
>         <str name="weightExpression">((price * 2) + ln(popularity))</str>
>         <str name="sortField">weight</str>
>         <str name="sortField">price</str>
>         <str name="storeDir">suggest_fuzzy_doc_expr_dict</str>
>         <str name="suggestAnalyzerFieldType">text</str>
>       </lst>  
> </searchComponent>
> {code}


