[
https://issues.apache.org/jira/browse/SOLR-5683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103562#comment-14103562
]
Varun Thacker commented on SOLR-5683:
-------------------------------------
First draft of the suggester documentation. This covers all the suggester
items listed under "Issues from CHANGES.txt that were never doc'ed as part of
their release:" at
https://cwiki.apache.org/confluence/display/solr/Internal+-+TODO+List
bq. DocumentDictionaryFactory – user can specify suggestion field along with
optional weight and payload fields from their search index.
Looking at the code of DocumentDictionaryFactory the weight field is not
optional.
--------------------------------------------------------------------------------------------------------------------------------------------------------
field - The field from which the suggester's dictionary will be populated.
weightField - The field from which the suggestion's weight will be populated.
This should be a numeric field. Suggestions are sorted by this value, as it is
the sole criterion for relevance.
payloadField - Accompanying payload for each suggestion that gets built.
suggestAnalyzerFieldType - Specify the analyzer to be used for the suggester.
The "index" analyzer of this fieldType will be used to build the suggest
dictionary and the "query" analyzer will be used during querying.
Config (index time) options:
name - Name of the suggester. This is optional if you have only one suggester
defined.
sourceLocation - External file location, for file-based suggesters only.
lookupImpl - Type of lookup to use. The default is JaspellLookupFactory. The
available lookup implementations are listed below.
dictionaryImpl - The type of dictionary to be used when building the suggester.
The default is FileDictionaryFactory for a file-based suggester and
HighFrequencyDictionaryFactory otherwise.
storeDir - Location to store the dictionary on disk.
buildOnCommit - If true, the suggester is rebuilt automatically after every
commit. Useful if you want to keep the suggester in sync with your latest data.
buildOnOptimize - If true, the suggester is rebuilt automatically after every
optimize. Useful if you want to keep the suggester in sync with your latest
data.
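The dictionaryImpl default described above can be read as a simple rule. A
minimal sketch in Python, assuming a file-based suggester is one that has
sourceLocation set; the function name and the file name are hypothetical, only
the two factory names come from the text:

```python
from typing import Optional


def default_dictionary_impl(source_location: Optional[str]) -> str:
    """Return the dictionaryImpl used when none is configured explicitly.

    Hypothetical helper for illustration; it only mirrors the rule stated in
    the option description and is not Solr code.
    """
    if source_location is None:
        # No external file configured: the dictionary comes from the index.
        return "HighFrequencyDictionaryFactory"
    # An external file is configured: treat it as a file-based suggester.
    return "FileDictionaryFactory"


print(default_dictionary_impl(None))
print(default_dictionary_impl("suggestions.txt"))
```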
Query time options:
suggest.dictionary - name of suggester to use
suggest.count - number of suggestions to return
suggest.q - query to use for lookup
suggest.build - command to build the suggester
suggest.reload - command to reload the suggester
suggest.buildAll - command to build all suggesters in the component
suggest.reloadAll - command to reload all suggesters in the component
--------------------------------------------------------------------------------------------------------------------------------------------------------
Lookup Implementation Options -
- AnalyzingLookupFactory: Suggester that first analyzes the incoming text and
adds the analyzed form to a weighted FST, and then does the same thing at
lookup time.
    suggestAnalyzerFieldType - The field type whose analyzer is used at
    "query-time" and "build-time" to analyze suggestions.
    exactMatchFirst - If true, exact suggestions are returned first, even if
    they are prefixes of other strings in the FST that have larger weights.
    Default is true.
    preserveSep - If true, a separator between tokens is preserved. This means
    that suggestions are sensitive to tokenization (e.g. "baseball" is
    different from "base ball"). Default is true.
    preservePositionIncrements - Whether the suggester should preserve position
    increments. This means that when token filters leave gaps (for example
    when StopFilter matches a stopword), the gap positions are respected when
    building the suggester. The default is false.
- FuzzyLookupFactory: An extension of the AnalyzingSuggester that is fuzzy in
nature. Similarity is measured by the Levenshtein algorithm.
    exactMatchFirst - If true, exact suggestions are returned first, even if
    they are prefixes of other strings in the FST that have larger weights.
    Default is true.
    preserveSep - If true, a separator between tokens is preserved. This means
    that suggestions are sensitive to tokenization (e.g. "baseball" is
    different from "base ball"). Default is true.
    maxSurfaceFormsPerAnalyzedForm - Maximum number of surface forms to keep
    for a single analyzed form. When there are too many surface forms, the
    lowest weighted ones are discarded.
    maxGraphExpansions - When building the FST ("index-time"), each path
    through the tokenstream graph is added as an individual entry. This places
    an upper bound on how many expansions will be added for a single
    suggestion. The default is -1, which means there is no limit.
    preservePositionIncrements - Whether the suggester should preserve position
    increments. This means that when token filters leave gaps (for example
    when StopFilter matches a stopword), the gap positions are respected when
    building the suggester. The default is false.
    maxEdits - Maximum number of string edits allowed. The system's hard limit
    is 2. The default is 1.
    transpositions - If true, transpositions are treated as a primitive edit
    operation. The default is true.
    nonFuzzyPrefix - The length of the common non-fuzzy prefix that must match
    a suggestion. The default is 1.
    minFuzzyLength - The minimum length of the query before any string edits
    are allowed. The default is 3.
    unicodeAware - Measure the maxEdits, minFuzzyLength, transpositions and
    nonFuzzyPrefix parameters in Unicode code points (actual letters) instead
    of bytes. The default is false.
- AnalyzingInfixLookupFactory: Analyzes the input text and then suggests
matches based on prefix matches to any tokens in the indexed text. This uses a
Lucene index for its dictionary.
    indexPath - When using AnalyzingInfixSuggester you can provide your own
    path where the index will get built. The default is
    analyzingInfixSuggesterIndexDir and will be created in your collection's
    data directory.
    minPrefixChars - Minimum number of leading characters before PrefixQuery
    is used (default 4). Prefixes shorter than this are indexed as character
    ngrams (increasing index size but making lookups faster).
- BlendedInfixLookupFactory: An extension of the AnalyzingInfixSuggester that
additionally allows the prefix matches across the matched documents to be
weighted. You can tell it to score a hit higher the closer it is to the start
of the suggestion.
    blenderType - Used to calculate the weight coefficient from the position
    of the first matching word.
        linear: weightFieldValue*(1 - 0.10*position) - Matches closer to the
        start are given a higher score (default).
        reciprocal: weightFieldValue/(1+position) - Matches closer to the
        start are also given a higher score, but the coefficient decays faster
        than in the linear case.
    numFactor - Factor to multiply the number of searched elements from which
    results will be pruned. Default is 10.
    indexPath - When using BlendedInfixSuggester you can provide your own path
    where the index will get built. The default directory name is
    blendedInfixSuggesterIndexDir and will be created in your collection's
    data directory.
    minPrefixChars - Minimum number of leading characters before PrefixQuery
    is used (default 4). Prefixes shorter than this are indexed as character
    ngrams (increasing index size but making lookups faster).
- FreeTextLookupFactory: Looks at the last tokens plus the prefix of whatever
final token the user is typing, if present, to predict the most likely next
token. The number of previous tokens to consider can also be specified. This
suggester would only be used as a fallback, when the primary suggester fails
to find any suggestions.
    ngrams - The maximum number of tokens out of which shingles will be made
    for the dictionary. The default value is 2. Increase this if you want more
    than the previous 2 tokens to be taken into consideration when making
    suggestions.
- FSTLookupFactory: An FST based suggester.
    exactMatchFirst - If true, exact suggestions are returned first, even if
    they are prefixes of other strings in the FST that have larger weights.
    Default is true.
    weightBuckets - The number of separate buckets for weights which the
    suggester will use while building its dictionary.
- TSTLookupFactory: A simple compact ternary trie based lookup.
- WFSTLookupFactory: Weighted automaton representation; an alternative to
FSTLookup for more fine-grained ranking. WFSTLookup does not use buckets, but
instead a shortest path algorithm. Note that it expects weights to be whole
numbers.
- JaspellLookupFactory: A more complex lookup based on a ternary trie from the
JaSpell (http://jaspell.sourceforge.net/) project.
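To make the two blenderType formulas listed under BlendedInfixLookupFactory
concrete, here is a minimal sketch in Python that evaluates them for a given
position. The function name is hypothetical and this is only the arithmetic
from the option description, not the Lucene implementation:

```python
def blended_weight(weight_field_value: float, position: int,
                   blender_type: str = "linear") -> float:
    """Apply the blenderType formula to a suggestion's weight.

    position is the position of the first matching word in the suggestion,
    counted from 0 (an assumption; the description does not state the base).
    """
    if blender_type == "linear":
        # weightFieldValue * (1 - 0.10 * position)
        return weight_field_value * (1 - 0.10 * position)
    if blender_type == "reciprocal":
        # weightFieldValue / (1 + position)
        return weight_field_value / (1 + position)
    raise ValueError("unknown blender type: " + blender_type)


# Compare both formulas for a weight of 100 at the first few positions.
for pos in range(4):
    print(pos, blended_weight(100, pos, "linear"),
          blended_weight(100, pos, "reciprocal"))
```

With both formulas the result shrinks as position grows, so a match on the
first word keeps the full weightFieldValue.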
--------------------------------------------------------------------------------------------------------------------------------------------------------
Dictionary pluggability - Gives users the option to choose the dictionary
implementation that their suggesters consume.
Input from search index
DocumentDictionaryFactory - You need to specify the suggestion field ('field')
along with the weight ('weightField') and payload ('payloadField') fields from
your search index.
DocumentExpressionDictionaryFactory - Same as DocumentDictionaryFactory but
allows users to specify an arbitrary expression in the 'weightExpression' tag.
    weightExpression - An arbitrary expression used for scoring the
    suggestions. The fields it references need to be numeric fields.
HighFrequencyDictionaryFactory - user can specify a suggestion field and
specify a threshold to prune out less frequent terms.
    threshold - A value between zero and one representing the minimum fraction
    of the total documents where a term should appear in order to be added to
    the lookup dictionary.
Input from external files
FileDictionaryFactory - user can specify a file which contains suggest entries,
along with weights and payloads. One entry is allowed per line.
    fieldDelimiter - The delimiter separating the entries, weights and
    payloads. The default is tab.
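As an illustration of the FileDictionaryFactory input described above, the
following Python sketch splits one line of such a file on the fieldDelimiter.
It assumes the fields appear in the order entry, weight, payload, that weights
are whole numbers, and that trailing fields may be left out; the names
SuggestEntry and parse_entry are hypothetical and this is not the Solr parser:

```python
from typing import NamedTuple, Optional


class SuggestEntry(NamedTuple):
    term: str
    weight: Optional[int]
    payload: Optional[str]


def parse_entry(line: str, field_delimiter: str = "\t") -> SuggestEntry:
    """Split one dictionary line into entry, weight and payload."""
    parts = line.rstrip("\r\n").split(field_delimiter)
    term = parts[0]
    # Missing or empty weight and payload fields are reported as None.
    weight = int(parts[1]) if len(parts) > 1 and parts[1] else None
    payload = parts[2] if len(parts) > 2 else None
    return SuggestEntry(term, weight, payload)


print(parse_entry("electronics\t100\tcategory-1\n"))
print(parse_entry("computers|10", field_delimiter="|"))
```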
--------------------------------------------------------------------------------------------------------------------------------------------------------
Using Multiple Suggesters -
You can request suggestions from multiple suggesters for the same query by
repeating the suggest.dictionary parameter.
Example Syntax -
http://localhost:8983/solr/suggest?suggest=true&suggest.dictionary=suggest1&suggest.dictionary=suggest2&suggest.q=python
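A request like the one above can also be assembled programmatically. A minimal
sketch in Python, using only the query-time parameters listed earlier; the
function name build_suggest_url is hypothetical, and the host, handler path
and suggester names are the ones from the example:

```python
from typing import Optional, Sequence
from urllib.parse import urlencode


def build_suggest_url(base_url: str, dictionaries: Sequence[str], query: str,
                      count: Optional[int] = None) -> str:
    """Build a suggest request that asks several suggesters at once."""
    params = [("suggest", "true")]
    # suggest.dictionary is repeated once per suggester.
    params += [("suggest.dictionary", name) for name in dictionaries]
    params.append(("suggest.q", query))
    if count is not None:
        params.append(("suggest.count", str(count)))
    # urlencode keeps repeated keys when given a list of pairs.
    return base_url + "?" + urlencode(params)


url = build_suggest_url("http://localhost:8983/solr/suggest",
                        ["suggest1", "suggest2"], "python")
print(url)
```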
> Documentation of Suggester V2
> -----------------------------
>
> Key: SOLR-5683
> URL: https://issues.apache.org/jira/browse/SOLR-5683
> Project: Solr
> Issue Type: Task
> Components: SearchComponents - other
> Reporter: Areek Zillur
> Assignee: Areek Zillur
> Fix For: 4.9, 5.0
>
>
> Place holder for documentation that will eventually end up in the Solr Ref
> guide.
> ====
> The new Suggester Component allows Solr to fully utilize the Lucene
> suggesters.
> The main features are:
> - lookup pluggability (TODO: add description):
> -- AnalyzingInfixLookupFactory
> -- AnalyzingLookupFactory
> -- FuzzyLookupFactory
> -- FreeTextLookupFactory
> -- FSTLookupFactory
> -- WFSTLookupFactory
> -- TSTLookupFactory
> -- JaspellLookupFactory
> - Dictionary pluggability (give users the option to choose the dictionary
> implementation to use for their suggesters to consume)
> -- Input from search index
> --- DocumentDictionaryFactory – user can specify suggestion field along
> with optional weight and payload fields from their search index.
> --- DocumentExpressionDictionaryFactory – same as DocumentDictionaryFactory
> but allows users to specify arbitrary expression using existing numeric
> fields.
> --- HighFrequencyDictionaryFactory – user can specify a suggestion field
> and specify a threshold to prune out less frequent terms.
> -- Input from external files
> --- FileDictionaryFactory – user can specify a file which contains
> suggest entries, along with optional weights and payloads.
> Config (index time) options:
> - name - name of suggester
> - sourceLocation - external file location (for file-based suggesters)
> - lookupImpl - type of lookup to use [default JaspellLookupFactory]
> - dictionaryImpl - type of dictionary to use (lookup input) [default
> (sourceLocation == null ? HighFrequencyDictionaryFactory :
> FileDictionaryFactory)]
> - storeDir - location to store in-memory data structure on disk
> - buildOnCommit - command to build suggester for every commit
> - buildOnOptimize - command to build suggester for every optimize
> Query time options:
> - suggest.dictionary - name of suggester to use (can occur multiple times
> for batching suggester requests)
> - suggest.count - number of suggestions to return
> - suggest.q - query to use for lookup
> - suggest.build - command to build the suggester
> - suggest.reload - command to reload the suggester
> - suggest.buildAll – command to build all suggesters in the component
> - suggest.reloadAll – command to reload all suggesters in the component
> Example query:
> {code}
> http://localhost:8983/solr/suggest?suggest.dictionary=suggester1&suggest=true&suggest.build=true&suggest.q=elec
> {code}
> Distributed query:
> {code}
> http://localhost:7574/solr/suggest?suggest.dictionary=suggester2&suggest=true&suggest.build=true&suggest.q=elec&shards=localhost:8983/solr,localhost:7574/solr&shards.qt=/suggest
> {code}
> Response Format:
> The response format can be either XML or JSON. The typical response structure
> is as follows:
> {code}
> {
>   suggest: {
>     suggester_name: {
>       suggest_query: { numFound: .., suggestions: [ {term: .., weight: ..,
>         payload: ..}, .. ]}
>     }
>   }
> }
> {code}
>
> Example Response:
> {code}
> {
>   responseHeader: {
>     status: 0,
>     QTime: 3
>   },
>   suggest: {
>     suggester1: {
>       e: {
>         numFound: 1,
>         suggestions: [
>           {
>             term: "electronics and computer1",
>             weight: 100,
>             payload: ""
>           }
>         ]
>       }
>     },
>     suggester2: {
>       e: {
>         numFound: 1,
>         suggestions: [
>           {
>             term: "electronics and computer1",
>             weight: 10,
>             payload: ""
>           }
>         ]
>       }
>     }
>   }
> }
> {code}
> Example solrconfig snippet with multiple suggester configuration:
> {code}
> <searchComponent name="suggest" class="solr.SuggestComponent">
>   <lst name="suggester">
>     <str name="name">suggester1</str>
>     <str name="lookupImpl">FuzzyLookupFactory</str>
>     <str name="dictionaryImpl">DocumentDictionaryFactory</str>
>     <str name="field">cat</str>
>     <str name="weightField">price</str>
>     <str name="suggestAnalyzerFieldType">string</str>
>   </lst>
>   <lst name="suggester">
>     <str name="name">suggester2</str>
>     <str name="dictionaryImpl">DocumentExpressionDictionaryFactory</str>
>     <str name="lookupImpl">FuzzyLookupFactory</str>
>     <str name="field">product_name</str>
>     <str name="weightExpression">((price * 2) + ln(popularity))</str>
>     <str name="sortField">weight</str>
>     <str name="sortField">price</str>
>     <str name="storeDir">suggest_fuzzy_doc_expr_dict</str>
>     <str name="suggestAnalyzerFieldType">text</str>
>   </lst>
> </searchComponent>
> {code}