This is an automated email from the ASF dual-hosted git repository.
epugh pushed a commit to branch branch_9x
in repository https://gitbox.apache.org/repos/asf/solr.git
The following commit(s) were added to refs/heads/branch_9x by this push:
new e800cc5f191 Update indexing docs in Solr ref guide (#1961)
e800cc5f191 is described below
commit e800cc5f1918418eee877e7f483c46b8d1ae19c5
Author: Andrey Bozhko <[email protected]>
AuthorDate: Tue Oct 3 17:51:16 2023 -0500
Update indexing docs in Solr ref guide (#1961)
Wide variety of fixes to the docs related to indexing to account for
evolution of Solr 9x and Lucene 9x
---------
Co-authored-by: Andrey Bozhko <[email protected]>
Co-authored-by: Eric Pugh <[email protected]>
---
.../modules/indexing-guide/indexing-nav.adoc | 2 +-
.../modules/indexing-guide/pages/analyzers.adoc | 2 +-
.../{charfilterfactories.adoc => charfilters.adoc} | 2 +-
.../indexing-guide/pages/document-analysis.adoc | 2 +-
.../field-type-definitions-and-properties.adoc | 2 +-
.../modules/indexing-guide/pages/filters.adoc | 150 +++++++++---
.../indexing-guide/pages/schema-elements.adoc | 11 +-
.../modules/indexing-guide/pages/tokenizers.adoc | 264 +++++++++++++++++++--
8 files changed, 372 insertions(+), 63 deletions(-)
diff --git a/solr/solr-ref-guide/modules/indexing-guide/indexing-nav.adoc b/solr/solr-ref-guide/modules/indexing-guide/indexing-nav.adoc
index 532f56eb10c..e78233d4a5d 100644
--- a/solr/solr-ref-guide/modules/indexing-guide/indexing-nav.adoc
+++ b/solr/solr-ref-guide/modules/indexing-guide/indexing-nav.adoc
@@ -43,7 +43,7 @@
** xref:analyzers.adoc[]
** xref:tokenizers.adoc[]
** xref:filters.adoc[]
-** xref:charfilterfactories.adoc[]
+** xref:charfilters.adoc[]
** xref:language-analysis.adoc[]
** xref:phonetic-matching.adoc[]
** xref:analysis-screen.adoc[]
diff --git a/solr/solr-ref-guide/modules/indexing-guide/pages/analyzers.adoc b/solr/solr-ref-guide/modules/indexing-guide/pages/analyzers.adoc
index 280f9289ba3..876f15dc367 100644
--- a/solr/solr-ref-guide/modules/indexing-guide/pages/analyzers.adoc
+++ b/solr/solr-ref-guide/modules/indexing-guide/pages/analyzers.adoc
@@ -180,7 +180,7 @@ For most use cases, this provides the best possible
behavior, but if you wish fo
</analyzer>
<!-- No analysis at all when doing queries that involved Multi-Term expansion -->
<analyzer type="multiterm">
- <tokenizer class="solr.KeywordTokenizerFactory" />
+ <tokenizer name="keyword" />
</analyzer>
</fieldType>
----
diff --git a/solr/solr-ref-guide/modules/indexing-guide/pages/charfilterfactories.adoc b/solr/solr-ref-guide/modules/indexing-guide/pages/charfilters.adoc
similarity index 99%
rename from solr/solr-ref-guide/modules/indexing-guide/pages/charfilterfactories.adoc
rename to solr/solr-ref-guide/modules/indexing-guide/pages/charfilters.adoc
index f20923fc034..abcf27c537d 100644
--- a/solr/solr-ref-guide/modules/indexing-guide/pages/charfilterfactories.adoc
+++ b/solr/solr-ref-guide/modules/indexing-guide/pages/charfilters.adoc
@@ -1,4 +1,4 @@
-= CharFilterFactories
+= CharFilters
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
diff --git a/solr/solr-ref-guide/modules/indexing-guide/pages/document-analysis.adoc b/solr/solr-ref-guide/modules/indexing-guide/pages/document-analysis.adoc
index 5b56449c4d3..37d94f2ec63 100644
--- a/solr/solr-ref-guide/modules/indexing-guide/pages/document-analysis.adoc
+++ b/solr/solr-ref-guide/modules/indexing-guide/pages/document-analysis.adoc
@@ -50,7 +50,7 @@ It also serves as a guide so that you can configure your own
analysis classes if
| xref:analyzers.adoc[]: Overview of Solr analyzers.
| xref:tokenizers.adoc[]: Tokenizers and tokenizer factory classes.
| xref:filters.adoc[]: Filters and filter factory classes.
-| xref:charfilterfactories.adoc[]: Filters for pre-processing input characters.
+| xref:charfilters.adoc[]: Filters for pre-processing input characters.
| xref:language-analysis.adoc[]: Tokenizers and filters for character set
conversion and specific languages.
| xref:analysis-screen.adoc[]: Admin UI for testing field analysis.
|===
diff --git a/solr/solr-ref-guide/modules/indexing-guide/pages/field-type-definitions-and-properties.adoc b/solr/solr-ref-guide/modules/indexing-guide/pages/field-type-definitions-and-properties.adoc
index e415023984e..39f2ee5751f 100644
--- a/solr/solr-ref-guide/modules/indexing-guide/pages/field-type-definitions-and-properties.adoc
+++ b/solr/solr-ref-guide/modules/indexing-guide/pages/field-type-definitions-and-properties.adoc
@@ -201,7 +201,7 @@ The table below includes the default value for most
`FieldType` implementations
|`sortMissingFirst`, `sortMissingLast` |Control the placement of documents when a sort field is not present. |`false`
|`multiValued` |If `true`, indicates that a single document might contain multiple values for this field type. |`false`
|`uninvertible` |If `true`, indicates that an `indexed="true" docValues="false"` field can be "un-inverted" at query time to build up large in memory data structure to serve in place of xref:docvalues.adoc[]. *Defaults to `true` for historical reasons, but users are strongly encouraged to set this to `false` for stability and use `docValues="true"` as needed.* |`true`
-|`omitNorms` |If `true`, omits the norms associated with this field (this disables length normalization for the field, and saves some memory). *Defaults to true for all primitive (non-analyzed) field types, such as int, float, data, bool, and string.* Only full-text fields or fields need norms. |*
+|`omitNorms` |If `true`, omits the norms associated with this field (this disables length normalization for the field, and saves some memory). *Defaults to true for all primitive (non-analyzed) field types, such as int, float, data, bool, and string.* Only full-text fields or fields that need an index-time boost need norms. |*
|`omitTermFreqAndPositions` |If `true`, omits term frequency, positions, and payloads from postings for this field. This can be a performance boost for fields that don't require that information. It also reduces the storage space required for the index. Queries that rely on position that are issued on a field with this option will silently fail to find documents. *This property defaults to true for all field types that are not text fields.* |*
|`omitPositions` |Similar to `omitTermFreqAndPositions` but preserves term frequency information. |*
|`termVectors`, `termPositions`, `termOffsets`, `termPayloads` |These options instruct Solr to maintain full term vectors for each document, optionally including position, offset, and payload information for each term occurrence in those vectors. These can be used to accelerate highlighting and other ancillary functionality, but impose a substantial cost in terms of index size. They are not necessary for typical uses of Solr. |`false`
diff --git a/solr/solr-ref-guide/modules/indexing-guide/pages/filters.adoc b/solr/solr-ref-guide/modules/indexing-guide/pages/filters.adoc
index b9089f0c8d1..fa2c3365a56 100644
--- a/solr/solr-ref-guide/modules/indexing-guide/pages/filters.adoc
+++ b/solr/solr-ref-guide/modules/indexing-guide/pages/filters.adoc
@@ -263,8 +263,7 @@ The value `auto` will allow the filter to identify the
language, or a comma-sepa
----
<analyzer>
<tokenizer name="standard"/>
- <filter name="beiderMorse" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto">
- </filter>
+ <filter name="beiderMorse" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
</analyzer>
----
====
@@ -275,8 +274,7 @@ The value `auto` will allow the filter to identify the
language, or a comma-sepa
----
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
- <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto">
- </filter>
+ <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
</analyzer>
----
====
@@ -327,7 +325,7 @@ This filter takes the output of the
xref:tokenizers.adoc#classic-tokenizer[Class
== Common Grams Filter
This filter for use in `index` time analysis creates word shingles by
combining common tokens such as stop words with regular tokens.
-This can result in an index with more unique terms, but is useful for creating phrase queries containing common words, such as "the cat", in a way that will typically be much faster then if the combined tokens are not used, because only the term positions of documents containg both terms in sequence have to be considered.
+This can result in an index with more unique terms, but is useful for creating phrase queries containing common words, such as "the cat", in a way that will typically be much faster than if the combined tokens are not used, because only the term positions of documents containing both terms in sequence have to be considered.
Correct usage requires being paired with <<Common Grams Query Filter>> during
`query` analysis.
These filters can also be combined with <<Stop Filter>> so searching for `"the
cat"` would match different documents then `"a cat"`, while pathological
searches for either `"the"` or `"a"` would not match any documents.
@@ -409,7 +407,7 @@ If `true`, the filter ignores the case of words when
comparing them to the commo
== Common Grams Query Filter
-This filter is used for the `query` time analysis aspect of <<Common Grams Filter>> -- see that filer for a description of arguments, example configuration, and sample input/output.
+This filter is used for the `query` time analysis aspect of <<Common Grams Filter>> -- see that filter for a description of arguments, example configuration, and sample input/output.
== Collation Key Filter
@@ -580,8 +578,8 @@ The character used to separate the token and the boost.
[source,xml]
----
<analyzer>
-<tokenizer name="standard"/>
-<filter name="delimitedBoost"/>
+ <tokenizer name="standard"/>
+ <filter name="delimitedBoost"/>
</analyzer>
----
====
@@ -591,8 +589,8 @@ The character used to separate the token and the boost.
[source,xml]
----
<analyzer>
-<tokenizer class="solr.StandardTokenizerFactory"/>
-<filter class="solr.DelimitedBoostTokenFilterFactory"/>
+ <tokenizer class="solr.StandardTokenizerFactory"/>
+ <filter class="solr.DelimitedBoostTokenFilterFactory"/>
</analyzer>
----
====
@@ -613,8 +611,8 @@ Using a different delimiter (`delimiter="/"`).
[source,xml]
----
<analyzer>
-<tokenizer name="standard"/>
-<filter name="delimitedBoost" delimiter="/"/>
+ <tokenizer name="standard"/>
+ <filter name="delimitedBoost" delimiter="/"/>
</analyzer>
----
@@ -638,17 +636,17 @@ This filter generates edge n-gram tokens of sizes within
the given range.
+
[%autowidth,frame=none]
|===
-|Optional |Default: `1`
+|Required |Default: none
|===
-The minimum gram size.
+The minimum gram size, must be > 0.
`maxGramSize`::
+
[%autowidth,frame=none]
|===
-|Optional |Default: `1`
+|Required |Default: none
|===
-The maximum gram size.
+The maximum gram size, must be >= `minGramSize`.
`preserveOriginal`::
+
@@ -672,7 +670,7 @@ Default behavior.
----
<analyzer>
<tokenizer name="standard"/>
- <filter name="edgeNGram"/>
+ <filter name="edgeNGram" minGramSize="1" maxGramSize="1"/>
</analyzer>
----
====
@@ -683,7 +681,7 @@ Default behavior.
----
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
- <filter class="solr.EdgeNGramFilterFactory"/>
+ <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="1"/>
</analyzer>
----
====
@@ -947,6 +945,15 @@ The path of a rules file.
+
Controls whether matching is case sensitive or not.
+`longestOnly`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `false`
+|===
++
+If `true`, only the longest term is emitted.
+
`strictAffixParsing`::
+
[%autowidth,frame=none]
@@ -1108,7 +1115,7 @@ For detailed information on this normalization form, see
http://www.unicode.org/
== ICU Normalizer 2 Filter
-This filter factory normalizes text according to one of five Unicode Normalization Forms as described in http://unicode.org/reports/tr15/[Unicode Standard Annex #15]:
+This filter normalizes text according to one of five Unicode Normalization Forms as described in http://unicode.org/reports/tr15/[Unicode Standard Annex #15]:
* NFC: (`name="nfc" mode="compose"`) Normalization Form C, canonical
decomposition
* NFD: (`name="nfc" mode="decompose"`) Normalization Form D, canonical
decomposition, followed by canonical composition
@@ -1214,6 +1221,15 @@ s|Required |Default: none
The identifier for the ICU System Transform you wish to apply with this filter.
For a full list of ICU System Transforms, see
http://demo.icu-project.org/icu-bin/translit?TEMPLATE_FILE=data/translit_rule_main.html.
+`direction`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `forward`
+|===
++
+The direction of the ICU transform. Valid options are `forward` and `reverse`.
+
*Example:*
[.dynamic-tabs]
@@ -1268,6 +1284,15 @@ Path to a text file containing the list of keep words,
one per line.
Blank lines and lines that begin with `\#` are ignored.
This may be an absolute path, or a simple filename in the Solr `conf`
directory.
+`format`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: none
+|===
++
+If the keepwords list has been formatted for Snowball, you can specify `format="snowball"` so Solr can read the keepwords file.
+
`ignoreCase`::
+
[%autowidth,frame=none]
@@ -1467,12 +1492,14 @@ Tokens longer than this are discarded.
This filter limits the number of accepted tokens, typically useful for index
analysis.
By default, this filter ignores any tokens in the wrapped `TokenStream` once
the limit has been reached, which can result in `reset()` being called prior to
`incrementToken()` returning `false`.
-For most `TokenStream` implementations this should be acceptable, and faster then consuming the full stream.
+For most `TokenStream` implementations this should be acceptable, and faster than consuming the full stream.
If you are wrapping a `TokenStream` which requires that the full stream of
tokens be exhausted in order to function properly, use the
`consumeAllTokens="true"` option.
*Factory class:* `solr.LimitTokenCountFilterFactory`
*Arguments:*
+
+`maxTokenCount`::
+
[%autowidth,frame=none]
|===
@@ -1534,14 +1561,21 @@ This filter limits tokens to those before a configured
maximum start character o
This can be useful to limit highlighting, for example.
By default, this filter ignores any tokens in the wrapped `TokenStream` once
the limit has been reached, which can result in `reset()` being called prior to
`incrementToken()` returning `false`.
-For most `TokenStream` implementations this should be acceptable, and faster then consuming the full stream.
+For most `TokenStream` implementations this should be acceptable, and faster than consuming the full stream.
If you are wrapping a `TokenStream` which requires that the full stream of
tokens be exhausted in order to function properly, use the
`consumeAllTokens="true"` option.
*Factory class:* `solr.LimitTokenOffsetFilterFactory`
*Arguments:*
-`maxStartOffset`:: (integer, required) Maximum token start character offset.
+`maxStartOffset`::
++
+[%autowidth,frame=none]
+|===
+s|Required |Default: none
+|===
++
+Maximum token start character offset.
After this limit has been reached, tokens are discarded.
`consumeAllTokens`::
@@ -1595,7 +1629,7 @@ See description above.
This filter limits tokens to those before a configured maximum token position.
By default, this filter ignores any tokens in the wrapped `TokenStream` once
the limit has been reached, which can result in `reset()` being called prior to
`incrementToken()` returning `false`.
-For most `TokenStream` implementations this should be acceptable, and faster then consuming the full stream.
+For most `TokenStream` implementations this should be acceptable, and faster than consuming the full stream.
If you are wrapping a `TokenStream` which requires that the full stream of
tokens be exhausted in order to function properly, use the
`consumeAllTokens="true"` option.
*Factory class:* `solr.LimitTokenPositionFilterFactory`
@@ -1628,7 +1662,7 @@ See description above.
--
[example.tab-pane#byname-filter-limittokenposition]
====
-[.tab-label]*With name)*
+[.tab-label]*With name*
[source,xml]
----
<analyzer>
@@ -1832,7 +1866,7 @@ This filter would normally be preceded by a <<Shingle
Filter>>, as shown in the
Each input token is hashed.
It is subsequently "rehashed" `hashCount` times by combining with a set of
precomputed hashes.
-For each of the resulting hashes, the hash space is divided in to `bucketCount` buckets.
+For each of the resulting hashes, the hash space is divided into `bucketCount` buckets.
The lowest set of `hashSetSize` hashes (usually a set of one) is generated for
each bucket.
This filter generates one type of signature or sketch for the input tokens and
can be used to compute Jaccard similarity between documents.
@@ -1881,6 +1915,24 @@ With the default settings for `withRotation`, the number
of hashes generated is
*Example:*
+[.dynamic-tabs]
+--
+[example.tab-pane#byname-filter-minhash]
+====
+[.tab-label]*With name*
+[source,xml]
+----
+<analyzer>
+ <tokenizer name="icu"/>
+ <filter name="icuFolding"/>
+ <filter name="shingle" minShingleSize="5" outputUnigrams="false" outputUnigramsIfNoShingles="false" maxShingleSize="5" tokenSeparator=" "/>
+ <filter name="minHash" bucketCount="512" hashSetSize="1" hashCount="1"/>
+</analyzer>
+----
+====
+[example.tab-pane#byclass-filter-minhash]
+====
+[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
@@ -1890,6 +1942,8 @@ With the default settings for `withRotation`, the number
of hashes generated is
<filter class="org.apache.lucene.analysis.minhash.MinHashFilterFactory" bucketCount="512" hashSetSize="1" hashCount="1"/>
</analyzer>
----
+====
+--
*In:* "woof woof woof woof woof"
@@ -1910,18 +1964,18 @@ Note that tokens are ordered by position and then by
gram size.
+
[%autowidth,frame=none]
|===
-|Optional |Default: `1`
+|Required |Default: none
|===
-The minimum gram size.
+The minimum gram size, must be > 0.
`maxGramSize`::
+
[%autowidth,frame=none]
|===
-|Optional |Default: `2`
+|Required |Default: none
|===
+
-The maximum gram size.
+The maximum gram size, must be >= `minGramSize`.
`preserveOriginal`::
+
@@ -1945,7 +1999,7 @@ Default behavior.
----
<analyzer>
<tokenizer name="standard"/>
- <filter name="nGram"/>
+ <filter name="nGram" minGramSize="1" maxGramSize="2"/>
</analyzer>
----
====
@@ -1956,7 +2010,7 @@ Default behavior.
----
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
- <filter class="solr.NGramFilterFactory"/>
+ <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="2"/>
</analyzer>
----
====
@@ -2087,7 +2141,7 @@ Tokens with a matching type name will have their payload
set to the above floati
== Pattern Replace Filter
This filter applies a regular expression to each token and, for those that
match, substitutes the given replacement string in place of the matched pattern.
-Tokens which do not match are passed though unchanged.
+Tokens which do not match are passed through unchanged.
*Factory class:* `solr.PatternReplaceFilterFactory`
@@ -3096,10 +3150,10 @@ If `false`, all equivalent synonyms will be reduced to
the first in the list.
+
[%autowidth,frame=none]
|===
-|Optional |Default: none
+|Optional |Default: `solr`
|===
+
-(optional; default: `solr`) Controls how the synonyms will be parsed.
+Controls how the synonyms will be parsed.
The short names `solr` (for
{lucene-javadocs}/analysis/common/org/apache/lucene/analysis/synonym/SolrSynonymParser.html[`SolrSynonymParser`]
and `wordnet` (for
{lucene-javadocs}/analysis/common/org/apache/lucene/analysis/synonym/WordnetSynonymParser.html[`WordnetSynonymParser`]
) are supported.
You may alternatively supply the name of your own
{lucene-javadocs}/analysis/common/org/apache/lucene/analysis/synonym/SynonymMap.Builder.html[`SynonymMap.Builder`]
subclass.
@@ -3368,6 +3422,25 @@ This filter adds the token's type, as a token at the
same position as the token,
+
The prefix to prepend to the token's type.
+`ignore`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: none
+|===
++
+A comma-separated list of types to ignore and not convert to synonyms.
+
+`synFlagsMask`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: see description
+|===
++
+A mask (provided as an integer) to control what flags are propagated to the synonyms.
+The default value is an integer `-1`, i.e., the mask `0xFFFFFFFF` - this mask propagates any flags as is.
+
*Examples:*
With the example below, each token's type will be emitted verbatim at the same
position:
@@ -3614,6 +3687,15 @@ The path to a file that contains a list of protected
words that should be passed
+
If `1`, strips the possessive `'s` from each subword.
+`adjustOffsets`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `true`
+|===
++
+If `true`, the offsets of partial terms are adjusted.
+
`types`::
+
[%autowidth,frame=none]
diff --git a/solr/solr-ref-guide/modules/indexing-guide/pages/schema-elements.adoc b/solr/solr-ref-guide/modules/indexing-guide/pages/schema-elements.adoc
index abb1b7492cf..ee45819dab2 100644
--- a/solr/solr-ref-guide/modules/indexing-guide/pages/schema-elements.adoc
+++ b/solr/solr-ref-guide/modules/indexing-guide/pages/schema-elements.adoc
@@ -95,18 +95,23 @@ However, `uniqueKey` will continue to work, as long as the
field is properly use
Similarity is a Lucene class used to score a document in searching.
Each collection has one "global" Similarity.
-By default Solr uses an implicit {solr-javadocs}/core/org/apache/solr/search/similarities/SchemaSimilarityFactory.html[`SchemaSimilarityFactory`] which allows individual field types to be configured with a "per-type" specific Similarity and implicitly uses `BM25Similarity` for any field type which does not have an explicit Similarity.
+By default, Solr uses an implicit {solr-javadocs}/core/org/apache/solr/search/similarities/SchemaSimilarityFactory.html[`SchemaSimilarityFactory`] which allows individual field types to be configured with a "per-type" specific Similarity and implicitly uses `BM25Similarity` for any field type which does not have an explicit Similarity.
This default behavior can be overridden by declaring a top level
`<similarity/>` element in your schema, outside of any single field type.
This similarity declaration can either refer directly to the name of a class
with a no-argument constructor, such as in this example showing
`BM25Similarity`:
[source,xml]
----
-<similarity class="solr.BM25SimilarityFactory"/>
+<similarity class="org.apache.lucene.search.similarities.BM25Similarity"/>
----
-or by referencing a `SimilarityFactory` implementation, which may take optional initialization parameters:
+or by referencing a `SimilarityFactory` implementation:
+[source,xml]
+----
+<similarity class="solr.BM25SimilarityFactory"/>
+----
+When using the similarity factory, it is possible to specify optional initialization parameters:
[source,xml]
----
<similarity class="solr.DFRSimilarityFactory">
diff --git a/solr/solr-ref-guide/modules/indexing-guide/pages/tokenizers.adoc b/solr/solr-ref-guide/modules/indexing-guide/pages/tokenizers.adoc
index 24d04ec1802..fbb69399efc 100644
--- a/solr/solr-ref-guide/modules/indexing-guide/pages/tokenizers.adoc
+++ b/solr/solr-ref-guide/modules/indexing-guide/pages/tokenizers.adoc
@@ -100,7 +100,7 @@ Arguments may be passed to tokenizer factories by setting
attributes on the `<to
=== When to Use a CharFilter vs. a TokenFilter
-There are several pairs of CharFilters and TokenFilters that have related (i.e., `MappingCharFilter` and `ASCIIFoldingFilter`) or nearly identical (i.e., `PatternReplaceCharFilterFactory` and `PatternReplaceFilterFactory`) functionality and it may not always be obvious which is the best choice.
+There are several pairs of CharFilters and TokenFilters that have related (i.e., `MappingCharFilter` and `ASCIIFoldingFilter`) or nearly identical (i.e., `PatternReplaceCharFilterFactory` and `PatternReplaceFilterFactory`) functionality, and it may not always be obvious which is the best choice.
The decision about which to use depends largely on which Tokenizer you are
using, and whether you need to preprocess the stream of characters.
@@ -125,7 +125,14 @@ The Standard Tokenizer supports
http://unicode.org/reports/tr29/#Word_Boundaries
*Arguments:*
-`maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`.
+`maxTokenLength`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `255`
+|===
++
+Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`.
*Example:*
@@ -174,7 +181,14 @@ Delimiter characters are discarded, with the following
exceptions:
*Arguments:*
-`maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`.
+`maxTokenLength`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `255`
+|===
++
+Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`.
*Example:*
@@ -212,7 +226,16 @@ This tokenizer treats the entire text field as a single
token.
*Factory class:* `solr.KeywordTokenizerFactory`
-*Arguments:* None
+*Arguments:*
+
+`maxTokenLen`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `256`
+|===
++
+Maximum token length the tokenizer will emit.
*Example:*
@@ -250,7 +273,16 @@ This tokenizer creates tokens from strings of contiguous
letters, discarding all
*Factory class:* `solr.LetterTokenizerFactory`
-*Arguments:* None
+*Arguments:*
+
+`maxTokenLen`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `255`
+|===
++
+Maximum token length the tokenizer will emit.
*Example:*
@@ -289,7 +321,16 @@ Whitespace and non-letters are discarded.
*Factory class:* `solr.LowerCaseTokenizerFactory`
-*Arguments:* None
+*Arguments:*
+
+`maxTokenLen`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `255`
+|===
++
+Maximum token length the tokenizer will emit.
*Example:*
@@ -329,9 +370,23 @@ Reads the field text and generates n-gram tokens of sizes
in the given range.
*Arguments:*
-`minGramSize`: (integer, default 1) The minimum n-gram size, must be > 0.
+`minGramSize`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `1`
+|===
++
+The minimum n-gram size, must be > 0.
-`maxGramSize`: (integer, default 2) The maximum n-gram size, must be >= `minGramSize`.
+`maxGramSize`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `2`
+|===
++
+The maximum n-gram size, must be >= `minGramSize`.
*Example:*
@@ -408,9 +463,23 @@ Reads the field text and generates edge n-gram tokens of
sizes in the given rang
*Arguments:*
-`minGramSize`: (integer, default is 1) The minimum n-gram size, must be > 0.
+`minGramSize`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `1`
+|===
++
+The minimum n-gram size, must be > 0.
-`maxGramSize`: (integer, default is 1) The maximum n-gram size, must be >= `minGramSize`.
+`maxGramSize`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `1`
+|===
++
+The maximum n-gram size, must be >= `minGramSize`.
*Example:*
@@ -490,7 +559,33 @@ The default configuration for `solr.ICUTokenizerFactory`
provides UAX#29 word br
*Arguments:*
-`rulefile`: a comma-separated list of `code:rulefile` pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path.
+`rulefile`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: none
+|===
++
+A comma-separated list of `code:rulefile` pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path.
+
+`cjkAsWords`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `true`
+|===
++
+If `true`, CJK text would undergo dictionary-based segmentation, and all Han+Hiragana+Katakana words will be tagged as IDEOGRAPHIC.
+Otherwise, text will be segmented according to UAX#29 defaults.
+
+`myanmarAsWords`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `true`
+|===
++
+If `true`, Myanmar text would undergo dictionary-based segmentation, otherwise it will be tokenized as syllables.
*Example:*
@@ -562,13 +657,47 @@ This tokenizer creates synonyms from file path
hierarchies.
*Arguments:*
-`delimiter`: (character, no default) You can specify the file path delimiter and replace it with a delimiter you provide.
+`delimiter`::
++
+[%autowidth,frame=none]
+|===
+s|Required |Default: none
+|===
++
+You can specify the file path delimiter and replace it with a delimiter you provide.
This can be useful for working with backslash delimiters.
-`replace`: (character, no default) Specifies the delimiter character Solr uses in the tokenized output.
+`replace`::
++
+[%autowidth,frame=none]
+|===
+s|Required |Default: none
+|===
++
+Specifies the delimiter character Solr uses in the tokenized output.
+
+`reverse`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `false`
+|===
++
+If `true`, switch the tokenizer behavior to build the path hierarchy in "reversed" order.
+This is typically useful for tokenizing the URLs.
+
+`skip`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `0`
+|===
++
+Number of leftmost (or rightmost, if reverse=true) path elements to drop from each emitted token.
*Example:*
+Default behavior
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-pathhierarchy]
@@ -601,6 +730,41 @@ This can be useful for working with backslash delimiters.
*Out:* "c:", "c:/usr", "c:/usr/local", "c:/usr/local/apache"
+*Example:*
+
+Reverse order
+[.dynamic-tabs]
+--
+[example.tab-pane#byname-tokenizer-pathhierarchy-reversed]
+====
+[.tab-label]*With name*
+[source,xml]
+----
+<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
+ <analyzer>
+ <tokenizer name="pathHierarchy" delimiter="." replace="." reverse="true"/>
+ </analyzer>
+</fieldType>
+----
+====
+[example.tab-pane#byclass-tokenizer-pathhierarchy-reversed]
+====
+[.tab-label]*With class name (legacy)*
+[source,xml]
+----
+<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
+ <analyzer>
+ <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="." replace="." reverse="true"/>
+ </analyzer>
+</fieldType>
+----
+====
+--
+
+*In:* "www.site.co.uk"
+
+*Out:* "www.site.co.uk", "site.co.uk", "co.uk", "uk"
+
== Regular Expression Pattern Tokenizer
This tokenizer uses a Java regular expression to break the input text stream
into tokens.
@@ -612,9 +776,23 @@ See {java-javadocs}java/util/regex/Pattern.html[the
Javadocs for `java.util.rege
*Arguments:*
-`pattern`: (Required) The regular expression, as defined by in `java.util.regex.Pattern`.
+`pattern`::
++
+[%autowidth,frame=none]
+|===
+s|Required |Default: none
+|===
++
+The regular expression, as defined by in `java.util.regex.Pattern`.
-`group`: (Optional, default -1) Specifies which regex group to extract as the token(s).
+`group`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `-1`
+|===
++
+Specifies which regex group to extract as the token(s).
The value -1 means the regex should be treated as a delimiter that separates
tokens.
Non-negative group numbers (>= 0) indicate that character sequences matching
that regex group should be converted to tokens.
Group zero refers to the entire regex, groups greater than zero refer to
parenthesized sub-expressions of the regex, counted from left to right.
@@ -687,7 +865,7 @@ A sequence of at least one capital letter followed by zero
or more letters of ei
*Example:*
-Extract part numbers which are preceded by "SKU", "Part" or "Part Number", case sensitive, with an optional semi-colon separator.
+Extract part numbers which are preceded by "SKU", "Part" or "Part Number", case sensitive, with an optional semicolon separator.
Part numbers must be all numeric digits, with an optional hyphen.
Regex capture groups are numbered by counting left parenthesis from left to
right.
Group 3 is the subexpression "[0-9-]+", which matches one or more digits or
hyphens.
@@ -729,11 +907,25 @@ The syntax is more limited than
`PatternTokenizerFactory`, but the tokenization
*Arguments:*
-`pattern`: (Required) The regular expression, as defined by in the {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] javadocs, identifying the characters to include in tokens.
+`pattern`::
++
+[%autowidth,frame=none]
+|===
+s|Required |Default: none
+|===
++
+The regular expression, as defined in the {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] javadocs, identifying the characters to include in tokens.
The matching is greedy such that the longest token matching at a given point
is created.
Empty tokens are never created.
-`maxDeterminizedStates`: (Optional, default 10000) the limit on total state count for the determined automaton computed from the regexp.
+`determinizeWorkLimit`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `10000`
+|===
++
+The limit on total state count for the determined automaton computed from the regexp.
*Example:*
@@ -772,11 +964,25 @@ The syntax is more limited than
`PatternTokenizerFactory`, but the tokenization
*Arguments:*
-`pattern`: (Required) The regular expression, as defined by in the {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] javadocs, identifying the characters that should split tokens.
+`pattern`::
++
+[%autowidth,frame=none]
+|===
+s|Required |Default: none
+|===
++
+The regular expression, as defined by in the {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] javadocs, identifying the characters that should split tokens.
The matching is greedy such that the longest token separator matching at a
given point is matched.
Empty tokens are never created.
-`maxDeterminizedStates`: (Optional, default 10000) the limit on total state count for the determined automaton computed from the regexp.
+`determinizeWorkLimit`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `10000`
+|===
++
+The limit on total state count for the determined automaton computed from the regexp.
*Example:*
@@ -827,7 +1033,14 @@ The UAX29 URL Email Tokenizer supports
http://unicode.org/reports/tr29/#Word_Bou
*Arguments:*
-`maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`.
+`maxTokenLength`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `255`
+|===
++
+Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`.
*Example:*
@@ -881,6 +1094,15 @@ Valid values:
* `java`: Uses
{java-javadocs}java/lang/Character.html#isWhitespace-int-[Character.isWhitespace(int)]
* `unicode`: Uses Unicode's WHITESPACE property
+`maxTokenLen`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `255`
+|===
++
+Maximum token length the tokenizer will emit.
+
*Example:*
[.dynamic-tabs]