[solr] branch branch_9x updated: Update indexing docs in Solr ref guide (#1961)

epugh Tue, 03 Oct 2023 15:54:10 -0700

This is an automated email from the ASF dual-hosted git repository.

epugh pushed a commit to branch branch_9x
in repository https://gitbox.apache.org/repos/asf/solr.git



The following commit(s) were added to refs/heads/branch_9x by this push:
     new e800cc5f191 Update indexing docs in Solr ref guide (#1961)
e800cc5f191 is described below

commit e800cc5f1918418eee877e7f483c46b8d1ae19c5
Author: Andrey Bozhko <[email protected]>
AuthorDate: Tue Oct 3 17:51:16 2023 -0500

    Update indexing docs in Solr ref guide (#1961)
    
    Wide variety of fixes to the docs related to indexing to account for evolution of Solr 9x and Lucene 9x
    ---------
    
    Co-authored-by: Andrey Bozhko <[email protected]>
    Co-authored-by: Eric Pugh <[email protected]>
---
 .../modules/indexing-guide/indexing-nav.adoc       |   2 +-
 .../modules/indexing-guide/pages/analyzers.adoc    |   2 +-
 .../{charfilterfactories.adoc => charfilters.adoc} |   2 +-
 .../indexing-guide/pages/document-analysis.adoc    |   2 +-
 .../field-type-definitions-and-properties.adoc     |   2 +-
 .../modules/indexing-guide/pages/filters.adoc      | 150 +++++++++---
 .../indexing-guide/pages/schema-elements.adoc      |  11 +-
 .../modules/indexing-guide/pages/tokenizers.adoc   | 264 +++++++++++++++++++--
 8 files changed, 372 insertions(+), 63 deletions(-)

diff --git a/solr/solr-ref-guide/modules/indexing-guide/indexing-nav.adoc b/solr/solr-ref-guide/modules/indexing-guide/indexing-nav.adoc
index 532f56eb10c..e78233d4a5d 100644
--- a/solr/solr-ref-guide/modules/indexing-guide/indexing-nav.adoc
+++ b/solr/solr-ref-guide/modules/indexing-guide/indexing-nav.adoc
@@ -43,7 +43,7 @@
 ** xref:analyzers.adoc[]
 ** xref:tokenizers.adoc[]
 ** xref:filters.adoc[]
-** xref:charfilterfactories.adoc[]
+** xref:charfilters.adoc[]
 ** xref:language-analysis.adoc[]
 ** xref:phonetic-matching.adoc[]
 ** xref:analysis-screen.adoc[]
diff --git a/solr/solr-ref-guide/modules/indexing-guide/pages/analyzers.adoc b/solr/solr-ref-guide/modules/indexing-guide/pages/analyzers.adoc
index 280f9289ba3..876f15dc367 100644
--- a/solr/solr-ref-guide/modules/indexing-guide/pages/analyzers.adoc
+++ b/solr/solr-ref-guide/modules/indexing-guide/pages/analyzers.adoc
@@ -180,7 +180,7 @@ For most use cases, this provides the best possible 
behavior, but if you wish fo
   </analyzer>
   <!-- No analysis at all when doing queries that involved Multi-Term 
expansion -->
   <analyzer type="multiterm">
-    <tokenizer class="solr.KeywordTokenizerFactory" />
+    <tokenizer name="keyword" />
   </analyzer>
 </fieldType>
 ----
diff --git a/solr/solr-ref-guide/modules/indexing-guide/pages/charfilterfactories.adoc b/solr/solr-ref-guide/modules/indexing-guide/pages/charfilters.adoc
similarity index 99%
rename from solr/solr-ref-guide/modules/indexing-guide/pages/charfilterfactories.adoc
rename to solr/solr-ref-guide/modules/indexing-guide/pages/charfilters.adoc
index f20923fc034..abcf27c537d 100644
--- a/solr/solr-ref-guide/modules/indexing-guide/pages/charfilterfactories.adoc
+++ b/solr/solr-ref-guide/modules/indexing-guide/pages/charfilters.adoc
@@ -1,4 +1,4 @@
-= CharFilterFactories
+= CharFilters
 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
 // distributed with this work for additional information
diff --git a/solr/solr-ref-guide/modules/indexing-guide/pages/document-analysis.adoc b/solr/solr-ref-guide/modules/indexing-guide/pages/document-analysis.adoc
index 5b56449c4d3..37d94f2ec63 100644
--- a/solr/solr-ref-guide/modules/indexing-guide/pages/document-analysis.adoc
+++ b/solr/solr-ref-guide/modules/indexing-guide/pages/document-analysis.adoc
@@ -50,7 +50,7 @@ It also serves as a guide so that you can configure your own 
analysis classes if
 | xref:analyzers.adoc[]: Overview of Solr analyzers.
 | xref:tokenizers.adoc[]: Tokenizers and tokenizer factory classes.
 | xref:filters.adoc[]: Filters and filter factory classes.
-| xref:charfilterfactories.adoc[]: Filters for pre-processing input characters.
+| xref:charfilters.adoc[]: Filters for pre-processing input characters.
 | xref:language-analysis.adoc[]: Tokenizers and filters for character set 
conversion and specific languages.
 | xref:analysis-screen.adoc[]: Admin UI for testing field analysis.
 |===
diff --git a/solr/solr-ref-guide/modules/indexing-guide/pages/field-type-definitions-and-properties.adoc b/solr/solr-ref-guide/modules/indexing-guide/pages/field-type-definitions-and-properties.adoc
index e415023984e..39f2ee5751f 100644
--- a/solr/solr-ref-guide/modules/indexing-guide/pages/field-type-definitions-and-properties.adoc
+++ b/solr/solr-ref-guide/modules/indexing-guide/pages/field-type-definitions-and-properties.adoc
@@ -201,7 +201,7 @@ The table below includes the default value for most 
`FieldType` implementations
 |`sortMissingFirst`, `sortMissingLast` |Control the placement of documents 
when a sort field is not present. |`false`
 |`multiValued` |If `true`, indicates that a single document might contain 
multiple values for this field type. |`false`
 |`uninvertible` |If `true`, indicates that an `indexed="true" 
docValues="false"` field can be "un-inverted" at query time to build up large 
in memory data structure to serve in place of xref:docvalues.adoc[]. *Defaults 
to `true` for historical reasons, but users are strongly encouraged to set this 
to `false` for stability and use `docValues="true"` as needed.* |`true`
-|`omitNorms` |If `true`, omits the norms associated with this field (this 
disables length normalization for the field, and saves some memory). *Defaults 
to true for all primitive (non-analyzed) field types, such as int, float, data, 
bool, and string.* Only full-text fields or fields need norms. |*
+|`omitNorms` |If `true`, omits the norms associated with this field (this 
disables length normalization for the field, and saves some memory). *Defaults 
to true for all primitive (non-analyzed) field types, such as int, float, data, 
bool, and string.* Only full-text fields or fields that need an index-time 
boost need norms. |*
 |`omitTermFreqAndPositions` |If `true`, omits term frequency, positions, and 
payloads from postings for this field. This can be a performance boost for 
fields that don't require that information. It also reduces the storage space 
required for the index. Queries that rely on position that are issued on a 
field with this option will silently fail to find documents. *This property 
defaults to true for all field types that are not text fields.* |*
 |`omitPositions` |Similar to `omitTermFreqAndPositions` but preserves term 
frequency information. |*
 |`termVectors`, `termPositions`, `termOffsets`, `termPayloads` |These options 
instruct Solr to maintain full term vectors for each document, optionally 
including position, offset, and payload information for each term occurrence in 
those vectors. These can be used to accelerate highlighting and other ancillary 
functionality, but impose a substantial cost in terms of index size. They are 
not necessary for typical uses of Solr. |`false`
diff --git a/solr/solr-ref-guide/modules/indexing-guide/pages/filters.adoc b/solr/solr-ref-guide/modules/indexing-guide/pages/filters.adoc
index b9089f0c8d1..fa2c3365a56 100644
--- a/solr/solr-ref-guide/modules/indexing-guide/pages/filters.adoc
+++ b/solr/solr-ref-guide/modules/indexing-guide/pages/filters.adoc
@@ -263,8 +263,7 @@ The value `auto` will allow the filter to identify the 
language, or a comma-sepa
 ----
 <analyzer>
   <tokenizer name="standard"/>
-  <filter name="beiderMorse" nameType="GENERIC" ruleType="APPROX" 
concat="true" languageSet="auto">
-  </filter>
+  <filter name="beiderMorse" nameType="GENERIC" ruleType="APPROX" 
concat="true" languageSet="auto"/>
 </analyzer>
 ----
 ====
@@ -275,8 +274,7 @@ The value `auto` will allow the filter to identify the 
language, or a comma-sepa
 ----
 <analyzer>
   <tokenizer class="solr.StandardTokenizerFactory"/>
-  <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" 
ruleType="APPROX" concat="true" languageSet="auto">
-  </filter>
+  <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" 
ruleType="APPROX" concat="true" languageSet="auto"/>
 </analyzer>
 ----
 ====
@@ -327,7 +325,7 @@ This filter takes the output of the 
xref:tokenizers.adoc#classic-tokenizer[Class
 == Common Grams Filter
 
 This filter for use in `index` time analysis creates word shingles by 
combining common tokens such as stop words with regular tokens.
-This can result in an index with more unique terms, but is useful for creating 
phrase queries containing common words, such as "the cat", in a way that will 
typically be much faster then if the combined tokens are not used, because only 
the term positions of documents containg both terms in sequence have to be 
considered.
+This can result in an index with more unique terms, but is useful for creating 
phrase queries containing common words, such as "the cat", in a way that will 
typically be much faster than if the combined tokens are not used, because only 
the term positions of documents containing both terms in sequence have to be 
considered.
 Correct usage requires being paired with <<Common Grams Query Filter>> during 
`query` analysis.
 
 These filters can also be combined with <<Stop Filter>> so searching for `"the 
cat"` would match different documents then `"a cat"`, while pathological 
searches for either `"the"` or `"a"` would not match any documents.
@@ -409,7 +407,7 @@ If `true`, the filter ignores the case of words when 
comparing them to the commo
 
 == Common Grams Query Filter
 
-This filter is used for the `query` time analysis aspect of <<Common Grams 
Filter>> -- see that filer for a description of arguments, example 
configuration, and sample input/output.
+This filter is used for the `query` time analysis aspect of <<Common Grams 
Filter>> -- see that filter for a description of arguments, example 
configuration, and sample input/output.
 
 == Collation Key Filter
 
@@ -580,8 +578,8 @@ The character used to separate the token and the boost.
 [source,xml]
 ----
 <analyzer>
-<tokenizer name="standard"/>
-<filter name="delimitedBoost"/>
+  <tokenizer name="standard"/>
+  <filter name="delimitedBoost"/>
 </analyzer>
 ----
 ====
@@ -591,8 +589,8 @@ The character used to separate the token and the boost.
 [source,xml]
 ----
 <analyzer>
-<tokenizer class="solr.StandardTokenizerFactory"/>
-<filter class="solr.DelimitedBoostTokenFilterFactory"/>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.DelimitedBoostTokenFilterFactory"/>
 </analyzer>
 ----
 ====
@@ -613,8 +611,8 @@ Using a different delimiter (`delimiter="/"`).
 [source,xml]
 ----
 <analyzer>
-<tokenizer name="standard"/>
-<filter name="delimitedBoost" delimiter="/"/>
+  <tokenizer name="standard"/>
+  <filter name="delimitedBoost" delimiter="/"/>
 </analyzer>
 ----
 
@@ -638,17 +636,17 @@ This filter generates edge n-gram tokens of sizes within 
the given range.
 +
 [%autowidth,frame=none]
 |===
-|Optional |Default: `1`
+|Required |Default: none
 |===
-The minimum gram size.
+The minimum gram size, must be > 0.
 
 `maxGramSize`::
 +
 [%autowidth,frame=none]
 |===
-|Optional |Default: `1`
+|Required |Default: none
 |===
-The maximum gram size.
+The maximum gram size, must be >= `minGramSize`.
 
 `preserveOriginal`::
 +
@@ -672,7 +670,7 @@ Default behavior.
 ----
 <analyzer>
   <tokenizer name="standard"/>
-  <filter name="edgeNGram"/>
+  <filter name="edgeNGram" minGramSize="1" maxGramSize="1"/>
 </analyzer>
 ----
 ====
@@ -683,7 +681,7 @@ Default behavior.
 ----
 <analyzer>
   <tokenizer class="solr.StandardTokenizerFactory"/>
-  <filter class="solr.EdgeNGramFilterFactory"/>
+  <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="1"/>
 </analyzer>
 ----
 ====
@@ -947,6 +945,15 @@ The path of a rules file.
 +
 Controls whether matching is case sensitive or not.
 
+`longestOnly`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `false`
+|===
++
+If `true`, only the longest term is emitted.
+
 `strictAffixParsing`::
 +
 [%autowidth,frame=none]
@@ -1108,7 +1115,7 @@ For detailed information on this normalization form, see 
http://www.unicode.org/
 
 == ICU Normalizer 2 Filter
 
-This filter factory normalizes text according to one of five Unicode 
Normalization Forms as described in http://unicode.org/reports/tr15/[Unicode 
Standard Annex #15]:
+This filter normalizes text according to one of five Unicode Normalization 
Forms as described in http://unicode.org/reports/tr15/[Unicode Standard Annex 
#15]:
 
 * NFC: (`name="nfc" mode="compose"`) Normalization Form C, canonical 
decomposition
 * NFD: (`name="nfc" mode="decompose"`) Normalization Form D, canonical 
decomposition, followed by canonical composition
@@ -1214,6 +1221,15 @@ s|Required |Default: none
 The identifier for the ICU System Transform you wish to apply with this filter.
 For a full list of ICU System Transforms, see 
http://demo.icu-project.org/icu-bin/translit?TEMPLATE_FILE=data/translit_rule_main.html.
 
+`direction`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `forward`
+|===
++
+The direction of the ICU transform. Valid options are `forward` and `reverse`.
+
 *Example:*
 
 [.dynamic-tabs]
@@ -1268,6 +1284,15 @@ Path to a text file containing the list of keep words, 
one per line.
 Blank lines and lines that begin with `\#` are ignored.
 This may be an absolute path, or a simple filename in the Solr `conf` 
directory.
 
+`format`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: none
+|===
++
+If the keepwords list has been formatted for Snowball, you can specify 
`format="snowball"` so Solr can read the keepwords file.
+
 `ignoreCase`::
 +
 [%autowidth,frame=none]
@@ -1467,12 +1492,14 @@ Tokens longer than this are discarded.
 This filter limits the number of accepted tokens, typically useful for index 
analysis.
 
 By default, this filter ignores any tokens in the wrapped `TokenStream` once 
the limit has been reached, which can result in `reset()` being called prior to 
`incrementToken()` returning `false`.
-For most `TokenStream` implementations this should be acceptable, and faster 
then consuming the full stream.
+For most `TokenStream` implementations this should be acceptable, and faster 
than consuming the full stream.
 If you are wrapping a `TokenStream` which requires that the full stream of 
tokens be exhausted in order to function properly, use the 
`consumeAllTokens="true"` option.
 
 *Factory class:* `solr.LimitTokenCountFilterFactory`
 
 *Arguments:*
+
+`maxTokenCount`::
 +
 [%autowidth,frame=none]
 |===
@@ -1534,14 +1561,21 @@ This filter limits tokens to those before a configured 
maximum start character o
 This can be useful to limit highlighting, for example.
 
 By default, this filter ignores any tokens in the wrapped `TokenStream` once 
the limit has been reached, which can result in `reset()` being called prior to 
`incrementToken()` returning `false`.
-For most `TokenStream` implementations this should be acceptable, and faster 
then consuming the full stream.
+For most `TokenStream` implementations this should be acceptable, and faster 
than consuming the full stream.
 If you are wrapping a `TokenStream` which requires that the full stream of 
tokens be exhausted in order to function properly, use the 
`consumeAllTokens="true"` option.
 
 *Factory class:* `solr.LimitTokenOffsetFilterFactory`
 
 *Arguments:*
 
-`maxStartOffset`:: (integer, required) Maximum token start character offset.
+`maxStartOffset`::
++
+[%autowidth,frame=none]
+|===
+s|Required |Default: none
+|===
++
+Maximum token start character offset.
 After this limit has been reached, tokens are discarded.
 
 `consumeAllTokens`::
@@ -1595,7 +1629,7 @@ See description above.
 This filter limits tokens to those before a configured maximum token position.
 
 By default, this filter ignores any tokens in the wrapped `TokenStream` once 
the limit has been reached, which can result in `reset()` being called prior to 
`incrementToken()` returning `false`.
-For most `TokenStream` implementations this should be acceptable, and faster 
then consuming the full stream.
+For most `TokenStream` implementations this should be acceptable, and faster 
than consuming the full stream.
 If you are wrapping a `TokenStream` which requires that the full stream of 
tokens be exhausted in order to function properly, use the 
`consumeAllTokens="true"` option.
 
 *Factory class:* `solr.LimitTokenPositionFilterFactory`
@@ -1628,7 +1662,7 @@ See description above.
 --
 [example.tab-pane#byname-filter-limittokenposition]
 ====
-[.tab-label]*With name)*
+[.tab-label]*With name*
 [source,xml]
 ----
 <analyzer>
@@ -1832,7 +1866,7 @@ This filter would normally be preceded by a <<Shingle 
Filter>>, as shown in the
 
 Each input token is hashed.
 It is subsequently "rehashed" `hashCount` times by combining with a set of 
precomputed hashes.
-For each of the resulting hashes, the hash space is divided in to 
`bucketCount` buckets.
+For each of the resulting hashes, the hash space is divided into `bucketCount` 
buckets.
 The lowest set of `hashSetSize` hashes (usually a set of one) is generated for 
each bucket.
 
 This filter generates one type of signature or sketch for the input tokens and 
can be used to compute Jaccard similarity between documents.
@@ -1881,6 +1915,24 @@ With the default settings for `withRotation`, the number 
of hashes generated is
 
 *Example:*
 
+[.dynamic-tabs]
+--
+[example.tab-pane#byname-filter-minhash]
+====
+[.tab-label]*With name*
+[source,xml]
+----
+<analyzer>
+  <tokenizer name="icu"/>
+  <filter name="icuFolding"/>
+  <filter name="shingle" minShingleSize="5" outputUnigrams="false" 
outputUnigramsIfNoShingles="false" maxShingleSize="5" tokenSeparator=" "/>
+  <filter name="minHash" bucketCount="512" hashSetSize="1" hashCount="1"/>
+</analyzer>
+----
+====
+[example.tab-pane#byclass-filter-minhash]
+====
+[.tab-label]*With class name (legacy)*
 [source,xml]
 ----
 <analyzer>
@@ -1890,6 +1942,8 @@ With the default settings for `withRotation`, the number 
of hashes generated is
   <filter class="org.apache.lucene.analysis.minhash.MinHashFilterFactory" 
bucketCount="512" hashSetSize="1" hashCount="1"/>
 </analyzer>
 ----
+====
+--
 
 *In:* "woof woof woof woof woof"
 
@@ -1910,18 +1964,18 @@ Note that tokens are ordered by position and then by 
gram size.
 +
 [%autowidth,frame=none]
 |===
-|Optional |Default: `1`
+|Required |Default: none
 |===
-The minimum gram size.
+The minimum gram size, must be > 0.
 
 `maxGramSize`::
 +
 [%autowidth,frame=none]
 |===
-|Optional |Default: `2`
+|Required |Default: none
 |===
 +
-The maximum gram size.
+The maximum gram size, must be >= `minGramSize`.
 
 `preserveOriginal`::
 +
@@ -1945,7 +1999,7 @@ Default behavior.
 ----
 <analyzer>
   <tokenizer name="standard"/>
-  <filter name="nGram"/>
+  <filter name="nGram" minGramSize="1" maxGramSize="2"/>
 </analyzer>
 ----
 ====
@@ -1956,7 +2010,7 @@ Default behavior.
 ----
 <analyzer>
   <tokenizer class="solr.StandardTokenizerFactory"/>
-  <filter class="solr.NGramFilterFactory"/>
+  <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="2"/>
 </analyzer>
 ----
 ====
@@ -2087,7 +2141,7 @@ Tokens with a matching type name will have their payload 
set to the above floati
 == Pattern Replace Filter
 
 This filter applies a regular expression to each token and, for those that 
match, substitutes the given replacement string in place of the matched pattern.
-Tokens which do not match are passed though unchanged.
+Tokens which do not match are passed through unchanged.
 
 *Factory class:* `solr.PatternReplaceFilterFactory`
 
@@ -3096,10 +3150,10 @@ If `false`, all equivalent synonyms will be reduced to 
the first in the list.
 +
 [%autowidth,frame=none]
 |===
-|Optional |Default: none
+|Optional |Default: `solr`
 |===
 +
-(optional; default: `solr`) Controls how the synonyms will be parsed.
+Controls how the synonyms will be parsed.
 The short names `solr` (for 
{lucene-javadocs}/analysis/common/org/apache/lucene/analysis/synonym/SolrSynonymParser.html[`SolrSynonymParser`]
 and `wordnet` (for 
{lucene-javadocs}/analysis/common/org/apache/lucene/analysis/synonym/WordnetSynonymParser.html[`WordnetSynonymParser`]
 ) are supported.
 You may alternatively supply the name of your own 
{lucene-javadocs}/analysis/common/org/apache/lucene/analysis/synonym/SynonymMap.Builder.html[`SynonymMap.Builder`]
 subclass.
 
@@ -3368,6 +3422,25 @@ This filter adds the token's type, as a token at the 
same position as the token,
 +
 The prefix to prepend to the token's type.
 
+`ignore`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: none
+|===
++
+A comma-separated list of types to ignore and not convert to synonyms.
+
+`synFlagsMask`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: see description
+|===
++
+A mask (provided as an integer) to control what flags are propagated to the 
synonyms.
+The default value is an integer `-1`, i.e., the mask `0xFFFFFFFF` - this mask 
propagates any flags as is.
+
 *Examples:*
 
 With the example below, each token's type will be emitted verbatim at the same 
position:
@@ -3614,6 +3687,15 @@ The path to a file that contains a list of protected 
words that should be passed
 +
 If `1`, strips the possessive `'s` from each subword.
 
+`adjustOffsets`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `true`
+|===
++
+If `true`, the offsets of partial terms are adjusted.
+
 `types`::
 +
 [%autowidth,frame=none]
diff --git a/solr/solr-ref-guide/modules/indexing-guide/pages/schema-elements.adoc b/solr/solr-ref-guide/modules/indexing-guide/pages/schema-elements.adoc
index abb1b7492cf..ee45819dab2 100644
--- a/solr/solr-ref-guide/modules/indexing-guide/pages/schema-elements.adoc
+++ b/solr/solr-ref-guide/modules/indexing-guide/pages/schema-elements.adoc
@@ -95,18 +95,23 @@ However, `uniqueKey` will continue to work, as long as the 
field is properly use
 Similarity is a Lucene class used to score a document in searching.
 
 Each collection has one "global" Similarity.
-By default Solr uses an implicit 
{solr-javadocs}/core/org/apache/solr/search/similarities/SchemaSimilarityFactory.html[`SchemaSimilarityFactory`]
 which allows individual field types to be configured with a "per-type" 
specific Similarity and implicitly uses `BM25Similarity` for any field type 
which does not have an explicit Similarity.
+By default, Solr uses an implicit 
{solr-javadocs}/core/org/apache/solr/search/similarities/SchemaSimilarityFactory.html[`SchemaSimilarityFactory`]
 which allows individual field types to be configured with a "per-type" 
specific Similarity and implicitly uses `BM25Similarity` for any field type 
which does not have an explicit Similarity.
 
 This default behavior can be overridden by declaring a top level 
`<similarity/>` element in your schema, outside of any single field type.
 This similarity declaration can either refer directly to the name of a class 
with a no-argument constructor, such as in this example showing 
`BM25Similarity`:
 
 [source,xml]
 ----
-<similarity class="solr.BM25SimilarityFactory"/>
+<similarity class="org.apache.lucene.search.similarities.BM25Similarity"/>
 ----
 
-or by referencing a `SimilarityFactory` implementation, which may take 
optional initialization parameters:
+or by referencing a `SimilarityFactory` implementation:
+[source,xml]
+----
+<similarity class="solr.BM25SimilarityFactory"/>
+----
 
+When using the similarity factory, it is possible to specify optional 
initialization parameters:
 [source,xml]
 ----
 <similarity class="solr.DFRSimilarityFactory">
diff --git a/solr/solr-ref-guide/modules/indexing-guide/pages/tokenizers.adoc b/solr/solr-ref-guide/modules/indexing-guide/pages/tokenizers.adoc
index 24d04ec1802..fbb69399efc 100644
--- a/solr/solr-ref-guide/modules/indexing-guide/pages/tokenizers.adoc
+++ b/solr/solr-ref-guide/modules/indexing-guide/pages/tokenizers.adoc
@@ -100,7 +100,7 @@ Arguments may be passed to tokenizer factories by setting 
attributes on the `<to
 
 === When to Use a CharFilter vs. a TokenFilter
 
-There are several pairs of CharFilters and TokenFilters that have related 
(i.e., `MappingCharFilter` and `ASCIIFoldingFilter`) or nearly identical (i.e., 
`PatternReplaceCharFilterFactory` and `PatternReplaceFilterFactory`) 
functionality and it may not always be obvious which is the best choice.
+There are several pairs of CharFilters and TokenFilters that have related 
(i.e., `MappingCharFilter` and `ASCIIFoldingFilter`) or nearly identical (i.e., 
`PatternReplaceCharFilterFactory` and `PatternReplaceFilterFactory`) 
functionality, and it may not always be obvious which is the best choice.
 
 The decision about which to use depends largely on which Tokenizer you are 
using, and whether you need to preprocess the stream of characters.
 
@@ -125,7 +125,14 @@ The Standard Tokenizer supports 
http://unicode.org/reports/tr29/#Word_Boundaries
 
 *Arguments:*
 
-`maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the 
number of characters specified by `maxTokenLength`.
+`maxTokenLength`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `255`
+|===
++
+Solr ignores tokens that exceed the number of characters specified by 
`maxTokenLength`.
 
 *Example:*
 
@@ -174,7 +181,14 @@ Delimiter characters are discarded, with the following 
exceptions:
 
 *Arguments:*
 
-`maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the 
number of characters specified by `maxTokenLength`.
+`maxTokenLength`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `255`
+|===
++
+Solr ignores tokens that exceed the number of characters specified by 
`maxTokenLength`.
 
 *Example:*
 
@@ -212,7 +226,16 @@ This tokenizer treats the entire text field as a single 
token.
 
 *Factory class:* `solr.KeywordTokenizerFactory`
 
-*Arguments:* None
+*Arguments:*
+
+`maxTokenLen`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `256`
+|===
++
+Maximum token length the tokenizer will emit.
 
 *Example:*
 
@@ -250,7 +273,16 @@ This tokenizer creates tokens from strings of contiguous 
letters, discarding all
 
 *Factory class:* `solr.LetterTokenizerFactory`
 
-*Arguments:* None
+*Arguments:*
+
+`maxTokenLen`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `255`
+|===
++
+Maximum token length the tokenizer will emit.
 
 *Example:*
 
@@ -289,7 +321,16 @@ Whitespace and non-letters are discarded.
 
 *Factory class:* `solr.LowerCaseTokenizerFactory`
 
-*Arguments:* None
+*Arguments:*
+
+`maxTokenLen`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `255`
+|===
++
+Maximum token length the tokenizer will emit.
 
 *Example:*
 
@@ -329,9 +370,23 @@ Reads the field text and generates n-gram tokens of sizes 
in the given range.
 
 *Arguments:*
 
-`minGramSize`: (integer, default 1) The minimum n-gram size, must be > 0.
+`minGramSize`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `1`
+|===
++
+The minimum n-gram size, must be > 0.
 
-`maxGramSize`: (integer, default 2) The maximum n-gram size, must be >= 
`minGramSize`.
+`maxGramSize`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `2`
+|===
++
+The maximum n-gram size, must be >= `minGramSize`.
 
 *Example:*
 
@@ -408,9 +463,23 @@ Reads the field text and generates edge n-gram tokens of 
sizes in the given rang
 
 *Arguments:*
 
-`minGramSize`: (integer, default is 1) The minimum n-gram size, must be > 0.
+`minGramSize`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `1`
+|===
++
+The minimum n-gram size, must be > 0.
 
-`maxGramSize`: (integer, default is 1) The maximum n-gram size, must be >= 
`minGramSize`.
+`maxGramSize`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `1`
+|===
++
+The maximum n-gram size, must be >= `minGramSize`.
 
 *Example:*
 
@@ -490,7 +559,33 @@ The default configuration for `solr.ICUTokenizerFactory` 
provides UAX#29 word br
 
 *Arguments:*
 
-`rulefile`: a comma-separated list of `code:rulefile` pairs in the following 
format: four-letter ISO 15924 script code, followed by a colon, then a resource 
path.
+`rulefile`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: none
+|===
++
+A comma-separated list of `code:rulefile` pairs in the following format: 
four-letter ISO 15924 script code, followed by a colon, then a resource path.
+
+`cjkAsWords`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `true`
+|===
++
+If `true`, CJK text would undergo dictionary-based segmentation, and all 
Han+Hiragana+Katakana words will be tagged as IDEOGRAPHIC.
+Otherwise, text will be segmented according to UAX#29 defaults.
+
+`myanmarAsWords`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `true`
+|===
++
+If `true`, Myanmar text would undergo dictionary-based segmentation, otherwise 
it will be tokenized as syllables.
 
 *Example:*
 
@@ -562,13 +657,47 @@ This tokenizer creates synonyms from file path 
hierarchies.
 
 *Arguments:*
 
-`delimiter`: (character, no default) You can specify the file path delimiter 
and replace it with a delimiter you provide.
+`delimiter`::
++
+[%autowidth,frame=none]
+|===
+s|Required |Default: none
+|===
++
+You can specify the file path delimiter and replace it with a delimiter you 
provide.
 This can be useful for working with backslash delimiters.
 
-`replace`: (character, no default) Specifies the delimiter character Solr uses 
in the tokenized output.
+`replace`::
++
+[%autowidth,frame=none]
+|===
+s|Required |Default: none
+|===
++
+Specifies the delimiter character Solr uses in the tokenized output.
+
+`reverse`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `false`
+|===
++
+If `true`, switch the tokenizer behavior to build the path hierarchy in 
"reversed" order.
+This is typically useful for tokenizing the URLs.
+
+`skip`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `0`
+|===
++
+Number of leftmost (or rightmost, if reverse=true) path elements to drop from 
each emitted token.
 
 *Example:*
 
+Default behavior
 [.dynamic-tabs]
 --
 [example.tab-pane#byname-tokenizer-pathhierarchy]
@@ -601,6 +730,41 @@ This can be useful for working with backslash delimiters.
 
 *Out:* "c:", "c:/usr", "c:/usr/local", "c:/usr/local/apache"
 
+*Example:*
+
+Reverse order
+[.dynamic-tabs]
+--
+[example.tab-pane#byname-tokenizer-pathhierarchy-reversed]
+====
+[.tab-label]*With name*
+[source,xml]
+----
+<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
+  <analyzer>
+    <tokenizer name="pathHierarchy" delimiter="." replace="." reverse="true"/>
+  </analyzer>
+</fieldType>
+----
+====
+[example.tab-pane#byclass-tokenizer-pathhierarchy-reversed]
+====
+[.tab-label]*With class name (legacy)*
+[source,xml]
+----
+<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
+  <analyzer>
+    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="." 
replace="." reverse="true"/>
+  </analyzer>
+</fieldType>
+----
+====
+--
+
+*In:* "www.site.co.uk"
+
+*Out:* "www.site.co.uk", "site.co.uk", "co.uk", "uk"
+
 == Regular Expression Pattern Tokenizer
 
 This tokenizer uses a Java regular expression to break the input text stream 
into tokens.
@@ -612,9 +776,23 @@ See {java-javadocs}java/util/regex/Pattern.html[the 
Javadocs for `java.util.rege
 
 *Arguments:*
 
-`pattern`: (Required) The regular expression, as defined by in 
`java.util.regex.Pattern`.
+`pattern`::
++
+[%autowidth,frame=none]
+|===
+s|Required |Default: none
+|===
++
+The regular expression, as defined by in `java.util.regex.Pattern`.
 
-`group`: (Optional, default -1) Specifies which regex group to extract as the 
token(s).
+`group`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `-1`
+|===
++
+Specifies which regex group to extract as the token(s).
 The value -1 means the regex should be treated as a delimiter that separates 
tokens.
 Non-negative group numbers (>= 0) indicate that character sequences matching 
that regex group should be converted to tokens.
 Group zero refers to the entire regex, groups greater than zero refer to 
parenthesized sub-expressions of the regex, counted from left to right.
@@ -687,7 +865,7 @@ A sequence of at least one capital letter followed by zero 
or more letters of ei
 
 *Example:*
 
-Extract part numbers which are preceded by "SKU", "Part" or "Part Number", 
case sensitive, with an optional semi-colon separator.
+Extract part numbers which are preceded by "SKU", "Part" or "Part Number", 
case sensitive, with an optional semicolon separator.
 Part numbers must be all numeric digits, with an optional hyphen.
 Regex capture groups are numbered by counting left parenthesis from left to 
right.
 Group 3 is the subexpression "[0-9-]+", which matches one or more digits or 
hyphens.
@@ -729,11 +907,25 @@ The syntax is more limited than 
`PatternTokenizerFactory`, but the tokenization
 
 *Arguments:*
 
-`pattern`: (Required) The regular expression, as defined by in the 
{lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] 
javadocs, identifying the characters to include in tokens.
+`pattern`::
++
+[%autowidth,frame=none]
+|===
+s|Required |Default: none
+|===
++
+The regular expression, as defined in the 
{lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] 
javadocs, identifying the characters to include in tokens.
 The matching is greedy such that the longest token matching at a given point 
is created.
 Empty tokens are never created.
 
-`maxDeterminizedStates`: (Optional, default 10000) the limit on total state 
count for the determined automaton computed from the regexp.
+`determinizeWorkLimit`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `10000`
+|===
++
+The limit on the total effort spent determinizing the automaton computed from 
the regexp.
 
 *Example:*
 
@@ -772,11 +964,25 @@ The syntax is more limited than 
`PatternTokenizerFactory`, but the tokenization
 
 *Arguments:*
 
-`pattern`: (Required) The regular expression, as defined by in the 
{lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] 
javadocs, identifying the characters that should split tokens.
+`pattern`::
++
+[%autowidth,frame=none]
+|===
+s|Required |Default: none
+|===
++
+The regular expression, as defined in the 
{lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] 
javadocs, identifying the characters that should split tokens.
 The matching is greedy such that the longest token separator matching at a 
given point is matched.
 Empty tokens are never created.
 
-`maxDeterminizedStates`: (Optional, default 10000) the limit on total state 
count for the determined automaton computed from the regexp.
+`determinizeWorkLimit`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `10000`
+|===
++
+The limit on the total effort spent determinizing the automaton computed from 
the regexp.
 
 *Example:*
 
@@ -827,7 +1033,14 @@ The UAX29 URL Email Tokenizer supports 
http://unicode.org/reports/tr29/#Word_Bou
 
 *Arguments:*
 
-`maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the 
number of characters specified by `maxTokenLength`.
+`maxTokenLength`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `255`
+|===
++
+Solr ignores tokens that exceed the number of characters specified by 
`maxTokenLength`.
 
 *Example:*
 
@@ -881,6 +1094,15 @@ Valid values:
 * `java`: Uses 
{java-javadocs}java/lang/Character.html#isWhitespace-int-[Character.isWhitespace(int)]
 * `unicode`: Uses Unicode's WHITESPACE property
 
+`maxTokenLen`::
++
+[%autowidth,frame=none]
+|===
+s|Optional |Default: `255`
+|===
++
+Maximum token length the tokenizer will emit.
+
 *Example:*
 
 [.dynamic-tabs]
