This is an automated email from the ASF dual-hosted git repository.
ctargett pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/solr.git
The following commit(s) were added to refs/heads/main by this push:
new 7d75657 SOLR-12255: Add docs for Nori Korean tokenizer (#270)
7d75657 is described below
commit 7d75657d1f5de5f7d5de01a34a7e68aee138b51f
Author: Cassandra Targett <[email protected]>
AuthorDate: Mon Sep 6 14:34:25 2021 -0500
SOLR-12255: Add docs for Nori Korean tokenizer (#270)
---
solr/solr-ref-guide/src/caches-warming.adoc | 2 +-
solr/solr-ref-guide/src/filters.adoc | 9 ++
solr/solr-ref-guide/src/language-analysis.adoc | 127 +++++++++++++++++++++++++
3 files changed, 137 insertions(+), 1 deletion(-)
diff --git a/solr/solr-ref-guide/src/caches-warming.adoc b/solr/solr-ref-guide/src/caches-warming.adoc
index 7013c18..52678d3 100644
--- a/solr/solr-ref-guide/src/caches-warming.adoc
+++ b/solr/solr-ref-guide/src/caches-warming.adoc
@@ -187,7 +187,7 @@ Ideally, this number should be as close to 1 as possible.
If you find that you have a low hit ratio but you've set your cache size high, you can optimize by reducing the cache size - there's no need to keep those objects in memory when they are not being used.
-Another useful metric is the cache evictions, which measures the ojects removed from the cache.
+Another useful metric is the cache evictions, which measures the objects removed from the cache.
A high rate of evictions can indicate that your cache is too small and increasing it may show a higher hit ratio.
Alternatively, if your hit ratio is high but your evictions are low, your cache might be too large and you may benefit from reducing the size.
diff --git a/solr/solr-ref-guide/src/filters.adoc b/solr/solr-ref-guide/src/filters.adoc
index c3f1db5..95c56ae 100644
--- a/solr/solr-ref-guide/src/filters.adoc
+++ b/solr/solr-ref-guide/src/filters.adoc
@@ -2670,6 +2670,15 @@ If `true`, then individual tokens will be output if no shingles are possible.
+
The string to use when joining adjacent tokens to form a shingle.
+`fillerToken`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `_` (underscore)
+|===
++
+The character used to fill in for removed stop words in order to preserve position increments.
+
*Example:*
Default behavior.
diff --git a/solr/solr-ref-guide/src/language-analysis.adoc b/solr/solr-ref-guide/src/language-analysis.adoc
index e73453f..3a93f42 100644
--- a/solr/solr-ref-guide/src/language-analysis.adoc
+++ b/solr/solr-ref-guide/src/language-analysis.adoc
@@ -1079,6 +1079,7 @@ The languages covered here are:
| <<Italian>>
| <<Irish>>
| <<Japanese>>
+| <<Korean>>
| <<Latvian>>
| <<Norwegian>>
| <<Persian>>
@@ -1093,6 +1094,8 @@ The languages covered here are:
| <<Thai>>
| <<Turkish>>
| <<Ukrainian>>
+|
+|
|===
=== Arabic
@@ -2419,6 +2422,130 @@ Example:
====
--
+=== Korean
+
+The Korean (nori) analyzer integrates Lucene's nori analysis module into Solr.
+It uses the https://bitbucket.org/eunjeon/mecab-ko-dic[mecab-ko-dic] dictionary to perform morphological analysis of Korean texts.
+
+The dictionary was built with http://taku910.github.io/mecab/[MeCab] and defines a format for the features adapted for the Korean language.
+
+Nori also has a user dictionary feature that allows overriding the statistical model with your own entries for segmentation, part-of-speech tags, and readings without a need to specify weights.
+
+*Example*:
+
+[.dynamic-tabs]
+--
+[example.tab-pane#byname-lang-korean]
+====
+[.tab-label]*With name*
+[source,xml]
+----
+<fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
+ <analyzer>
+ <tokenizer name="korean" decompoundMode="discard" outputUnknownUnigrams="false"/>
+ <filter name="koreanPartOfSpeechStop" />
+ <filter name="koreanReadingForm" />
+ <filter name="lowercase" />
+ </analyzer>
+</fieldType>
+----
+====
+
+[example.tab-pane#byclass-lang-korean]
+====
+[.tab-label]*With class name (legacy)*
+[source,xml]
+----
+<fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
+ <analyzer>
+ <tokenizer class="solr.KoreanTokenizerFactory" decompoundMode="discard" outputUnknownUnigrams="false"/>
+ <filter class="solr.KoreanPartOfSpeechStopFilterFactory" />
+ <filter class="solr.KoreanReadingFormFilterFactory" />
+ <filter class="solr.LowerCaseFilterFactory" />
+ </analyzer>
+</fieldType>
+----
+====
+--
+
+
+==== Korean Tokenizer
+
+*Factory class*: `solr.KoreanTokenizerFactory`
+
+*SPI name*: `korean`
+
+*Arguments*:
+
+`userDictionary`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: none
+|===
++
+Path to a user-supplied dictionary to add custom nouns or compound terms to the default dictionary.
+
+`userDictionaryEncoding`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: none
+|===
++
+Character encoding of the user dictionary.
+
+`decompoundMode`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `discard`
+|===
++
+Defines how to handle compound tokens. The options are:
+
+* `none`: No decomposition for tokens.
+* `discard`: Tokens are decomposed and the original form is discarded.
+* `mixed`: Tokens are decomposed and the original form is retained.
+
+`outputUnknownUnigrams`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `false`
+|===
++
+If `true`, unigrams will be output for unknown words.
+
+`discardPunctuation`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `true`
+|===
++
+If `true`, punctuation will be discarded.
+
+==== Korean Part of Speech Stop Filter
+
+This filter removes tokens that match parts of speech tags.
+
+*Factory class*: `solr.KoreanPartOfSpeechStopFilterFactory`
+
+*SPI name*: `koreanPartOfSpeechStop`
+
+*Arguments*: None.
+
+==== Korean Reading Form Filter
+
+This filter replaces term text with the Reading Attribute, the Hangul transcription of Hanja characters.
+
+*Factory class*: `solr.KoreanReadingFormFilterFactory`
+
+*SPI name*: `koreanReadingForm`
+
+*Arguments*: None.
+
[[hebrew-lao-myanmar-khmer]]
=== Hebrew, Lao, Myanmar, Khmer