[solr] branch main updated: SOLR-12255: Add docs for Nori Korean tokenizer (#270)

ctargett Mon, 06 Sep 2021 12:34:36 -0700

This is an automated email from the ASF dual-hosted git repository.

ctargett pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/solr.git



The following commit(s) were added to refs/heads/main by this push:
     new 7d75657  SOLR-12255: Add docs for Nori Korean tokenizer (#270)
7d75657 is described below

commit 7d75657d1f5de5f7d5de01a34a7e68aee138b51f
Author: Cassandra Targett <[email protected]>
AuthorDate: Mon Sep 6 14:34:25 2021 -0500

    SOLR-12255: Add docs for Nori Korean tokenizer (#270)
---
 solr/solr-ref-guide/src/caches-warming.adoc    |   2 +-
 solr/solr-ref-guide/src/filters.adoc           |   9 ++
 solr/solr-ref-guide/src/language-analysis.adoc | 127 +++++++++++++++++++++++++
 3 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/solr/solr-ref-guide/src/caches-warming.adoc b/solr/solr-ref-guide/src/caches-warming.adoc
index 7013c18..52678d3 100644
--- a/solr/solr-ref-guide/src/caches-warming.adoc
+++ b/solr/solr-ref-guide/src/caches-warming.adoc
@@ -187,7 +187,7 @@ Ideally, this number should be as close to 1 as possible.
 
 If you find that you have a low hit ratio but you've set your cache size high, you can optimize by reducing the cache size - there's no need to keep those objects in memory when they are not being used.
 
-Another useful metric is the cache evictions, which measures the ojects removed from the cache.
+Another useful metric is the cache evictions, which measures the objects removed from the cache.
 A high rate of evictions can indicate that your cache is too small and increasing it may show a higher hit ratio.
 Alternatively, if your hit ratio is high but your evictions are low, your cache might be too large and you may benefit from reducing the size.
 
diff --git a/solr/solr-ref-guide/src/filters.adoc b/solr/solr-ref-guide/src/filters.adoc
index c3f1db5..95c56ae 100644
--- a/solr/solr-ref-guide/src/filters.adoc
+++ b/solr/solr-ref-guide/src/filters.adoc
@@ -2670,6 +2670,15 @@ If `true`, then individual tokens will be output if no shingles are possible.
 +
 The string to use when joining adjacent tokens to form a shingle.
 
+`fillerToken`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `_` (underscore)
+|===
++
+The character used to fill in for removed stop words in order to preserve position increments.
+
 *Example:*
 
 Default behavior.
diff --git a/solr/solr-ref-guide/src/language-analysis.adoc b/solr/solr-ref-guide/src/language-analysis.adoc
index e73453f..3a93f42 100644
--- a/solr/solr-ref-guide/src/language-analysis.adoc
+++ b/solr/solr-ref-guide/src/language-analysis.adoc
@@ -1079,6 +1079,7 @@ The languages covered here are:
 | <<Italian>>
 | <<Irish>>
 | <<Japanese>>
+| <<Korean>>
 | <<Latvian>>
 | <<Norwegian>>
 | <<Persian>>
@@ -1093,6 +1094,8 @@ The languages covered here are:
 | <<Thai>>
 | <<Turkish>>
 | <<Ukrainian>>
+|
+|
 |===
 
 === Arabic
@@ -2419,6 +2422,130 @@ Example:
 ====
 --
 
+=== Korean
+
+The Korean (nori) analyzer integrates Lucene's nori analysis module into Solr.
+It uses the https://bitbucket.org/eunjeon/mecab-ko-dic[mecab-ko-dic] dictionary to perform morphological analysis of Korean texts.
+
+The dictionary was built with http://taku910.github.io/mecab/[MeCab] and defines a format for the features adapted for the Korean language.
+
+Nori also has a user dictionary feature that allows overriding the statistical model with your own entries for segmentation, part-of-speech tags, and readings without a need to specify weights.
+
+*Example*:
+
+[.dynamic-tabs]
+--
+[example.tab-pane#byname-lang-korean]
+====
+[.tab-label]*With name*
+[source,xml]
+----
+<fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
+  <analyzer>
+    <tokenizer name="korean" decompoundMode="discard" outputUnknownUnigrams="false"/>
+    <filter name="koreanPartOfSpeechStop" />
+    <filter name="koreanReadingForm" />
+    <filter name="lowercase" />
+  </analyzer>
+</fieldType>
+----
+====
+
+[example.tab-pane#byclass-lang-korean]
+====
+[.tab-label]*With class name (legacy)*
+[source,xml]
+----
+<fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
+  <analyzer>
+    <tokenizer class="solr.KoreanTokenizerFactory" decompoundMode="discard" outputUnknownUnigrams="false"/>
+    <filter class="solr.KoreanPartOfSpeechStopFilterFactory" />
+    <filter class="solr.KoreanReadingFormFilterFactory" />
+    <filter class="solr.LowerCaseFilterFactory" />
+  </analyzer>
+</fieldType>
+----
+====
+--
+
+
+==== Korean Tokenizer
+
+*Factory class*: `solr.KoreanTokenizerFactory`
+
+*SPI name*: `korean`
+
+*Arguments*:
+
+`userDictionary`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: none
+|===
++
+Path to a user-supplied dictionary to add custom nouns or compound terms to the default dictionary.
+
+`userDictionaryEncoding`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: none
+|===
++
+Character encoding of the user dictionary.
+
+`decompoundMode`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `discard`
+|===
++
+Defines how to handle compound tokens. The options are:
+
+* `none`: No decomposition for tokens.
+* `discard`: Tokens are decomposed and the original form is discarded.
+* `mixed`: Tokens are decomposed and the original form is retained.
+
+`outputUnknownUnigrams`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `false`
+|===
++
+If `true`, unigrams will be output for unknown words.
+
+`discardPunctuation`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `true`
+|===
++
+If `true`, punctuation will be discarded.
+
+==== Korean Part of Speech Stop Filter
+
+This filter removes tokens that match parts of speech tags.
+
+*Factory class*: `solr.KoreanPartOfSpeechStopFilterFactory`
+
+*SPI name*: `koreanPartOfSpeechStop`
+
+*Arguments*: None.
+
+==== Korean Reading Form Filter
+
+This filter replaces term text with the Reading Attribute, the Hangul transcription of Hanja characters.
+
+*Factory class*: `solr.KoreanReadingFormFilterFactory`
+
+*SPI name*: `koreanReadingForm`
+
+*Arguments*: None.
+
 [[hebrew-lao-myanmar-khmer]]
 === Hebrew, Lao, Myanmar, Khmer
