This is an automated email from the ASF dual-hosted git repository.

nightowl888 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/lucenenet.git
commit bffd9b066f3236789ba2871cf6f61e769eb7f044
Author: Shad Storhaug <[email protected]>
AuthorDate: Sun Mar 28 20:33:44 2021 +0700

    docs: Lucene.Net.Analysis.Common: Fixed broken formatting and links (see #284, #300)
---
 .../Analysis/CharFilter/package.md     |   8 +-
 .../Analysis/Cjk/package.md            |  14 ++-
 .../Analysis/Cn/package.md             |  15 ++-
 .../Analysis/Compound/package.md       | 119 ++++++++++++---
 .../Analysis/Standard/Std31/package.md |   4 +-
 .../Analysis/Standard/Std34/package.md |   4 +-
 .../Analysis/Standard/Std36/package.md |   4 +-
 .../Analysis/Standard/Std40/package.md |   4 +-
 .../Analysis/Standard/package.md       |  80 +++++-------
 9 files changed, 146 insertions(+), 106 deletions(-)

diff --git a/src/Lucene.Net.Analysis.Common/Analysis/CharFilter/package.md b/src/Lucene.Net.Analysis.Common/Analysis/CharFilter/package.md
index 929671f..b3ddf5b 100644
--- a/src/Lucene.Net.Analysis.Common/Analysis/CharFilter/package.md
+++ b/src/Lucene.Net.Analysis.Common/Analysis/CharFilter/package.md
@@ -1,4 +1,4 @@
----
+---
 uid: Lucene.Net.Analysis.CharFilters
 summary: *content
 ---
@@ -24,10 +24,10 @@ summary: *content
 
 CharFilters are chainable filters that normalize text before tokenization and provide mappings between normalized text offsets and the corresponding offset in the original text.
 
-<h2>CharFilter offset mappings</h2>
+## CharFilter offset mappings
 
 CharFilters modify an input stream via a series of substring replacements (including deletions and insertions) to produce an output stream. There are three possible replacement cases: the replacement string has the same length as the original substring; the replacement is shorter; and the replacement is longer. In the latter two cases (when the replacement has a different length than the original), one or more offset correction mappings are required.
 
- When the replacement is shorter than the original (e.g. when the replacement is the empty string), a single offset correction mapping should be added at the replacement's end offset in the output stream. The `cumulativeDiff` parameter to the `addOffCorrectMapping()` method will be the sum of all previous replacement offset adjustments, with the addition of the difference between the lengths of the original substring and the replacement string (a positive value).
+ When the replacement is shorter than the original (e.g. when the replacement is the empty string), a single offset correction mapping should be added at the replacement's end offset in the output stream. The `cumulativeDiff` parameter to the `AddOffCorrectMap()` method will be the sum of all previous replacement offset adjustments, with the addition of the difference between the lengths of the original substring and the replacement string (a positive value).
- When the replacement is longer than the original (e.g. when the original is the empty string), you should add as many offset correction mappings as the difference between the lengths of the replacement string and the original substring, starting at the end offset the original substring would have had in the output stream. The `cumulativeDiff` parameter to the `addOffCorrectMapping()` method will be the sum of all previous replacement offset adjustments, with the addition of the differen [...]
\ No newline at end of file
+ When the replacement is longer than the original (e.g. when the original is the empty string), you should add as many offset correction mappings as the difference between the lengths of the replacement string and the original substring, starting at the end offset the original substring would have had in the output stream. The `cumulativeDiff` parameter to the `AddOffCorrectMap()` method will be the sum of all previous replacement offset adjustments, with the addition of the difference b [...]
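The offset-correction bookkeeping described in the CharFilter docs above can be sketched language-agnostically. This is a hypothetical illustration of the arithmetic only, not the Lucene.NET `CharFilter` implementation; the `(output_offset, cumulative_diff)` pair representation is an assumption for the demo.

```python
# Sketch (assumed representation, not Lucene.NET code): each correction is a
# pair (output_offset, cumulative_diff). For offsets in the filtered output
# stream at or beyond output_offset, add cumulative_diff to recover the
# corresponding offset in the original text.

def correct_offset(corrections, output_offset):
    """Map an offset in the output stream back to the original text."""
    diff = 0
    for off, cumulative_diff in corrections:
        if output_offset >= off:
            diff = cumulative_diff  # latest applicable correction wins
        else:
            break
    return output_offset + diff

# Example: original "a&b" with "&" deleted -> output "ab" (one char shorter).
# Per the docs, a single mapping is added at the replacement's end offset in
# the output stream (1), with cumulative_diff = 1 (original length 1 minus
# replacement length 0, plus no prior adjustments).
corrections = [(1, 1)]
print(correct_offset(corrections, 0))  # 0: 'a' is unchanged
print(correct_offset(corrections, 1))  # 2: 'b' sits past the deleted '&'
```

The same cumulative scheme extends to the "replacement is longer" case by adding several mappings with negative differences, as the second bullet describes.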
\ No newline at end of file
diff --git a/src/Lucene.Net.Analysis.Common/Analysis/Cjk/package.md b/src/Lucene.Net.Analysis.Common/Analysis/Cjk/package.md
index 8f45e50..946ee7d 100644
--- a/src/Lucene.Net.Analysis.Common/Analysis/Cjk/package.md
+++ b/src/Lucene.Net.Analysis.Common/Analysis/Cjk/package.md
@@ -1,4 +1,4 @@
----
+---
 uid: Lucene.Net.Analysis.Cjk
 summary: *content
 ---
@@ -23,4 +23,14 @@ summary: *content
 
 Analyzer for Chinese, Japanese, and Korean, which indexes bigrams. This analyzer generates bigram terms, which are overlapping groups of two adjacent Han, Hiragana, Katakana, or Hangul characters.
 
- Three analyzers are provided for Chinese, each of which treats Chinese text in a different way. * ChineseAnalyzer (in the analyzers/cn package): Index unigrams (individual Chinese characters) as a token. * CJKAnalyzer (in this package): Index bigrams (overlapping groups of two adjacent Chinese characters) as tokens. * SmartChineseAnalyzer (in the analyzers/smartcn package): Index words (attempt to segment Chinese text into words) as tokens. Example phrase: "我是中国人" 1. ChineseAnalyzer: 我- [...]
\ No newline at end of file
+ Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.
+
+* ChineseAnalyzer (in the Lucene.Net.Analysis.Cn namespace): Index unigrams (individual Chinese characters) as a token.
+* CJKAnalyzer (in the Lucene.Net.Analysis.Cjk namespace): Index bigrams (overlapping groups of two adjacent Chinese characters) as tokens.
+* SmartChineseAnalyzer (in the Lucene.Net.Analysis.SmartCn package): Index words (attempt to segment Chinese text into words) as tokens.
+
+Example phrase: "我是中国人"
+
+1. ChineseAnalyzer: 我-是-中-国-人
+2. CJKAnalyzer: 我是-是中-中国-国人
+3. SmartChineseAnalyzer: 我-是-中国-人
\ No newline at end of file
diff --git a/src/Lucene.Net.Analysis.Common/Analysis/Cn/package.md b/src/Lucene.Net.Analysis.Common/Analysis/Cn/package.md
index a0a8f74..aa00067 100644
--- a/src/Lucene.Net.Analysis.Common/Analysis/Cn/package.md
+++ b/src/Lucene.Net.Analysis.Common/Analysis/Cn/package.md
@@ -1,4 +1,4 @@
----
+---
 uid: Lucene.Net.Analysis.Cn
 summary: *content
 ---
@@ -22,4 +22,15 @@ summary: *content
 
 Analyzer for Chinese, which indexes unigrams (individual chinese characters).
 
- Three analyzers are provided for Chinese, each of which treats Chinese text in a different way. * StandardAnalyzer: Index unigrams (individual Chinese characters) as a token. * CJKAnalyzer (in the analyzers/cjk package): Index bigrams (overlapping groups of two adjacent Chinese characters) as tokens. * SmartChineseAnalyzer (in the analyzers/smartcn package): Index words (attempt to segment Chinese text into words) as tokens. Example phrase: "我是中国人" 1. StandardAnalyzer: 我-是-中-国-人 2. CJKA [...]
\ No newline at end of file
+ Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.
+
+* StandardAnalyzer: Index unigrams (individual Chinese characters) as a token.
+* CJKAnalyzer (in the Lucene.Net.Analysis.Cjk namespace): Index bigrams (overlapping groups of two adjacent Chinese characters) as tokens.
+* SmartChineseAnalyzer (in the Lucene.Net.Analysis.SmartCn package): Index words (attempt to segment Chinese text into words) as tokens.
+
+
+Example phrase: "我是中国人"
+
+1. StandardAnalyzer: 我-是-中-国-人
+2. CJKAnalyzer: 我是-是中-中国-国人
+3. SmartChineseAnalyzer: 我-是-中国-人
\ No newline at end of file
diff --git a/src/Lucene.Net.Analysis.Common/Analysis/Compound/package.md b/src/Lucene.Net.Analysis.Common/Analysis/Compound/package.md
index a56f95c..6560069 100644
--- a/src/Lucene.Net.Analysis.Common/Analysis/Compound/package.md
+++ b/src/Lucene.Net.Analysis.Common/Analysis/Compound/package.md
@@ -1,4 +1,4 @@
----
+---
 uid: Lucene.Net.Analysis.Compound
 summary: *content
 ---
@@ -147,56 +147,77 @@ This decision matrix should help you:
 
 ### Examples
 
-    public void testHyphenationCompoundWordsDE() throws Exception {
-      String[] dict = { "Rind", "Fleisch", "Draht", "Schere", "Gesetz",
-          "Aufgabe", "Überwachung" };
-
-      Reader reader = new FileReader("de_DR.xml");
-
-      HyphenationTree hyphenator = HyphenationCompoundWordTokenFilter
-          .getHyphenationTree(reader);
-
+```cs
+const LuceneVersion matchVersion = LuceneVersion.LUCENE_48;
+
+[Test]
+public void TestHyphenationCompoundWordsDE()
+{
+    string[] dictionary = {
+        "Rind", "Fleisch", "Draht", "Schere", "Gesetz",
+        "Aufgabe", "Überwachung" };
+
+    using Stream stream = new FileStream("de_DR.xml", FileMode.Open);
+
+    HyphenationTree hyphenator = HyphenationCompoundWordTokenFilter.GetHyphenationTree(stream);
+
     HyphenationCompoundWordTokenFilter tf = new HyphenationCompoundWordTokenFilter(
-        new WhitespaceTokenizer(new StringReader(
-            "Rindfleischüberwachungsgesetz Drahtschere abba")), hyphenator,
-        dict, CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
+        matchVersion,
+        new WhitespaceTokenizer(
+            matchVersion,
+            new StringReader("Rindfleischüberwachungsgesetz Drahtschere abba")),
+        hyphenator,
+        dictionary,
+        CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
         CompoundWordTokenFilterBase.DEFAULT_MIN_SUBWORD_SIZE,
-        CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE, false);
+        CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE,
+        onlyLongestMatch: false);
+
+    ICharTermAttribute t = tf.AddAttribute<ICharTermAttribute>();
+    while (tf.IncrementToken())
+    {
+        Console.WriteLine(t);
+    }
+}
+
+[Test]
+public void TestHyphenationCompoundWordsWithoutDictionaryDE()
+{
+    using Stream stream = new FileStream("de_DR.xml", FileMode.Open);
+
+    HyphenationTree hyphenator = HyphenationCompoundWordTokenFilter.GetHyphenationTree(stream);
-      CharTermAttribute t = tf.addAttribute(CharTermAttribute.class);
-      while (tf.incrementToken()) {
-        System.out.println(t);
-      }
-    }
-
-    public void testHyphenationCompoundWordsWithoutDictionaryDE() throws Exception {
-      Reader reader = new FileReader("de_DR.xml");
-
-      HyphenationTree hyphenator = HyphenationCompoundWordTokenFilter
-          .getHyphenationTree(reader);
     HyphenationCompoundWordTokenFilter tf = new HyphenationCompoundWordTokenFilter(
-        new WhitespaceTokenizer(new StringReader(
-            "Rindfleischüberwachungsgesetz Drahtschere abba")), hyphenator);
-
-      CharTermAttribute t = tf.addAttribute(CharTermAttribute.class);
-      while (tf.incrementToken()) {
-        System.out.println(t);
-      }
-    }
-
-    public void testDumbCompoundWordsSE() throws Exception {
-      String[] dict = { "Bil", "Dörr", "Motor", "Tak", "Borr", "Slag", "Hammar",
-          "Pelar", "Glas", "Ögon", "Fodral", "Bas", "Fiol", "Makare", "Gesäll",
-          "Sko", "Vind", "Rute", "Torkare", "Blad" };
-
+        new WhitespaceTokenizer(matchVersion,
+            new StringReader("Rindfleischüberwachungsgesetz Drahtschere abba")),
+        hyphenator);
+
+    ICharTermAttribute t = tf.AddAttribute<ICharTermAttribute>();
+    while (tf.IncrementToken())
+    {
+        Console.WriteLine(t);
+    }
+}
+
+[Test]
+public void TestDumbCompoundWordsSE()
+{
+    string[] dictionary = {
+        "Bil", "Dörr", "Motor", "Tak", "Borr", "Slag", "Hammar",
+        "Pelar", "Glas", "Ögon", "Fodral", "Bas", "Fiol", "Makare", "Gesäll",
+        "Sko", "Vind", "Rute", "Torkare", "Blad" };
+
     DictionaryCompoundWordTokenFilter tf = new DictionaryCompoundWordTokenFilter(
-        new WhitespaceTokenizer(
-            new StringReader(
-                "Bildörr Bilmotor Biltak Slagborr Hammarborr Pelarborr Glasögonfodral Basfiolsfodral Basfiolsfodralmakaregesäll Skomakare Vindrutetorkare Vindrutetorkarblad abba")),
-        dict);
-      CharTermAttribute t = tf.addAttribute(CharTermAttribute.class);
-      while (tf.incrementToken()) {
-        System.out.println(t);
-      }
-    }
\ No newline at end of file
+        new WhitespaceTokenizer(
+            matchVersion,
+            new StringReader(
+                "Bildörr Bilmotor Biltak Slagborr Hammarborr Pelarborr Glasögonfodral Basfiolsfodral Basfiolsfodralmakaregesäll Skomakare Vindrutetorkare Vindrutetorkarblad abba")),
+        dictionary);
+    ICharTermAttribute t = tf.AddAttribute<ICharTermAttribute>();
+    while (tf.IncrementToken())
+    {
+        Console.WriteLine(t);
+    }
+}
+```
\ No newline at end of file
diff --git a/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std31/package.md b/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std31/package.md
index 8901325..3894608 100644
--- a/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std31/package.md
+++ b/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std31/package.md
@@ -1,4 +1,4 @@
----
+---
 uid: Lucene.Net.Analysis.Standard.Std31
 summary: *content
 ---
@@ -20,4 +20,4 @@ summary: *content
    limitations under the License.
 -->
 
-Backwards-compatible implementation to match [#LUCENE_31](xref:Lucene.Net.Util.Version)
\ No newline at end of file
+Backwards-compatible implementation to match [LuceneVersion.LUCENE_31](xref:Lucene.Net.Util.LuceneVersion#Lucene_Net_Util_LuceneVersion_LUCENE_31)
\ No newline at end of file
diff --git a/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std34/package.md b/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std34/package.md
index 0c8297f..c49592d 100644
--- a/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std34/package.md
+++ b/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std34/package.md
@@ -1,4 +1,4 @@
----
+---
 uid: Lucene.Net.Analysis.Standard.Std34
 summary: *content
 ---
@@ -20,4 +20,4 @@ summary: *content
    limitations under the License.
 -->
 
-Backwards-compatible implementation to match [#LUCENE_34](xref:Lucene.Net.Util.Version)
\ No newline at end of file
+Backwards-compatible implementation to match [LuceneVersion.LUCENE_34](xref:Lucene.Net.Util.LuceneVersion#Lucene_Net_Util_LuceneVersion_LUCENE_34)
\ No newline at end of file
diff --git a/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std36/package.md b/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std36/package.md
index c12cfaa..50b81da 100644
--- a/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std36/package.md
+++ b/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std36/package.md
@@ -1,4 +1,4 @@
----
+---
 uid: Lucene.Net.Analysis.Standard.Std36
 summary: *content
 ---
@@ -20,4 +20,4 @@ summary: *content
    limitations under the License.
 -->
 
-Backwards-compatible implementation to match [#LUCENE_36](xref:Lucene.Net.Util.Version)
\ No newline at end of file
+Backwards-compatible implementation to match [LuceneVersion.LUCENE_36](xref:Lucene.Net.Util.LuceneVersion#Lucene_Net_Util_LuceneVersion_LUCENE_36)
\ No newline at end of file
diff --git a/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std40/package.md b/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std40/package.md
index 62c466b..50f60cb 100644
--- a/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std40/package.md
+++ b/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std40/package.md
@@ -1,4 +1,4 @@
----
+---
 uid: Lucene.Net.Analysis.Standard.Std40
 summary: *content
 ---
@@ -20,4 +20,4 @@ summary: *content
    limitations under the License.
 -->
 
-Backwards-compatible implementation to match [#LUCENE_40](xref:Lucene.Net.Util.Version)
\ No newline at end of file
+Backwards-compatible implementation to match [LuceneVersion.LUCENE_40](xref:Lucene.Net.Util.LuceneVersion#Lucene_Net_Util_LuceneVersion_LUCENE_40)
\ No newline at end of file
diff --git a/src/Lucene.Net.Analysis.Common/Analysis/Standard/package.md b/src/Lucene.Net.Analysis.Common/Analysis/Standard/package.md
index 9a3f1d1..d2202e3 100644
--- a/src/Lucene.Net.Analysis.Common/Analysis/Standard/package.md
+++ b/src/Lucene.Net.Analysis.Common/Analysis/Standard/package.md
@@ -1,4 +1,4 @@
----
+---
 uid: Lucene.Net.Analysis.Standard
 summary: *content
 ---
@@ -22,43 +22,41 @@ summary: *content
 
 Fast, general-purpose grammar-based tokenizers.
 
-The `org.apache.lucene.analysis.standard` package contains three fast grammar-based tokenizers constructed with JFlex:
-
-* <xref:Lucene.Net.Analysis.Standard.StandardTokenizer>:
-  as of Lucene 3.1, implements the Word Break rules from the Unicode Text
-  Segmentation algorithm, as specified in
-  [Unicode Standard Annex #29](http://unicode.org/reports/tr29/).
-  Unlike `UAX29URLEmailTokenizer`, URLs and email addresses are
-  __not__ tokenized as single tokens, but are instead split up into
-  tokens according to the UAX#29 word break rules.
-
-  [StandardAnalyzer](xref:Lucene.Net.Analysis.Standard.StandardAnalyzer) includes
-  [StandardTokenizer](xref:Lucene.Net.Analysis.Standard.StandardTokenizer),
-  [StandardFilter](xref:Lucene.Net.Analysis.Standard.StandardFilter),
-  [LowerCaseFilter](xref:Lucene.Net.Analysis.Core.LowerCaseFilter)
-  and [StopFilter](xref:Lucene.Net.Analysis.Core.StopFilter).
-  When the `Version` specified in the constructor is lower than
-  3.1, the [ClassicTokenizer](xref:Lucene.Net.Analysis.Standard.ClassicTokenizer)
-  implementation is invoked.
-
-* [ClassicTokenizer](xref:Lucene.Net.Analysis.Standard.ClassicTokenizer):
-  this class was formerly (prior to Lucene 3.1) named
-  `StandardTokenizer`. (Its tokenization rules are not
-  based on the Unicode Text Segmentation algorithm.)
-  [ClassicAnalyzer](xref:Lucene.Net.Analysis.Standard.ClassicAnalyzer) includes
-  [ClassicTokenizer](xref:Lucene.Net.Analysis.Standard.ClassicTokenizer),
-  [StandardFilter](xref:Lucene.Net.Analysis.Standard.StandardFilter),
-  [LowerCaseFilter](xref:Lucene.Net.Analysis.Core.LowerCaseFilter)
-  and [StopFilter](xref:Lucene.Net.Analysis.Core.StopFilter).
-
-* [UAX29URLEmailTokenizer](xref:Lucene.Net.Analysis.Standard.UAX29URLEmailTokenizer):
-  implements the Word Break rules from the Unicode Text Segmentation
-  algorithm, as specified in
-  [Unicode Standard Annex #29](http://unicode.org/reports/tr29/).
-  URLs and email addresses are also tokenized according to the relevant RFCs.
-
-  [UAX29URLEmailAnalyzer](xref:Lucene.Net.Analysis.Standard.UAX29URLEmailAnalyzer) includes
-  [UAX29URLEmailTokenizer](xref:Lucene.Net.Analysis.Standard.UAX29URLEmailTokenizer),
-  [StandardFilter](xref:Lucene.Net.Analysis.Standard.StandardFilter),
-  [LowerCaseFilter](xref:Lucene.Net.Analysis.Core.LowerCaseFilter)
-  and [StopFilter](xref:Lucene.Net.Analysis.Core.StopFilter).
\ No newline at end of file
+The <xref:Lucene.Net.Analysis.Standard> package contains three fast grammar-based tokenizers constructed with JFlex:
+
+* <xref:Lucene.Net.Analysis.Standard.StandardTokenizer>:
+  as of Lucene 3.1, implements the Word Break rules from the Unicode Text
+  Segmentation algorithm, as specified in
+  [Unicode Standard Annex #29](http://unicode.org/reports/tr29/).
+  Unlike `UAX29URLEmailTokenizer`, URLs and email addresses are
+  __not__ tokenized as single tokens, but are instead split up into
+  tokens according to the UAX#29 word break rules.<br/><br/>
+  [StandardAnalyzer](xref:Lucene.Net.Analysis.Standard.StandardAnalyzer) includes
+  [StandardTokenizer](xref:Lucene.Net.Analysis.Standard.StandardTokenizer),
+  [StandardFilter](xref:Lucene.Net.Analysis.Standard.StandardFilter),
+  [LowerCaseFilter](xref:Lucene.Net.Analysis.Core.LowerCaseFilter)
+  and [StopFilter](xref:Lucene.Net.Analysis.Core.StopFilter).
+  When the `LuceneVersion` specified in the constructor is lower than
+  3.1, the [ClassicTokenizer](xref:Lucene.Net.Analysis.Standard.ClassicTokenizer)
+  implementation is invoked.
+
+* [ClassicTokenizer](xref:Lucene.Net.Analysis.Standard.ClassicTokenizer):
+  this class was formerly (prior to Lucene 3.1) named
+  `StandardTokenizer`. (Its tokenization rules are not
+  based on the Unicode Text Segmentation algorithm.)
+  [ClassicAnalyzer](xref:Lucene.Net.Analysis.Standard.ClassicAnalyzer) includes
+  [ClassicTokenizer](xref:Lucene.Net.Analysis.Standard.ClassicTokenizer),
+  [StandardFilter](xref:Lucene.Net.Analysis.Standard.StandardFilter),
+  [LowerCaseFilter](xref:Lucene.Net.Analysis.Core.LowerCaseFilter)
+  and [StopFilter](xref:Lucene.Net.Analysis.Core.StopFilter).
+
+* [UAX29URLEmailTokenizer](xref:Lucene.Net.Analysis.Standard.UAX29URLEmailTokenizer):
+  implements the Word Break rules from the Unicode Text Segmentation
+  algorithm, as specified in
+  [Unicode Standard Annex #29](http://unicode.org/reports/tr29/).
+  URLs and email addresses are also tokenized according to the relevant RFCs.<br/><br/>
+  [UAX29URLEmailAnalyzer](xref:Lucene.Net.Analysis.Standard.UAX29URLEmailAnalyzer) includes
+  [UAX29URLEmailTokenizer](xref:Lucene.Net.Analysis.Standard.UAX29URLEmailTokenizer),
+  [StandardFilter](xref:Lucene.Net.Analysis.Standard.StandardFilter),
+  [LowerCaseFilter](xref:Lucene.Net.Analysis.Core.LowerCaseFilter)
+  and [StopFilter](xref:Lucene.Net.Analysis.Core.StopFilter).
\ No newline at end of file
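The unigram-versus-bigram contrast documented in the Cjk and Cn package files changed above can be sketched language-agnostically. This is a hedged illustration of the segmentation schemes only, not the Lucene.NET tokenizer code.

```python
# Sketch (not Lucene.NET code): the two mechanical segmentation schemes from
# the Cjk/Cn docs, applied to the example phrase "我是中国人".

def unigrams(text):
    # StandardAnalyzer / ChineseAnalyzer scheme: one token per character.
    return list(text)

def bigrams(text):
    # CJKAnalyzer scheme: overlapping groups of two adjacent characters.
    return [text[i:i + 2] for i in range(len(text) - 1)]

phrase = "我是中国人"
print("-".join(unigrams(phrase)))  # 我-是-中-国-人
print("-".join(bigrams(phrase)))   # 我是-是中-中国-国人
```

SmartChineseAnalyzer's word segmentation (我-是-中国-人) is dictionary- and model-driven and is not reproducible by a simple rule like these two.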
