NightOwl888 commented on issue #732:
URL: https://github.com/apache/lucenenet/issues/732#issuecomment-1300962409

   Thanks.
   
   I am having trouble getting my VM up and running for Java debugging, but 
since I can read Thai, I reviewed the test and came up with a theory.
   
   The original test is:
   
   ```c#
   AssertAnalyzesTo(a, "กล่องใส่รองเท้า ใส่ของอเนกประสงค์เปิดฝาด้านหน้า 
เนื้อพลาสติกแข็งชนิดเดียวกับแฟ้มหูหิ้ว",
                   new string[] { "กล่อง", "ใส่", "รองเท้า", "ใส่", 
"อเนกประสงค์", "ฝา", "หน้า", "เนื้อ", "พลาสติก", "แข็ง", "ชนิด", "แฟ้ม", "หู", 
"หิ้ว" });
   ```
   
   The test passes if you include all of the words that occur in the input:
   
   ```c#
   AssertAnalyzesTo(a, "กล่องใส่รองเท้า ใส่ของอเนกประสงค์เปิดฝาด้านหน้า 
เนื้อพลาสติกแข็งชนิดเดียวกับแฟ้มหูหิ้ว",
                   new string[] { "กล่อง", "ใส่", "รองเท้า", "ใส่", "ของ", 
"อเนกประสงค์", "เปิด", "ฝา", "ด้าน", "หน้า", "เนื้อ", "พลาสติก", "แข็ง", 
"ชนิด", "เดียว", "กับ", "แฟ้ม", "หู", "หิ้ว" });
   ```
   
   The words that are being excluded are:
   
   - ของ (things/items)
   - เปิด (open)
   - ด้าน (side/area)
   - เดียว (single/same)
   - กับ (with)
   
   These appear to be common Thai stop words. One thing to note: a tokenizer is only one component of an analyzer. I suspect the analyzer you are using contains a `StopFilter` that does not exist in the [analyzer for the test](https://github.com/apache/lucenenet/blob/c076e40b14d4c20e6fdfee4e28d0b3332cf6d0ce/src/Lucene.Net.Tests.Analysis.ICU/Analysis/Icu/Segmentation/TestICUTokenizer.cs#L76-L81).
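   
   For illustration, here is a minimal sketch of how a `StopFilter` in the analyzer chain would silently drop exactly those tokens. This assumes Lucene.NET 4.8 API names, and the stop word list below is hypothetical — it just mirrors the missing words above, not whatever list your analyzer actually uses:
   
   ```c#
   using Lucene.Net.Analysis;
   using Lucene.Net.Analysis.Core;
   using Lucene.Net.Analysis.Icu.Segmentation;
   using Lucene.Net.Analysis.Util;
   using Lucene.Net.Util;
   
   // Hypothetical stop word list that happens to match the missing tokens.
   var stopWords = new CharArraySet(LuceneVersion.LUCENE_48,
       new[] { "ของ", "เปิด", "ด้าน", "เดียว", "กับ" }, ignoreCase: false);
   
   Analyzer a = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
   {
       // Same tokenizer configuration as the test (cjkAsWords: false, myanmarAsWords: true)...
       Tokenizer tokenizer = new ICUTokenizer(reader,
           new DefaultICUTokenizerConfig(false, true));
       // ...but with a StopFilter appended, which removes the "missing" words.
       TokenStream stream = new StopFilter(LuceneVersion.LUCENE_48, tokenizer, stopWords);
       return new TokenStreamComponents(tokenizer, stream);
   });
   ```
   
   With that filter in the chain, the analyzer would produce the shorter token list from the original test even though the tokenizer itself emits all 19 words.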
   
   > **SIDE NOTE:** There does appear to be a discrepancy: the tests indicate they were ported from 7.1.0, but the production code indicates it was ported from 8.6.1. I need to check, but it is entirely possible this happened because we reviewed the production code and found it hadn't changed between 7.1.0 and 8.6.1, when we should have reviewed the tests as well. However, the [8.6.1 analyzer](https://github.com/apache/lucene/blob/releases/lucene-solr/8.6.1/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/TestICUTokenizer.java#L73-L79) is different from the [7.1.0 analyzer](https://github.com/apache/lucene/blob/releases/lucene-solr/7.1.0/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/TestICUTokenizer.java), which is what we are currently testing with. There is an extra `ICUNormalizer2Filter` in the 7.1.0 version of the test.
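   
   > For reference, the 7.1.0-style test chain amounts to roughly the following sketch (again assuming Lucene.NET 4.8 API names; I believe the parameterless `ICUNormalizer2Filter` constructor applies NFKC normalization by default):
   >
   > ```c#
   > using Lucene.Net.Analysis;
   > using Lucene.Net.Analysis.Icu;
   > using Lucene.Net.Analysis.Icu.Segmentation;
   >
   > // 7.1.0-style test analyzer: ICUTokenizer followed by ICUNormalizer2Filter.
   > Analyzer a = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
   > {
   >     Tokenizer tokenizer = new ICUTokenizer(reader,
   >         new DefaultICUTokenizerConfig(false, true));
   >     // The extra filter present in the 7.1.0 test but not in 8.6.1.
   >     TokenStream stream = new ICUNormalizer2Filter(tokenizer);
   >     return new TokenStreamComponents(tokenizer, stream);
   > });
   > ```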
   
   In any case, make sure your analyzer is built from the same components in 
both environments.
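   
   One quick way to verify that is to dump the tokens each environment's analyzer actually produces for the same input and diff the two lists. This is the standard `TokenStream` consumption pattern (`"field"` is an arbitrary field name for illustration):
   
   ```c#
   using System.Collections.Generic;
   using Lucene.Net.Analysis;
   using Lucene.Net.Analysis.TokenAttributes;
   
   static IList<string> GetTokens(Analyzer analyzer, string text)
   {
       var tokens = new List<string>();
       using (TokenStream ts = analyzer.GetTokenStream("field", text))
       {
           var termAtt = ts.AddAttribute<ICharTermAttribute>();
           ts.Reset();                      // required before the first IncrementToken()
           while (ts.IncrementToken())
               tokens.Add(termAtt.ToString());
           ts.End();                        // finalize end-of-stream state
       }
       return tokens;
   }
   ```
   
   Run the Thai input through the analyzer in each environment and compare the returned lists; any words present in one list but not the other point at a difference in the filter chain.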
