NightOwl888 commented on issue #732: URL: https://github.com/apache/lucenenet/issues/732#issuecomment-1300962409
Thanks. I am having trouble getting my VM up and running for Java debugging, but since I can read Thai, I reviewed the test and came up with a theory.

The original test is:

```c#
AssertAnalyzesTo(a, "กล่องใส่รองเท้า ใส่ของอเนกประสงค์เปิดฝาด้านหน้า เนื้อพลาสติกแข็งชนิดเดียวกับแฟ้มหูหิ้ว",
    new string[] { "กล่อง", "ใส่", "รองเท้า", "ใส่", "อเนกประสงค์", "ฝา", "หน้า", "เนื้อ", "พลาสติก", "แข็ง", "ชนิด", "แฟ้ม", "หู", "หิ้ว" });
```

The test passes if the expected tokens include every word that appears in the input:

```c#
AssertAnalyzesTo(a, "กล่องใส่รองเท้า ใส่ของอเนกประสงค์เปิดฝาด้านหน้า เนื้อพลาสติกแข็งชนิดเดียวกับแฟ้มหูหิ้ว",
    new string[] { "กล่อง", "ใส่", "รองเท้า", "ใส่", "ของ", "อเนกประสงค์", "เปิด", "ฝา", "ด้าน", "หน้า", "เนื้อ", "พลาสติก", "แข็ง", "ชนิด", "เดียว", "กับ", "แฟ้ม", "หู", "หิ้ว" });
```

The words that are being excluded are:

- ของ (things/items)
- เปิด (open)
- ด้าน (side/area)
- เดียว (also)
- กับ (with)

These appear to be common stop words.

One thing to note: a tokenizer is only one component of an analyzer. I suspect you have a `StopFilter` in the analyzer you are using that does not exist in the [analyzer for the test](https://github.com/apache/lucenenet/blob/c076e40b14d4c20e6fdfee4e28d0b3332cf6d0ce/src/Lucene.Net.Tests.Analysis.ICU/Analysis/Icu/Segmentation/TestICUTokenizer.cs#L76-L81).

> **SIDE NOTE:** There does appear to be a discrepancy: the tests indicate they were ported from 7.1.0, but the production code indicates it was ported from 8.6.1. I need to check, but it is entirely possible that we reviewed the production code, found it hadn't changed between 7.1.0 and 8.6.1, and updated only its version marker. We should have done the same for the tests.
However, the [8.6.1 analyzer](https://github.com/apache/lucene/blob/releases/lucene-solr/8.6.1/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/TestICUTokenizer.java#L73-L79) is different from the [7.1.0 analyzer](https://github.com/apache/lucene/blob/releases/lucene-solr/8.6.1/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/TestICUTokenizer.java#L73-L79), which is what we are currently testing with. There is an extra `ICUNormalizer2Filter` in the 7.1.0 version of the test.

In any case, make sure your analyzer is built from the same components in both environments.
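
To make the stop-word theory concrete, here is a minimal sketch in plain C# with no Lucene.NET dependency. It takes the full token list from the passing assertion and drops the five suspected stop words. The class name `StopWordCheck` and the helper `RemoveStopWords` are hypothetical, made up for illustration. They only mimic what a stop filter is assumed to be doing in your analyzer chain; they are not the Lucene.NET API.

```c#
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordCheck
{
    // Tokens produced by the tokenizer alone (taken from the passing assertion).
    public static readonly string[] AllTokens =
    {
        "กล่อง", "ใส่", "รองเท้า", "ใส่", "ของ", "อเนกประสงค์", "เปิด", "ฝา", "ด้าน", "หน้า",
        "เนื้อ", "พลาสติก", "แข็ง", "ชนิด", "เดียว", "กับ", "แฟ้ม", "หู", "หิ้ว"
    };

    // The five words missing from the originally expected output (assumed to be stop words).
    public static readonly HashSet<string> SuspectedStopWords =
        new HashSet<string>(StringComparer.Ordinal) { "ของ", "เปิด", "ด้าน", "เดียว", "กับ" };

    // Hypothetical helper: drops every token found in the stop set, preserving order.
    public static string[] RemoveStopWords(IEnumerable<string> tokens, ISet<string> stopWords)
    {
        return tokens.Where(t => !stopWords.Contains(t)).ToArray();
    }

    public static void Main()
    {
        string[] filtered = RemoveStopWords(AllTokens, SuspectedStopWords);
        Console.WriteLine(string.Join(", ", filtered));
        Console.WriteLine(filtered.Length); // 14, the same count as the original expectation
    }
}
```

The filtered list matches the 14 tokens in the original expectation, which is consistent with a stop filter being present in one environment and absent in the other.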