Re: [PR] feat: Optimize SmartCn Dictionaries and Add Dictionary Loading Tests [lucenenet]

via GitHub Thu, 24 Apr 2025 00:52:41 -0700


NightOwl888 commented on code in PR #1154:
URL: https://github.com/apache/lucenenet/pull/1154#discussion_r2057772664



##########
src/Lucene.Net.Tests.Analysis.SmartCn/DictionaryTests.cs:
##########
@@ -0,0 +1,72 @@
+using Lucene.Net.Util;
+using Lucene.Net.Analysis.Cn.Smart.Hhmm;
+using Lucene.Net.Attributes;
+using NUnit.Framework;
+using System;
+using System.IO;
+using System.Reflection;
+
+
+[TestFixture]
+[LuceneNetSpecific]
+public class DictionaryTests : LuceneTestCase

Review Comment:
   @NehanPathan 
   
   Unfortunately, I am unable to determine where you are going wrong without 
seeing the code, but I suspect you weren't changing the namespace to end in 
`Hhmm`, which is required to match the `Hhmm` folder for the resource files. I 
created [a working 
demo](https://github.com/apache/lucenenet/compare/master...NightOwl888:lucenenet:demo/smartcn-custom-dictionary-loading)
 that you can use to determine what the issue is with your approach.
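   
   As a rough sketch of why the namespace matters (this is not taken from the 
   demo, and `ResourceLookupSketch` is a hypothetical helper): by default an 
   embedded resource is named after the project's root namespace plus its 
   folder path, and `Assembly.GetManifestResourceStream(Type, string)` scopes 
   the name by the namespace of the type you pass in. Assuming the test 
   project's root namespace is `Lucene.Net.Analysis.Cn.Smart`, a type outside 
   of `...Hhmm` would look for the wrong name.
   
   ```csharp
   using System;
   using System.IO;

   // The namespace ends in Hhmm so it lines up with the Hhmm folder.
   namespace Lucene.Net.Analysis.Cn.Smart.Hhmm
   {
       public static class ResourceLookupSketch // hypothetical helper
       {
           // The scoped name the lookup uses: "<namespace of type>.<fileName>".
           public static string ScopedName(Type type, string fileName)
               => type.Namespace is null ? fileName : type.Namespace + "." + fileName;

           // Returns null when no embedded resource has that scoped name.
           public static Stream Open(Type type, string fileName)
               => type.Assembly.GetManifestResourceStream(type, fileName);
       }
   }
   ```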
   
   Note that I discovered another missing piece to the puzzle - 
`OneTimeTearDown()` is required to nullify the `ANALYSIS_DATA_DIR` for the 
other tests because this static field will last the lifetime of the AppDomain. 
Setting it to `null` ensures the other tests will use the tables that are 
embedded in `Lucene.Net.Analysis.SmartCn` rather than this new directory.
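   
   A minimal sketch of that setup/teardown pairing, under a few assumptions: 
   that `LuceneTestCase` exposes overridable `OneTimeSetUp()` and 
   `OneTimeTearDown()` methods, that the field lives on `AnalyzerProfile` in 
   the parent namespace, and that `dataDir` is a placeholder for wherever the 
   test unpacks its dictionaries. The demo branch is the reference; this only 
   illustrates the idea.
   
   ```csharp
   using Lucene.Net.Attributes;
   using Lucene.Net.Util;
   using NUnit.Framework;
   using System.IO;

   namespace Lucene.Net.Analysis.Cn.Smart.Hhmm
   {
       [TestFixture]
       [LuceneNetSpecific]
       public class DictionaryTests : LuceneTestCase
       {
           private string dataDir; // placeholder location for the unpacked .dct files

           public override void OneTimeSetUp()
           {
               base.OneTimeSetUp();
               dataDir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
               Directory.CreateDirectory(dataDir);
               // ... unpack the dictionary files into dataDir here ...
               AnalyzerProfile.ANALYSIS_DATA_DIR = dataDir;
           }

           public override void OneTimeTearDown()
           {
               // The field is static, so reset it before other fixtures run;
               // otherwise they keep pointing at this custom directory.
               AnalyzerProfile.ANALYSIS_DATA_DIR = null;
               base.OneTimeTearDown();
           }
       }
   }
   ```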
   
   I also discovered that there is a problem with the data in the 
`coredict.dct` file you provided. I downloaded the one from 
[LUCENE-1629](https://issues.apache.org/jira/browse/LUCENE-1629) to confirm the 
logic in the parser works with the original data format (using the code on the 
master branch). I will be checking the tests with the original files to ensure 
the business logic can still load them after these changes.
   
   Note that `coredict.dct` is the smaller of the two files, and if we zip 
them, the original file should be small enough to use in the test. If you 
don't care to dig into why the new test `coredict.dct` is not working 
correctly, it would be acceptable to use the original one, provided both files 
are zipped (it is about 1.5 MB, and zipping it will reduce that even further). 
All that would be required is to come up with some new test conditions to 
ensure the data loaded correctly, to create a new `custom-dictionary-input.zip` 
containing the new `bigramdict.dct` file (from this PR) and the original 
`coredict.dct` file (from LUCENE-1629), and to place it in the same directory 
as the `TestBuildDictionary.cs` file.
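   
   Any zip tool will do for building the archive. If you prefer to script it, 
   here is a small sketch using `System.IO.Compression`; 
   `DictionaryZipSketch` and its parameter names are made up for 
   illustration, and only the entry names and the archive name come from the 
   description above.
   
   ```csharp
   using System.IO;
   using System.IO.Compression;

   public static class DictionaryZipSketch // hypothetical helper
   {
       // Packs the two dictionary files into a single archive at zipPath.
       public static void Create(string bigramDictPath, string coreDictPath, string zipPath)
       {
           // ZipArchiveMode.Create fails if the target already exists.
           if (File.Exists(zipPath)) File.Delete(zipPath);

           using (ZipArchive archive = ZipFile.Open(zipPath, ZipArchiveMode.Create))
           {
               archive.CreateEntryFromFile(bigramDictPath, "bigramdict.dct", CompressionLevel.Optimal);
               archive.CreateEntryFromFile(coreDictPath, "coredict.dct", CompressionLevel.Optimal);
           }
       }
   }
   ```
   
   Calling `DictionaryZipSketch.Create(...)` with `custom-dictionary-input.zip` 
   as the target produces an archive with both entries at its root.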



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
