Re: [PR] feat: Optimize SmartCn Dictionaries and Add Dictionary Loading Tests [lucenenet]

via GitHub Thu, 24 Apr 2025 09:12:47 -0700


NightOwl888 commented on code in PR #1154:
URL: https://github.com/apache/lucenenet/pull/1154#discussion_r2058764874



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs:
##########
@@ -286,37 +304,37 @@ public virtual void LoadFromFile(string dctFilePath)
                 int j = 0;
                 while (j < cnt)
                 {
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    buffer[0] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                        .GetInt32();// frequency
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    buffer[1] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                        .GetInt32();// length
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    // buffer[2] = ByteBuffer.wrap(intBuffer).order(
-                    // ByteOrder.LITTLE_ENDIAN).getInt();// handle
+                    // LUCENENET: Use BinaryReader to decode little endian 
instead of ByteBuffer, since this is the default in .NET
+                    buffer[0] = reader.ReadInt32(); // frequency
+                    buffer[1] = reader.ReadInt32(); // length
+                    buffer[2] = reader.ReadInt32(); // Skip handle value 
(unused)
 
                     length = buffer[1];
-                    if (length > 0)
+                    if (length > 0 && length <= MAX_VALID_LENGTH && 
dctFile.Position + length <= dctFile.Length)

Review Comment:
   What is MAX_VALID_LENGTH value of 1000 based on? I don't see it upstream and 
it isn't clear why there is a max limitation of 1000 here.
   
   Ideally we would leave the same condition (`length > 0)` here as it was 
upstream so we get the same failure conditions, but if there is some reasoning 
behind this change that isn't commented, please explain.



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/WordDictionary.cs:
##########
@@ -340,62 +340,61 @@ private void SaveToObj(FileInfo serialObj)
         /// <summary>
         /// Load the datafile into this <see cref="WordDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">path to word dictionary 
(coredict.dct)</param>
-        /// <returns>number of words read</returns>
+        /// <param name="dctFilePath">Path to word dictionary 
(coredict.dct)</param>
+        /// <returns>Number of words read</returns>
         /// <exception cref="IOException">If there is a low-level I/O 
error.</exception>
         private int LoadMainDataFromFile(string dctFilePath)
         {
             int i, cnt, length, total = 0;
-            // The file only counted 6763 Chinese characters plus 5 reserved 
slots 3756~3760.
+
+            // The file only counted 6763 Chinese characters plus 5 reserved 
slots (3756~3760).
             // The 3756th is used (as a header) to store information.
-            int[]
-            buffer = new int[3];
-            byte[] intBuffer = new byte[4];
-            string tmpword;

Review Comment:
   Please declare `tmpword` after `buffer` below, not inline.



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs:
##########
@@ -254,30 +254,48 @@ private void Load(string dictRoot)
         /// <summary>
         /// Load the datafile into this <see cref="BigramDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">dctFilePath path to the Bigramdictionary 
(bigramdict.dct)</param>
+        /// <param name="dctFilePath">Path to the Bigramdictionary 
(bigramdict.dct)</param>
         /// <exception cref="IOException">If there is a low-level I/O 
error</exception>
         public virtual void LoadFromFile(string dctFilePath)
         {
             int i, cnt, length, total = 0;
+
             // The file only counted 6763 Chinese characters plus 5 reserved 
slots 3756~3760.
             // The 3756th is used (as a header) to store information.
-            int[]
-            buffer = new int[3];
-            byte[] intBuffer = new byte[4];
+
+            Span<int> buffer = stackalloc int[3];
             string tmpword;
+
+            // LUCENENET: Removed buffer and intBuffer arrays since 
BinaryReader handles reading values directly in a more type-safe and readable 
way.

Review Comment:
   Please update this comment to say `intBuffer array`, since the `buffer` 
variable has been restored.



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs:
##########
@@ -254,80 +254,83 @@ private void Load(string dictRoot)
         /// <summary>
         /// Load the datafile into this <see cref="BigramDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">dctFilePath path to the Bigramdictionary 
(bigramdict.dct)</param>
+        /// <param name="dctFilePath">Path to the Bigramdictionary 
(bigramdict.dct)</param>
         /// <exception cref="IOException">If there is a low-level I/O 
error</exception>
         public virtual void LoadFromFile(string dctFilePath)
         {
-            int i, cnt, length, total = 0;
             // The file only counted 6763 Chinese characters plus 5 reserved 
slots 3756~3760.
             // The 3756th is used (as a header) to store information.
-            int[]
-            buffer = new int[3];
-            byte[] intBuffer = new byte[4];
-            string tmpword;
+
+            // LUCENENET: Removed buffer and intBuffer arrays since 
BinaryReader handles reading values directly in a more type-safe and readable 
way.
+            // LUCENENET specific - refactored constants for clarity
+            const int HEADER_POSITION = 3755;
+            const int MAX_VALID_LENGTH = 1000;
+
             //using (RandomAccessFile dctFile = new 
RandomAccessFile(dctFilePath, "r"))
             using var dctFile = new FileStream(dctFilePath, FileMode.Open, 
FileAccess.Read);
+            using var reader = new BinaryReader(dctFile);
 
             // GB2312 characters 0 - 6768
-            for (i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
+            for (int i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
             {
-                string currentStr = GetCCByGB2312Id(i);
-                // if (i == 5231)
-                // System.out.println(i);
 
-                dctFile.Read(intBuffer, 0, intBuffer.Length);
-                // the dictionary was developed for C, and byte order must be 
converted to work with Java
-                cnt = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian).GetInt32();
+                string currentStr = GetCCByGB2312Id(i); 
+                int cnt;
+                try
+                {
+                   cnt = reader.ReadInt32();  // LUCENENET: Use BinaryReader 
methods instead of ByteBuffer
+                }
+                catch (EndOfStreamException)

Review Comment:
   Given that the length is hard coded as a constant instead of being inferred 
or read from the file, I guess we can leave this here. It seems like the file 
must contain exactly this set of words and nothing more or less, but it can be 
customized to change the frequencies. Since it is like this upstream, I guess 
it is fine for now.



##########
src/Lucene.Net.Tests.Analysis.SmartCn/Hhmm/TestBuildDictionary.cs:
##########
@@ -0,0 +1,110 @@
+using J2N;
+using Lucene.Net.Analysis.Cn.Smart;
+using Lucene.Net.Analysis.Cn.Smart.Hhmm;
+using Lucene.Net.Attributes;
+using Lucene.Net.Util;
+using NUnit.Framework;
+using System;
+using System.IO;
+using Assert = Lucene.Net.TestFramework.Assert;
+
+namespace Lucene.Net.Analysis.Cn.Smart.Hhmm
+{
+    /*
+     * Licensed to the Apache Software Foundation (ASF) under one or more
+     * contributor license agreements.  See the NOTICE file distributed with
+     * this work for additional information regarding copyright ownership.
+     * The ASF licenses this file to You under the Apache License, Version 2.0
+     * (the "License"); you may not use this file except in compliance with
+     * the License.  You may obtain a copy of the License at
+     *
+     *     http://www.apache.org/licenses/LICENSE-2.0
+     *
+     * Unless required by applicable law or agreed to in writing, software
+     * distributed under the License is distributed on an "AS IS" BASIS,
+     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+     * See the License for the specific language governing permissions and
+     * limitations under the License.
+     */
+
+    [LuceneNetSpecific]
+    public class TestBuildDictionary : LuceneTestCase
+    {
+        private DirectoryInfo tempDir;
+
+        public override void OneTimeSetUp()
+        {
+            base.OneTimeSetUp();
+            tempDir = CreateTempDir("smartcn-data");
+            AnalyzerProfile.ANALYSIS_DATA_DIR = tempDir.FullName;
+            using (var zipFileStream = 
typeof(TestBuildDictionary).FindAndGetManifestResourceStream("custom-dictionary-input.zip"))
+            {
+                TestUtil.Unzip(zipFileStream, tempDir);
+            }
+        }
+
+        public override void OneTimeTearDown()
+        {
+            AnalyzerProfile.ANALYSIS_DATA_DIR = null; // Ensure this test data 
is not loaded for other tests
+            base.OneTimeTearDown();
+        }
+
+        [Test]
+        public void TestBigramDictionary()
+        {
+            // First test - builds and loads dictionary from .dict file
+            BigramDictionary bigramDict = BigramDictionary.GetInstance();
+            CheckBigramDictionary(bigramDict);
+            
+            // Ensure .mem file was created
+            string memFile = System.IO.Path.Combine(tempDir.FullName,  
"bigramdict.mem");
+            Assert.IsTrue(File.Exists(memFile), "Memory file should be created 
after first load");
+            
+            // Delete the original .dict file
+            string dictFile = System.IO.Path.Combine(tempDir.FullName,  
"bigramdict.dct");
+            if (File.Exists(dictFile))

Review Comment:
   Rather than putting an if block here, make this an assert. We want the test 
to fail if the file isn't in the temp directory.



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/WordDictionary.cs:
##########
@@ -340,62 +340,61 @@ private void SaveToObj(FileInfo serialObj)
         /// <summary>
         /// Load the datafile into this <see cref="WordDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">path to word dictionary 
(coredict.dct)</param>
-        /// <returns>number of words read</returns>
+        /// <param name="dctFilePath">Path to word dictionary 
(coredict.dct)</param>
+        /// <returns>Number of words read</returns>
         /// <exception cref="IOException">If there is a low-level I/O 
error.</exception>
         private int LoadMainDataFromFile(string dctFilePath)
         {
             int i, cnt, length, total = 0;
-            // The file only counted 6763 Chinese characters plus 5 reserved 
slots 3756~3760.
+
+            // The file only counted 6763 Chinese characters plus 5 reserved 
slots (3756~3760).
             // The 3756th is used (as a header) to store information.
-            int[]
-            buffer = new int[3];
-            byte[] intBuffer = new byte[4];
-            string tmpword;
+
+            Span<int> buffer = stackalloc int[3];
+
+            // LUCENENET: Removed buffer and intBuffer arrays since 
BinaryReader handles reading values directly in a more type-safe and readable 
way.

Review Comment:
   Please update this comment to say `intBuffer array`, since the `buffer` 
variable has been restored.



##########
src/Lucene.Net.Tests.Analysis.SmartCn/Hhmm/TestBuildDictionary.cs:
##########
@@ -0,0 +1,110 @@
+using J2N;
+using Lucene.Net.Analysis.Cn.Smart;
+using Lucene.Net.Analysis.Cn.Smart.Hhmm;
+using Lucene.Net.Attributes;
+using Lucene.Net.Util;
+using NUnit.Framework;
+using System;
+using System.IO;
+using Assert = Lucene.Net.TestFramework.Assert;
+
+namespace Lucene.Net.Analysis.Cn.Smart.Hhmm
+{
+    /*
+     * Licensed to the Apache Software Foundation (ASF) under one or more
+     * contributor license agreements.  See the NOTICE file distributed with
+     * this work for additional information regarding copyright ownership.
+     * The ASF licenses this file to You under the Apache License, Version 2.0
+     * (the "License"); you may not use this file except in compliance with
+     * the License.  You may obtain a copy of the License at
+     *
+     *     http://www.apache.org/licenses/LICENSE-2.0
+     *
+     * Unless required by applicable law or agreed to in writing, software
+     * distributed under the License is distributed on an "AS IS" BASIS,
+     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+     * See the License for the specific language governing permissions and
+     * limitations under the License.
+     */
+
+    [LuceneNetSpecific]
+    public class TestBuildDictionary : LuceneTestCase
+    {
+        private DirectoryInfo tempDir;
+
+        public override void OneTimeSetUp()
+        {
+            base.OneTimeSetUp();
+            tempDir = CreateTempDir("smartcn-data");
+            AnalyzerProfile.ANALYSIS_DATA_DIR = tempDir.FullName;
+            using (var zipFileStream = 
typeof(TestBuildDictionary).FindAndGetManifestResourceStream("custom-dictionary-input.zip"))
+            {
+                TestUtil.Unzip(zipFileStream, tempDir);
+            }
+        }
+
+        public override void OneTimeTearDown()
+        {
+            AnalyzerProfile.ANALYSIS_DATA_DIR = null; // Ensure this test data 
is not loaded for other tests
+            base.OneTimeTearDown();
+        }
+
+        [Test]
+        public void TestBigramDictionary()
+        {
+            // First test - builds and loads dictionary from .dict file
+            BigramDictionary bigramDict = BigramDictionary.GetInstance();
+            CheckBigramDictionary(bigramDict);
+            
+            // Ensure .mem file was created
+            string memFile = System.IO.Path.Combine(tempDir.FullName,  
"bigramdict.mem");
+            Assert.IsTrue(File.Exists(memFile), "Memory file should be created 
after first load");
+            
+            // Delete the original .dict file
+            string dictFile = System.IO.Path.Combine(tempDir.FullName,  
"bigramdict.dct");
+            if (File.Exists(dictFile))
+            {
+                File.Delete(dictFile);
+            }
+            
+            // Second test - should load from .mem file now
+            bigramDict = BigramDictionary.GetInstance();
+            CheckBigramDictionary(bigramDict);
+        }
+        
+        private static void CheckBigramDictionary(BigramDictionary bigramDict)
+        {
+            Assert.AreEqual(10, bigramDict.GetFrequency("啊hello".AsSpan()), 
"Frequency for '啊hello' is incorrect.");
+            Assert.AreEqual(20, bigramDict.GetFrequency("阿world".AsSpan()), 
"Frequency for '阿world' is incorrect.");
+        }
+
+        [Test]
+        public void TestWordDictionary()
+        {
+            // First test - builds and loads dictionary from .dict file
+            WordDictionary wordDict = WordDictionary.GetInstance();
+            CheckWordDictionary(wordDict);
+            
+            // Ensure .mem file was created
+            string memFile = System.IO.Path.Combine(tempDir.FullName, 
"coredict.mem");
+            Assert.IsTrue(File.Exists(memFile), "Memory file should be created 
after first load");
+            
+            // Delete the original .dict file
+            string dictFile = System.IO.Path.Combine(tempDir.FullName, 
"coredict.dct");
+            if (File.Exists(dictFile))

Review Comment:
   Rather than putting an if block here, make this an assert. We want the test 
to fail if the file isn't in the temp directory.



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/AbstractDictionary.cs:
##########
@@ -162,7 +162,7 @@ public virtual long Hash1(char c)
         /// </summary>

Review Comment:
   Please fix the indentation so it looks the same as the above example (4 
spaces inside of `protected`).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] feat: Optimize SmartCn Dictionaries and Add Dictionary Loading Tests [lucenenet]

Reply via email to