Re: [PR] feat: Optimize SmartCn Dictionaries and Add Dictionary Loading Tests [lucenenet]

via GitHub Mon, 07 Apr 2025 21:58:19 -0700


paulirwin commented on code in PR #1154:
URL: https://github.com/apache/lucenenet/pull/1154#discussion_r2032292567



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs:
##########
@@ -254,80 +254,84 @@ private void Load(string dictRoot)
         /// <summary>
         /// Load the datafile into this <see cref="BigramDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">dctFilePath path to the Bigramdictionary (bigramdict.dct)</param>
+        /// <param name="dctFilePath">Path to the Bigramdictionary (bigramDict.dct)</param>
         /// <exception cref="IOException">If there is a low-level I/O error</exception>
         public virtual void LoadFromFile(string dctFilePath)
         {
-            int i, cnt, length, total = 0;
-            // The file only counted 6763 Chinese characters plus 5 reserved slots 3756~3760.
-            // The 3756th is used (as a header) to store information.
-            int[]
-            buffer = new int[3];
-            byte[] intBuffer = new byte[4];
-            string tmpword;
-            //using (RandomAccessFile dctFile = new RandomAccessFile(dctFilePath, "r"))
+            // Position of special header entry in the file structure
+            const int HEADER_POSITION = 3755;
+            // Maximum valid length for word entries to prevent loading corrupted data
+            const int MAX_VALID_LENGTH = 1000;
+
+            // Open file for reading in binary mode
             using var dctFile = new FileStream(dctFilePath, FileMode.Open, FileAccess.Read);
+            using var reader = new BinaryReader(dctFile);
 
-            // GB2312 characters 0 - 6768
-            for (i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + CHAR_NUM_IN_FILE; i++)
+            try

Review Comment:
   I think we should remove this try/catch, as it is not in the upstream code. 
If the file is corrupt and we attempt to read past the end, we should probably 
let that surface as an exception instead of swallowing it. But let me know if 
I'm misunderstanding and the try/catch is actually necessary.
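
   A minimal sketch of the behavior under discussion, assuming the loader reads 
its counts with `BinaryReader.ReadInt32()` as in the diff above. 
`TruncatedReadDemo` and its methods are hypothetical names used only for 
illustration; they are not part of the PR. With no try/catch around the read, a 
truncated input surfaces to the caller as an `EndOfStreamException`.

```csharp
using System;
using System.IO;

// Hypothetical demo type, not part of Lucene.NET.
public static class TruncatedReadDemo
{
    // Reads one 32-bit count from the given bytes. There is deliberately
    // no try/catch, so a truncated input propagates as an exception.
    public static int ReadCount(byte[] data)
    {
        using var stream = new MemoryStream(data);
        using var reader = new BinaryReader(stream);
        // Throws EndOfStreamException if fewer than 4 bytes remain.
        return reader.ReadInt32();
    }

    // Returns true when reading the given bytes fails with EndOfStreamException.
    public static bool FailsOnTruncation(byte[] data)
    {
        try
        {
            ReadCount(data);
            return false;
        }
        catch (EndOfStreamException)
        {
            return true;
        }
    }
}
```

Calling `TruncatedReadDemo.ReadCount(new byte[] { 5, 0 })` fails instead of 
silently producing a partially loaded dictionary, which is the outcome the 
comment above argues for.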



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs:
##########
@@ -254,80 +254,84 @@ private void Load(string dictRoot)
         /// <summary>
         /// Load the datafile into this <see cref="BigramDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">dctFilePath path to the Bigramdictionary (bigramdict.dct)</param>
+        /// <param name="dctFilePath">Path to the Bigramdictionary (bigramDict.dct)</param>
         /// <exception cref="IOException">If there is a low-level I/O error</exception>
         public virtual void LoadFromFile(string dctFilePath)
         {
-            int i, cnt, length, total = 0;
-            // The file only counted 6763 Chinese characters plus 5 reserved slots 3756~3760.
-            // The 3756th is used (as a header) to store information.
-            int[]
-            buffer = new int[3];
-            byte[] intBuffer = new byte[4];
-            string tmpword;
-            //using (RandomAccessFile dctFile = new RandomAccessFile(dctFilePath, "r"))
+            // Position of special header entry in the file structure

Review Comment:
   These two constants are not in the upstream code, so we should add a comment 
here like `// LUCENENET specific - refactored constants for clarity`.



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs:
##########
@@ -254,80 +254,84 @@ private void Load(string dictRoot)
         /// <summary>
         /// Load the datafile into this <see cref="BigramDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">dctFilePath path to the Bigramdictionary 
(bigramdict.dct)</param>
+        /// <param name="dctFilePath">Path to the Bigramdictionary 
(bigramDict.dct)</param>
         /// <exception cref="IOException">If there is a low-level I/O 
error</exception>
         public virtual void LoadFromFile(string dctFilePath)
         {
-            int i, cnt, length, total = 0;
-            // The file only counted 6763 Chinese characters plus 5 reserved 
slots 3756~3760.
-            // The 3756th is used (as a header) to store information.
-            int[]
-            buffer = new int[3];
-            byte[] intBuffer = new byte[4];
-            string tmpword;
-            //using (RandomAccessFile dctFile = new 
RandomAccessFile(dctFilePath, "r"))
+            // Position of special header entry in the file structure
+            const int HEADER_POSITION = 3755;
+            // Maximum valid length for word entries to prevent loading 
corrupted data
+            const int MAX_VALID_LENGTH = 1000;
+
+            // Open file for reading in binary mode
             using var dctFile = new FileStream(dctFilePath, FileMode.Open, 
FileAccess.Read);
+            using var reader = new BinaryReader(dctFile);
 
-            // GB2312 characters 0 - 6768
-            for (i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
+            try
             {
-                string currentStr = GetCCByGB2312Id(i);
-                // if (i == 5231)
-                // System.out.println(i);
-
-                dctFile.Read(intBuffer, 0, intBuffer.Length);
-                // the dictionary was developed for C, and byte order must be 
converted to work with Java
-                cnt = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian).GetInt32();
-                if (cnt <= 0)
-                {
-                    continue;
-                }
-                total += cnt;
-                int j = 0;
-                while (j < cnt)
+                // Iterate through all GB2312 characters in the valid range
+                for (int i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
                 {
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    buffer[0] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                        .GetInt32();// frequency
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    buffer[1] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                        .GetInt32();// length
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    // buffer[2] = ByteBuffer.wrap(intBuffer).order(
-                    // ByteOrder.LITTLE_ENDIAN).getInt();// handle
-
-                    length = buffer[1];
-                    if (length > 0)
+                    // Get the current Chinese character
+                    string currentStr = GetCCByGB2312Id(i);
+                    // Read the count of words starting with this character
+                    int cnt = reader.ReadInt32();
+
+                    // Skip if no words start with this character
+                    if (cnt <= 0) continue;
+
+                    // Process all words for the current character
+                    for (int j = 0; j < cnt; j++)
                     {
-                        byte[] lchBuffer = new byte[length];
-                        dctFile.Read(lchBuffer, 0, lchBuffer.Length);
-                        //tmpword = new String(lchBuffer, "GB2312");
-                        tmpword = gb2312Encoding.GetString(lchBuffer); // 
LUCENENET specific: use cached encoding instance from base class
-                        //tmpword = 
Encoding.GetEncoding("hz-gb-2312").GetString(lchBuffer);
-                        if (i != 3755 + GB2312_FIRST_CHAR)
-                        {
-                            tmpword = currentStr + tmpword;
-                        }
-                        char[] carray = tmpword.ToCharArray();
-                        long hashId = Hash1(carray);
-                        int index = GetAvaliableIndex(hashId, carray);
-                        if (index != -1)
+                        // Read word metadata
+                        int frequency = reader.ReadInt32();  // How often this 
word appears
+                        int length = reader.ReadInt32();     // Length of the 
word in bytes
+                        reader.ReadInt32();                  // Skip handle 
value (unused)
+
+                        // Validate word length and ensure we don't read past 
the file end
+                        if (length > 0 && length <= MAX_VALID_LENGTH && 
dctFile.Position + length <= dctFile.Length)
                         {
-                            if (bigramHashTable[index] == 0)
+                            // Read the word bytes and convert to string
+                            byte[] lchBuffer = reader.ReadBytes(length);
+                            string tmpword = 
gb2312Encoding.GetString(lchBuffer);
+
+                            // For regular entries (not header entries), 
prepend the current character
+                            if (i != HEADER_POSITION + GB2312_FIRST_CHAR)
+                            {
+                                tmpword = currentStr + tmpword;
+                            }
+
+                            // Create a span for efficient string handling
+                            ReadOnlySpan<char> carray = tmpword.AsSpan();
+                            // Generate hash for the word
+                            long hashId = Hash1(carray);
+                            // Find available slot in hash table
+                            int index = GetAvaliableIndex(hashId, carray);
+
+                            // Store word if a valid index was found
+                            if (index != -1)
                             {
-                                bigramHashTable[index] = hashId;
-                                // bigramStringTable[index] = tmpword;

Review Comment:
   Please restore this removed line, as it exists upstream.



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/WordDictionary.cs:
##########
@@ -340,80 +340,70 @@ private void SaveToObj(FileInfo serialObj)
         /// <summary>
         /// Load the datafile into this <see cref="WordDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">path to word dictionary (coredict.dct)</param>
+        /// <param name="dctFilePath">path to word dictionary (coreDict.dct)</param>

Review Comment:
   Please revert the casing change to the file name (see the related comment 
on `bigramdict.dct`).



##########
src/Lucene.Net.Tests.Analysis.SmartCn/DictionaryTests.cs:
##########
@@ -0,0 +1,75 @@
+using Lucene.Net.Util;
+using Lucene.Net.Analysis.Cn.Smart.Hhmm;
+using NUnit.Framework;
+using System;
+using System.IO;
+using System.Reflection;
+
+[TestFixture]
+public class DictionaryTests : LuceneTestCase
+{
+    private const string BigramResourceName = "Lucene.Net.Tests.Analysis.SmartCn.Resources.bigramDict.dct";
+
+    [Test, Category("Dictionary")]
+    public void TestBigramDictionary()
+    {
+        // Extract embedded resource
+        using var resourceStream = GetResourceStream(BigramResourceName);
+
+        // Copy to temp file
+        FileInfo _tempFile = CreateTempFile("bigramDict", ".dct");

Review Comment:
   See the earlier comment about casing. Use an all-lowercase file name here, 
and please rename the resource files to match.



##########
src/Lucene.Net.Tests.Analysis.SmartCn/DictionaryTests.cs:
##########
@@ -0,0 +1,75 @@
+using Lucene.Net.Util;
+using Lucene.Net.Analysis.Cn.Smart.Hhmm;
+using NUnit.Framework;
+using System;
+using System.IO;
+using System.Reflection;
+
+[TestFixture]

Review Comment:
   Please add a `[LuceneNetSpecific]` attribute since these tests do not exist 
upstream.



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs:
##########
@@ -254,80 +254,84 @@ private void Load(string dictRoot)
         /// <summary>
         /// Load the datafile into this <see cref="BigramDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">dctFilePath path to the Bigramdictionary 
(bigramdict.dct)</param>
+        /// <param name="dctFilePath">Path to the Bigramdictionary 
(bigramDict.dct)</param>
         /// <exception cref="IOException">If there is a low-level I/O 
error</exception>
         public virtual void LoadFromFile(string dctFilePath)
         {
-            int i, cnt, length, total = 0;
-            // The file only counted 6763 Chinese characters plus 5 reserved 
slots 3756~3760.
-            // The 3756th is used (as a header) to store information.
-            int[]
-            buffer = new int[3];
-            byte[] intBuffer = new byte[4];
-            string tmpword;
-            //using (RandomAccessFile dctFile = new 
RandomAccessFile(dctFilePath, "r"))
+            // Position of special header entry in the file structure
+            const int HEADER_POSITION = 3755;
+            // Maximum valid length for word entries to prevent loading 
corrupted data
+            const int MAX_VALID_LENGTH = 1000;
+
+            // Open file for reading in binary mode
             using var dctFile = new FileStream(dctFilePath, FileMode.Open, 
FileAccess.Read);
+            using var reader = new BinaryReader(dctFile);
 
-            // GB2312 characters 0 - 6768
-            for (i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
+            try
             {
-                string currentStr = GetCCByGB2312Id(i);
-                // if (i == 5231)
-                // System.out.println(i);
-
-                dctFile.Read(intBuffer, 0, intBuffer.Length);
-                // the dictionary was developed for C, and byte order must be 
converted to work with Java
-                cnt = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian).GetInt32();
-                if (cnt <= 0)
-                {
-                    continue;
-                }
-                total += cnt;
-                int j = 0;
-                while (j < cnt)
+                // Iterate through all GB2312 characters in the valid range
+                for (int i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
                 {
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    buffer[0] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                        .GetInt32();// frequency
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    buffer[1] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                        .GetInt32();// length
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    // buffer[2] = ByteBuffer.wrap(intBuffer).order(
-                    // ByteOrder.LITTLE_ENDIAN).getInt();// handle
-
-                    length = buffer[1];
-                    if (length > 0)
+                    // Get the current Chinese character
+                    string currentStr = GetCCByGB2312Id(i);
+                    // Read the count of words starting with this character
+                    int cnt = reader.ReadInt32();

Review Comment:
   Add a comment at the end of the line: `// LUCENENET: Use BinaryReader 
methods instead of ByteBuffer`
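
   For context on why this substitution preserves behavior: the removed code 
wrapped each 4-byte buffer with `ByteOrder.LittleEndian`, and 
`BinaryReader.ReadInt32()` also reads little-endian. The sketch below checks 
that equivalence against `BinaryPrimitives.ReadInt32LittleEndian`. 
`LittleEndianDemo` is a hypothetical name used only for illustration.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

// Hypothetical demo type, not part of Lucene.NET.
public static class LittleEndianDemo
{
    // Reads a 32-bit integer the way the refactored loader does.
    public static int ReadWithBinaryReader(byte[] data)
    {
        using var reader = new BinaryReader(new MemoryStream(data));
        return reader.ReadInt32();
    }

    // Reads the same four bytes with an explicitly little-endian API.
    public static int ReadExplicitLittleEndian(byte[] data)
    {
        return BinaryPrimitives.ReadInt32LittleEndian(data);
    }
}
```

For the bytes `{ 0x78, 0x56, 0x34, 0x12 }`, both methods return `0x12345678`.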



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs:
##########
@@ -254,80 +254,84 @@ private void Load(string dictRoot)
         /// <summary>
         /// Load the datafile into this <see cref="BigramDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">dctFilePath path to the Bigramdictionary (bigramdict.dct)</param>
+        /// <param name="dctFilePath">Path to the Bigramdictionary (bigramDict.dct)</param>

Review Comment:
   In the upstream Java code, the filename is all lowercase. On operating 
systems where the filesystem is case-sensitive, this could matter. We should 
revert the change to the casing of `bigramdict.dct` so that it matches upstream.
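
   A small probe that illustrates the concern, as a sketch only. It writes a 
file named `bigramdict.dct` into a fresh temporary directory and then looks it 
up under both spellings. `FileNameCasingDemo` is a hypothetical name; the 
result of the mixed-case lookup depends on the file system the code runs on, 
so no output is asserted for it.

```csharp
using System;
using System.IO;

// Hypothetical demo type, not part of Lucene.NET.
public static class FileNameCasingDemo
{
    // Creates "bigramdict.dct" in a temporary directory and reports whether
    // it can be found under the lowercase and the mixed-case spelling.
    public static (bool LowerExists, bool MixedExists) Probe()
    {
        string dir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(dir);
        try
        {
            File.WriteAllBytes(Path.Combine(dir, "bigramdict.dct"), Array.Empty<byte>());
            bool lower = File.Exists(Path.Combine(dir, "bigramdict.dct"));
            // On a case-sensitive file system this lookup does not match the file.
            bool mixed = File.Exists(Path.Combine(dir, "bigramDict.dct"));
            return (lower, mixed);
        }
        finally
        {
            Directory.Delete(dir, recursive: true);
        }
    }
}
```

`LowerExists` is true everywhere, while `MixedExists` varies by platform, which 
is why matching the upstream lowercase name is the safer choice.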



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs:
##########
@@ -254,80 +254,84 @@ private void Load(string dictRoot)
         /// <summary>
         /// Load the datafile into this <see cref="BigramDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">dctFilePath path to the Bigramdictionary 
(bigramdict.dct)</param>
+        /// <param name="dctFilePath">Path to the Bigramdictionary 
(bigramDict.dct)</param>
         /// <exception cref="IOException">If there is a low-level I/O 
error</exception>
         public virtual void LoadFromFile(string dctFilePath)
         {
-            int i, cnt, length, total = 0;
-            // The file only counted 6763 Chinese characters plus 5 reserved 
slots 3756~3760.
-            // The 3756th is used (as a header) to store information.
-            int[]
-            buffer = new int[3];
-            byte[] intBuffer = new byte[4];
-            string tmpword;
-            //using (RandomAccessFile dctFile = new 
RandomAccessFile(dctFilePath, "r"))
+            // Position of special header entry in the file structure
+            const int HEADER_POSITION = 3755;
+            // Maximum valid length for word entries to prevent loading 
corrupted data
+            const int MAX_VALID_LENGTH = 1000;
+
+            // Open file for reading in binary mode
             using var dctFile = new FileStream(dctFilePath, FileMode.Open, 
FileAccess.Read);
+            using var reader = new BinaryReader(dctFile);
 
-            // GB2312 characters 0 - 6768
-            for (i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
+            try
             {
-                string currentStr = GetCCByGB2312Id(i);
-                // if (i == 5231)
-                // System.out.println(i);
-
-                dctFile.Read(intBuffer, 0, intBuffer.Length);
-                // the dictionary was developed for C, and byte order must be 
converted to work with Java
-                cnt = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian).GetInt32();
-                if (cnt <= 0)
-                {
-                    continue;
-                }
-                total += cnt;
-                int j = 0;
-                while (j < cnt)
+                // Iterate through all GB2312 characters in the valid range
+                for (int i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
                 {
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    buffer[0] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                        .GetInt32();// frequency
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    buffer[1] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                        .GetInt32();// length
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    // buffer[2] = ByteBuffer.wrap(intBuffer).order(
-                    // ByteOrder.LITTLE_ENDIAN).getInt();// handle
-
-                    length = buffer[1];
-                    if (length > 0)
+                    // Get the current Chinese character

Review Comment:
   These comments are a bit excessive and do not exist upstream. I would 
suggest removing these explanatory comments.



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs:
##########
@@ -254,80 +254,84 @@ private void Load(string dictRoot)
         /// <summary>
         /// Load the datafile into this <see cref="BigramDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">dctFilePath path to the Bigramdictionary 
(bigramdict.dct)</param>
+        /// <param name="dctFilePath">Path to the Bigramdictionary 
(bigramDict.dct)</param>
         /// <exception cref="IOException">If there is a low-level I/O 
error</exception>
         public virtual void LoadFromFile(string dctFilePath)
         {
-            int i, cnt, length, total = 0;
-            // The file only counted 6763 Chinese characters plus 5 reserved 
slots 3756~3760.
-            // The 3756th is used (as a header) to store information.
-            int[]
-            buffer = new int[3];
-            byte[] intBuffer = new byte[4];
-            string tmpword;
-            //using (RandomAccessFile dctFile = new 
RandomAccessFile(dctFilePath, "r"))
+            // Position of special header entry in the file structure
+            const int HEADER_POSITION = 3755;
+            // Maximum valid length for word entries to prevent loading 
corrupted data
+            const int MAX_VALID_LENGTH = 1000;
+
+            // Open file for reading in binary mode
             using var dctFile = new FileStream(dctFilePath, FileMode.Open, 
FileAccess.Read);
+            using var reader = new BinaryReader(dctFile);
 
-            // GB2312 characters 0 - 6768
-            for (i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
+            try
             {
-                string currentStr = GetCCByGB2312Id(i);
-                // if (i == 5231)
-                // System.out.println(i);
-
-                dctFile.Read(intBuffer, 0, intBuffer.Length);
-                // the dictionary was developed for C, and byte order must be 
converted to work with Java
-                cnt = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian).GetInt32();
-                if (cnt <= 0)
-                {
-                    continue;
-                }
-                total += cnt;
-                int j = 0;
-                while (j < cnt)
+                // Iterate through all GB2312 characters in the valid range
+                for (int i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
                 {
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    buffer[0] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                        .GetInt32();// frequency
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    buffer[1] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                        .GetInt32();// length
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    // buffer[2] = ByteBuffer.wrap(intBuffer).order(
-                    // ByteOrder.LITTLE_ENDIAN).getInt();// handle
-
-                    length = buffer[1];
-                    if (length > 0)
+                    // Get the current Chinese character
+                    string currentStr = GetCCByGB2312Id(i);
+                    // Read the count of words starting with this character
+                    int cnt = reader.ReadInt32();
+
+                    // Skip if no words start with this character
+                    if (cnt <= 0) continue;

Review Comment:
   Please wrap the `continue` in curly braces, as it was previously.



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/WordDictionary.cs:
##########
@@ -340,80 +340,70 @@ private void SaveToObj(FileInfo serialObj)
         /// <summary>
         /// Load the datafile into this <see cref="WordDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">path to word dictionary 
(coredict.dct)</param>
+        /// <param name="dctFilePath">path to word dictionary 
(coreDict.dct)</param>
         /// <returns>number of words read</returns>
         /// <exception cref="IOException">If there is a low-level I/O 
error.</exception>
         private int LoadMainDataFromFile(string dctFilePath)
         {
-            int i, cnt, length, total = 0;
-            // The file only counted 6763 Chinese characters plus 5 reserved 
slots 3756~3760.
-            // The 3756th is used (as a header) to store information.
-            int[]
-            buffer = new int[3];
-            byte[] intBuffer = new byte[4];
-            string tmpword;
+            // Counter for total number of words loaded
+            int total = 0;
+
+            // Open the dictionary file for binary reading
             using (var dctFile = new FileStream(dctFilePath, FileMode.Open, 
FileAccess.Read))
+            using (var reader = new BinaryReader(dctFile))
             {
-
-                // GB2312 characters 0 - 6768
-                for (i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
+                // Process each Chinese character in the GB2312 encoding range
+                for (int i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
                 {
-                    // if (i == 5231)
-                    // System.out.println(i);
+                    // Read number of words starting with this character
+                    int cnt = reader.ReadInt32();

Review Comment:
   Add `// LUCENENET: Use BinaryReader methods instead of ByteBuffer`



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs:
##########
@@ -254,80 +254,84 @@ private void Load(string dictRoot)
         /// <summary>
         /// Load the datafile into this <see cref="BigramDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">dctFilePath path to the Bigramdictionary 
(bigramdict.dct)</param>
+        /// <param name="dctFilePath">Path to the Bigramdictionary 
(bigramDict.dct)</param>
         /// <exception cref="IOException">If there is a low-level I/O 
error</exception>
         public virtual void LoadFromFile(string dctFilePath)
         {
-            int i, cnt, length, total = 0;
-            // The file only counted 6763 Chinese characters plus 5 reserved 
slots 3756~3760.
-            // The 3756th is used (as a header) to store information.
-            int[]
-            buffer = new int[3];
-            byte[] intBuffer = new byte[4];
-            string tmpword;
-            //using (RandomAccessFile dctFile = new 
RandomAccessFile(dctFilePath, "r"))
+            // Position of special header entry in the file structure
+            const int HEADER_POSITION = 3755;
+            // Maximum valid length for word entries to prevent loading 
corrupted data
+            const int MAX_VALID_LENGTH = 1000;
+
+            // Open file for reading in binary mode
             using var dctFile = new FileStream(dctFilePath, FileMode.Open, 
FileAccess.Read);
+            using var reader = new BinaryReader(dctFile);
 
-            // GB2312 characters 0 - 6768
-            for (i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
+            try
             {
-                string currentStr = GetCCByGB2312Id(i);
-                // if (i == 5231)
-                // System.out.println(i);
-
-                dctFile.Read(intBuffer, 0, intBuffer.Length);
-                // the dictionary was developed for C, and byte order must be 
converted to work with Java
-                cnt = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian).GetInt32();
-                if (cnt <= 0)
-                {
-                    continue;
-                }
-                total += cnt;
-                int j = 0;
-                while (j < cnt)
+                // Iterate through all GB2312 characters in the valid range
+                for (int i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
                 {
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    buffer[0] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                        .GetInt32();// frequency
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    buffer[1] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                        .GetInt32();// length
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    // buffer[2] = ByteBuffer.wrap(intBuffer).order(
-                    // ByteOrder.LITTLE_ENDIAN).getInt();// handle
-
-                    length = buffer[1];
-                    if (length > 0)
+                    // Get the current Chinese character
+                    string currentStr = GetCCByGB2312Id(i);
+                    // Read the count of words starting with this character
+                    int cnt = reader.ReadInt32();
+
+                    // Skip if no words start with this character
+                    if (cnt <= 0) continue;
+
+                    // Process all words for the current character
+                    for (int j = 0; j < cnt; j++)
                     {
-                        byte[] lchBuffer = new byte[length];
-                        dctFile.Read(lchBuffer, 0, lchBuffer.Length);
-                        //tmpword = new String(lchBuffer, "GB2312");
-                        tmpword = gb2312Encoding.GetString(lchBuffer); // 
LUCENENET specific: use cached encoding instance from base class
-                        //tmpword = 
Encoding.GetEncoding("hz-gb-2312").GetString(lchBuffer);
-                        if (i != 3755 + GB2312_FIRST_CHAR)
-                        {
-                            tmpword = currentStr + tmpword;
-                        }
-                        char[] carray = tmpword.ToCharArray();
-                        long hashId = Hash1(carray);
-                        int index = GetAvaliableIndex(hashId, carray);
-                        if (index != -1)
+                        // Read word metadata

Review Comment:
   Please remove explanatory comments (except for "Skip handle value (unused)" 
- that one is helpful) and add `// LUCENENET: Use BinaryReader methods instead 
of ByteBuffer` before these lines
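
   One possible shape for the result, as a sketch rather than the final code. 
It assumes the record layout visible in the diff (frequency, length, handle, 
then `length` bytes of word data) and keeps only the two comments requested 
above. `WordRecordDemo`, `ReadRecord`, and `BuildSample` are hypothetical names, 
and the sample bytes are arbitrary.

```csharp
using System;
using System.IO;

// Hypothetical demo type, not part of Lucene.NET.
public static class WordRecordDemo
{
    // Reads one word record: frequency, length, handle, then the word bytes.
    public static (int Frequency, byte[] WordBytes) ReadRecord(BinaryReader reader)
    {
        // LUCENENET: Use BinaryReader methods instead of ByteBuffer
        int frequency = reader.ReadInt32();
        int length = reader.ReadInt32();
        reader.ReadInt32(); // Skip handle value (unused)

        byte[] wordBytes = length > 0 ? reader.ReadBytes(length) : Array.Empty<byte>();
        return (frequency, wordBytes);
    }

    // Builds a single in-memory record so the reader can be exercised
    // without a real dictionary file.
    public static byte[] BuildSample()
    {
        using var stream = new MemoryStream();
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(42);                       // frequency
            writer.Write(2);                        // length in bytes
            writer.Write(0);                        // handle (ignored by the reader)
            writer.Write(new byte[] { 0xB0, 0xA1 }); // word bytes
        }
        return stream.ToArray();
    }
}
```

Decoding the word bytes is left out on purpose: the real loader uses the cached 
`gb2312Encoding` instance, which this standalone sketch does not have.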



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/WordDictionary.cs:
##########
@@ -340,80 +340,70 @@ private void SaveToObj(FileInfo serialObj)
         /// <summary>
         /// Load the datafile into this <see cref="WordDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">path to word dictionary 
(coredict.dct)</param>
+        /// <param name="dctFilePath">path to word dictionary 
(coreDict.dct)</param>
         /// <returns>number of words read</returns>
         /// <exception cref="IOException">If there is a low-level I/O 
error.</exception>
         private int LoadMainDataFromFile(string dctFilePath)
         {
-            int i, cnt, length, total = 0;
-            // The file only counted 6763 Chinese characters plus 5 reserved 
slots 3756~3760.
-            // The 3756th is used (as a header) to store information.
-            int[]
-            buffer = new int[3];
-            byte[] intBuffer = new byte[4];
-            string tmpword;
+            // Counter for total number of words loaded
+            int total = 0;
+
+            // Open the dictionary file for binary reading
             using (var dctFile = new FileStream(dctFilePath, FileMode.Open, 
FileAccess.Read))
+            using (var reader = new BinaryReader(dctFile))

Review Comment:
   Add `// LUCENENET: ` comment explaining why we're using BinaryReader here.



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs:
##########
@@ -254,80 +254,84 @@ private void Load(string dictRoot)
         /// <summary>
         /// Load the datafile into this <see cref="BigramDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">dctFilePath path to the Bigramdictionary 
(bigramdict.dct)</param>
+        /// <param name="dctFilePath">Path to the Bigramdictionary 
(bigramDict.dct)</param>
         /// <exception cref="IOException">If there is a low-level I/O 
error</exception>
         public virtual void LoadFromFile(string dctFilePath)
         {
-            int i, cnt, length, total = 0;
-            // The file only counted 6763 Chinese characters plus 5 reserved 
slots 3756~3760.
-            // The 3756th is used (as a header) to store information.
-            int[]
-            buffer = new int[3];
-            byte[] intBuffer = new byte[4];
-            string tmpword;
-            //using (RandomAccessFile dctFile = new 
RandomAccessFile(dctFilePath, "r"))
+            // Position of special header entry in the file structure
+            const int HEADER_POSITION = 3755;
+            // Maximum valid length for word entries to prevent loading 
corrupted data
+            const int MAX_VALID_LENGTH = 1000;
+
+            // Open file for reading in binary mode
             using var dctFile = new FileStream(dctFilePath, FileMode.Open, 
FileAccess.Read);
+            using var reader = new BinaryReader(dctFile);
 
-            // GB2312 characters 0 - 6768
-            for (i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
+            try
             {
-                string currentStr = GetCCByGB2312Id(i);
-                // if (i == 5231)
-                // System.out.println(i);
-
-                dctFile.Read(intBuffer, 0, intBuffer.Length);
-                // the dictionary was developed for C, and byte order must be 
converted to work with Java
-                cnt = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian).GetInt32();
-                if (cnt <= 0)
-                {
-                    continue;
-                }
-                total += cnt;
-                int j = 0;
-                while (j < cnt)
+                // Iterate through all GB2312 characters in the valid range
+                for (int i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
                 {
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    buffer[0] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                        .GetInt32();// frequency
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    buffer[1] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                        .GetInt32();// length
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    // buffer[2] = ByteBuffer.wrap(intBuffer).order(
-                    // ByteOrder.LITTLE_ENDIAN).getInt();// handle
-
-                    length = buffer[1];
-                    if (length > 0)
+                    // Get the current Chinese character
+                    string currentStr = GetCCByGB2312Id(i);
+                    // Read the count of words starting with this character
+                    int cnt = reader.ReadInt32();
+
+                    // Skip if no words start with this character
+                    if (cnt <= 0) continue;
+
+                    // Process all words for the current character
+                    for (int j = 0; j < cnt; j++)
                     {
-                        byte[] lchBuffer = new byte[length];
-                        dctFile.Read(lchBuffer, 0, lchBuffer.Length);
-                        //tmpword = new String(lchBuffer, "GB2312");

Review Comment:
   Please restore removed comments



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs:
##########
@@ -254,80 +254,84 @@ private void Load(string dictRoot)
         /// <summary>
         /// Load the datafile into this <see cref="BigramDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">dctFilePath path to the Bigramdictionary 
(bigramdict.dct)</param>
+        /// <param name="dctFilePath">Path to the Bigramdictionary 
(bigramDict.dct)</param>
         /// <exception cref="IOException">If there is a low-level I/O 
error</exception>
         public virtual void LoadFromFile(string dctFilePath)
         {
-            int i, cnt, length, total = 0;
-            // The file only counted 6763 Chinese characters plus 5 reserved 
slots 3756~3760.
-            // The 3756th is used (as a header) to store information.
-            int[]
-            buffer = new int[3];
-            byte[] intBuffer = new byte[4];
-            string tmpword;
-            //using (RandomAccessFile dctFile = new 
RandomAccessFile(dctFilePath, "r"))
+            // Position of special header entry in the file structure
+            const int HEADER_POSITION = 3755;
+            // Maximum valid length for word entries to prevent loading 
corrupted data
+            const int MAX_VALID_LENGTH = 1000;
+
+            // Open file for reading in binary mode
             using var dctFile = new FileStream(dctFilePath, FileMode.Open, 
FileAccess.Read);
+            using var reader = new BinaryReader(dctFile);
 
-            // GB2312 characters 0 - 6768
-            for (i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
+            try
             {
-                string currentStr = GetCCByGB2312Id(i);
-                // if (i == 5231)
-                // System.out.println(i);
-
-                dctFile.Read(intBuffer, 0, intBuffer.Length);
-                // the dictionary was developed for C, and byte order must be 
converted to work with Java
-                cnt = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian).GetInt32();
-                if (cnt <= 0)
-                {
-                    continue;
-                }
-                total += cnt;
-                int j = 0;
-                while (j < cnt)
+                // Iterate through all GB2312 characters in the valid range
+                for (int i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
                 {
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    buffer[0] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                        .GetInt32();// frequency
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    buffer[1] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                        .GetInt32();// length
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    // buffer[2] = ByteBuffer.wrap(intBuffer).order(
-                    // ByteOrder.LITTLE_ENDIAN).getInt();// handle
-
-                    length = buffer[1];
-                    if (length > 0)
+                    // Get the current Chinese character
+                    string currentStr = GetCCByGB2312Id(i);
+                    // Read the count of words starting with this character
+                    int cnt = reader.ReadInt32();
+
+                    // Skip if no words start with this character
+                    if (cnt <= 0) continue;
+
+                    // Process all words for the current character
+                    for (int j = 0; j < cnt; j++)
                     {
-                        byte[] lchBuffer = new byte[length];
-                        dctFile.Read(lchBuffer, 0, lchBuffer.Length);
-                        //tmpword = new String(lchBuffer, "GB2312");
-                        tmpword = gb2312Encoding.GetString(lchBuffer); // 
LUCENENET specific: use cached encoding instance from base class
-                        //tmpword = 
Encoding.GetEncoding("hz-gb-2312").GetString(lchBuffer);
-                        if (i != 3755 + GB2312_FIRST_CHAR)
-                        {
-                            tmpword = currentStr + tmpword;
-                        }
-                        char[] carray = tmpword.ToCharArray();
-                        long hashId = Hash1(carray);
-                        int index = GetAvaliableIndex(hashId, carray);
-                        if (index != -1)
+                        // Read word metadata
+                        int frequency = reader.ReadInt32();  // How often this 
word appears
+                        int length = reader.ReadInt32();     // Length of the 
word in bytes
+                        reader.ReadInt32();                  // Skip handle 
value (unused)
+
+                        // Validate word length and ensure we don't read past 
the file end
+                        if (length > 0 && length <= MAX_VALID_LENGTH && 
dctFile.Position + length <= dctFile.Length)
                         {
-                            if (bigramHashTable[index] == 0)
+                            // Read the word bytes and convert to string
+                            byte[] lchBuffer = reader.ReadBytes(length);

Review Comment:
   Add `// LUCENENET: Use BinaryReader methods instead of ByteBuffer`



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/WordDictionary.cs:
##########
@@ -340,80 +340,70 @@ private void SaveToObj(FileInfo serialObj)
         /// <summary>
         /// Load the datafile into this <see cref="WordDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">path to word dictionary 
(coredict.dct)</param>
+        /// <param name="dctFilePath">path to word dictionary 
(coreDict.dct)</param>
         /// <returns>number of words read</returns>
         /// <exception cref="IOException">If there is a low-level I/O 
error.</exception>
         private int LoadMainDataFromFile(string dctFilePath)
         {
-            int i, cnt, length, total = 0;
-            // The file only counted 6763 Chinese characters plus 5 reserved 
slots 3756~3760.
-            // The 3756th is used (as a header) to store information.
-            int[]
-            buffer = new int[3];
-            byte[] intBuffer = new byte[4];
-            string tmpword;
+            // Counter for total number of words loaded
+            int total = 0;
+
+            // Open the dictionary file for binary reading
             using (var dctFile = new FileStream(dctFilePath, FileMode.Open, 
FileAccess.Read))
+            using (var reader = new BinaryReader(dctFile))
             {
-
-                // GB2312 characters 0 - 6768
-                for (i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
+                // Process each Chinese character in the GB2312 encoding range
+                for (int i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
                 {
-                    // if (i == 5231)
-                    // System.out.println(i);
+                    // Read number of words starting with this character
+                    int cnt = reader.ReadInt32();
 
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    // the dictionary was developed for C, and byte order must 
be converted to work with Java
-                    cnt = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian).GetInt32();
+                    // If no words start with this character, set arrays to 
null and skip
                     if (cnt <= 0)
                     {
                         wordItem_charArrayTable[i] = null;
                         wordItem_frequencyTable[i] = null;
                         continue;
                     }
+
+                    // Initialize arrays to store words and their frequencies
                     wordItem_charArrayTable[i] = new char[cnt][];
                     wordItem_frequencyTable[i] = new int[cnt];
                     total += cnt;
-                    int j = 0;
-                    while (j < cnt)
+
+                    // Process each word for the current character
+                    for (int j = 0; j < cnt; j++)
                     {
-                        // wordItemTable[i][j] = new WordItem();
-                        dctFile.Read(intBuffer, 0, intBuffer.Length);
-                        buffer[0] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                            .GetInt32();// frequency
-                        dctFile.Read(intBuffer, 0, intBuffer.Length);
-                        buffer[1] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                            .GetInt32();// length
-                        dctFile.Read(intBuffer, 0, intBuffer.Length);
-                        buffer[2] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                            .GetInt32();// handle
-
-                        // wordItemTable[i][j].frequency = buffer[0];
-                        wordItem_frequencyTable[i][j] = buffer[0];
-
-                        length = buffer[1];
+                        // Read word metadata

Review Comment:
   Same as in BigramDictionary, remove extra comments and add `// LUCENENET: 
Use BinaryReader methods instead of ByteBuffer`



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs:
##########
@@ -254,80 +254,84 @@ private void Load(string dictRoot)
         /// <summary>
         /// Load the datafile into this <see cref="BigramDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">dctFilePath path to the Bigramdictionary 
(bigramdict.dct)</param>
+        /// <param name="dctFilePath">Path to the Bigramdictionary 
(bigramDict.dct)</param>
         /// <exception cref="IOException">If there is a low-level I/O 
error</exception>
         public virtual void LoadFromFile(string dctFilePath)
         {
-            int i, cnt, length, total = 0;
-            // The file only counted 6763 Chinese characters plus 5 reserved 
slots 3756~3760.
-            // The 3756th is used (as a header) to store information.
-            int[]
-            buffer = new int[3];
-            byte[] intBuffer = new byte[4];
-            string tmpword;
-            //using (RandomAccessFile dctFile = new 
RandomAccessFile(dctFilePath, "r"))
+            // Position of special header entry in the file structure
+            const int HEADER_POSITION = 3755;
+            // Maximum valid length for word entries to prevent loading 
corrupted data
+            const int MAX_VALID_LENGTH = 1000;
+
+            // Open file for reading in binary mode
             using var dctFile = new FileStream(dctFilePath, FileMode.Open, 
FileAccess.Read);
+            using var reader = new BinaryReader(dctFile);
 
-            // GB2312 characters 0 - 6768
-            for (i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
+            try
             {
-                string currentStr = GetCCByGB2312Id(i);
-                // if (i == 5231)
-                // System.out.println(i);
-
-                dctFile.Read(intBuffer, 0, intBuffer.Length);
-                // the dictionary was developed for C, and byte order must be 
converted to work with Java
-                cnt = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian).GetInt32();
-                if (cnt <= 0)
-                {
-                    continue;
-                }
-                total += cnt;
-                int j = 0;
-                while (j < cnt)
+                // Iterate through all GB2312 characters in the valid range
+                for (int i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + 
CHAR_NUM_IN_FILE; i++)
                 {
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    buffer[0] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                        .GetInt32();// frequency
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    buffer[1] = 
ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian)
-                        .GetInt32();// length
-                    dctFile.Read(intBuffer, 0, intBuffer.Length);
-                    // buffer[2] = ByteBuffer.wrap(intBuffer).order(
-                    // ByteOrder.LITTLE_ENDIAN).getInt();// handle
-
-                    length = buffer[1];
-                    if (length > 0)
+                    // Get the current Chinese character
+                    string currentStr = GetCCByGB2312Id(i);
+                    // Read the count of words starting with this character
+                    int cnt = reader.ReadInt32();
+
+                    // Skip if no words start with this character
+                    if (cnt <= 0) continue;
+
+                    // Process all words for the current character
+                    for (int j = 0; j < cnt; j++)
                     {
-                        byte[] lchBuffer = new byte[length];
-                        dctFile.Read(lchBuffer, 0, lchBuffer.Length);
-                        //tmpword = new String(lchBuffer, "GB2312");
-                        tmpword = gb2312Encoding.GetString(lchBuffer); // 
LUCENENET specific: use cached encoding instance from base class
-                        //tmpword = 
Encoding.GetEncoding("hz-gb-2312").GetString(lchBuffer);
-                        if (i != 3755 + GB2312_FIRST_CHAR)
-                        {
-                            tmpword = currentStr + tmpword;
-                        }
-                        char[] carray = tmpword.ToCharArray();
-                        long hashId = Hash1(carray);
-                        int index = GetAvaliableIndex(hashId, carray);
-                        if (index != -1)
+                        // Read word metadata
+                        int frequency = reader.ReadInt32();  // How often this 
word appears
+                        int length = reader.ReadInt32();     // Length of the 
word in bytes
+                        reader.ReadInt32();                  // Skip handle 
value (unused)
+
+                        // Validate word length and ensure we don't read past 
the file end
+                        if (length > 0 && length <= MAX_VALID_LENGTH && 
dctFile.Position + length <= dctFile.Length)
                         {
-                            if (bigramHashTable[index] == 0)
+                            // Read the word bytes and convert to string
+                            byte[] lchBuffer = reader.ReadBytes(length);
+                            string tmpword = 
gb2312Encoding.GetString(lchBuffer);

Review Comment:
   This line had a comment that should not be removed: ` // LUCENENET specific: 
use cached encoding instance from base class`



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/WordDictionary.cs:
##########
@@ -340,80 +340,70 @@ private void SaveToObj(FileInfo serialObj)
         /// <summary>
         /// Load the datafile into this <see cref="WordDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">path to word dictionary 
(coredict.dct)</param>
+        /// <param name="dctFilePath">path to word dictionary 
(coreDict.dct)</param>
         /// <returns>number of words read</returns>
         /// <exception cref="IOException">If there is a low-level I/O 
error.</exception>
         private int LoadMainDataFromFile(string dctFilePath)
         {
-            int i, cnt, length, total = 0;

Review Comment:
   Same as in BigramDictionary, restore comments and explain which fields were 
removed and why.



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs:
##########
@@ -254,80 +254,84 @@ private void Load(string dictRoot)
         /// <summary>
         /// Load the datafile into this <see cref="BigramDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">dctFilePath path to the Bigramdictionary 
(bigramdict.dct)</param>
+        /// <param name="dctFilePath">Path to the Bigramdictionary 
(bigramDict.dct)</param>
         /// <exception cref="IOException">If there is a low-level I/O 
error</exception>
         public virtual void LoadFromFile(string dctFilePath)
         {
-            int i, cnt, length, total = 0;
-            // The file only counted 6763 Chinese characters plus 5 reserved 
slots 3756~3760.

Review Comment:
   Please do not remove existing comments, as they help us track changes over 
time and compare to upstream comments. Also please add a `// LUCENENET: ` 
comment explaining which variables were removed and why.



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/WordDictionary.cs:
##########
@@ -340,80 +340,70 @@ private void SaveToObj(FileInfo serialObj)
         /// <summary>
         /// Load the datafile into this <see cref="WordDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">path to word dictionary 
(coredict.dct)</param>
+        /// <param name="dctFilePath">path to word dictionary 
(coreDict.dct)</param>
         /// <returns>number of words read</returns>
         /// <exception cref="IOException">If there is a low-level I/O 
error.</exception>
         private int LoadMainDataFromFile(string dctFilePath)
         {
-            int i, cnt, length, total = 0;
-            // The file only counted 6763 Chinese characters plus 5 reserved 
slots 3756~3760.
-            // The 3756th is used (as a header) to store information.
-            int[]
-            buffer = new int[3];
-            byte[] intBuffer = new byte[4];
-            string tmpword;
+            // Counter for total number of words loaded
+            int total = 0;
+
+            // Open the dictionary file for binary reading
             using (var dctFile = new FileStream(dctFilePath, FileMode.Open, 
FileAccess.Read))
+            using (var reader = new BinaryReader(dctFile))
             {
-
-                // GB2312 characters 0 - 6768

Review Comment:
   Please restore original comments



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
