CJK Tokenizer in NLS fails to stop at end of input buffer.
----------------------------------------------------------

         Key: LUCENENET-5
         URL: http://issues.apache.org/jira/browse/LUCENENET-5
     Project: Lucene.NET
        Type: Bug

 Environment: lucene.net.nls.1.3.2.2 on .NET 1.1 SP1
    Reporter: Ben Tregenna
    Priority: Minor


When using the CJKTokenizer from the National Language Support Pack to tokenize 
simple Japanese text, the tokenizer fails to indicate EOS (end of stream) correctly. 
Example code snippet (suitable for use as an NUnit test):

public void SimpleTokenization()
{
        TextReader tr = new StringReader("日本国");
        CJKTokenizer tokenizer = new CJKTokenizer(tr);
        Assert.AreEqual("日本", tokenizer.Next().TermText(), "First Token is correct");
        Assert.AreEqual("本国", tokenizer.Next().TermText(), "Second Token is correct");
        Assert.AreEqual(string.Empty, tokenizer.Next().TermText(), "Returns empty string as final token");
        Assert.IsNull(tokenizer.Next(), "Returns null after end of string");
}

The current code treats the final buffer as circular, so it returns "国日" as a 
third token and then keeps returning these three tokens cyclically. The problem 
comes from the condition that checks for EOS on the TextReader input. In Java, 
the buffer-filling Reader.read overloads return -1 at EOS, but in .NET 
TextReader.Read(char[], int, int) returns 0 at EOS (only the single-character 
Read() returns -1), so the terminating condition needs altering. 
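
To make the difference concrete, here is a minimal sketch that is not taken from 
CJKTokenizer.cs. The class EosDemo and the method BufferReadAtEos are 
hypothetical names used only for illustration. It drains a StringReader through 
the buffer overload and reports the value returned once no input is left, then 
calls the single-character overload for comparison.

```csharp
using System;
using System.IO;

// Hypothetical demo class; not part of Lucene.NET.
public sealed class EosDemo
{
    // Reads the reader to exhaustion through the buffer overload and returns
    // the value that Read(char[], int, int) reports once nothing is left.
    public static int BufferReadAtEos(TextReader reader)
    {
        char[] buffer = new char[16];
        int dataLen;
        do
        {
            dataLen = reader.Read(buffer, 0, buffer.Length);
        } while (dataLen > 0);
        return dataLen;
    }

    public static void Main()
    {
        TextReader tr = new StringReader("日本国");

        // Buffer overload: .NET signals end of input with 0.
        Console.WriteLine(BufferReadAtEos(tr)); // prints 0

        // Single-character overload: end of input is -1.
        Console.WriteLine(tr.Read());           // prints -1
    }
}
```

A check written as dataLen <= 0 would cover both conventions, though the 
one-line change in the diff is enough for the .NET port.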

The diff to fix is pretty trivial:
CJKTokenizer.cs: 162c162
<                               if (dataLen == -1)
---
>                               if (dataLen == 0)


As a final note to the unwary: the comment at the start of 
CJKTokenizer.Next() ("Returns the next token in the stream, or null at EOS.") 
seems to indicate that null will be returned immediately at EOS. However, I 
always get an empty token and then null, as indicated in the snippet above. The 
logic now seems to reflect the lucene-java logic exactly, so whether this is a 
bug, a feature or a poor method summary remains unclear to me.


