I recently downloaded the latest 4.5 from GitHub
https://github.com/apache/lucenenet/ and started playing around with Lucene.
When I ran some of the tests I noticed a weird behavior with the
RandomlyRecaseCodePoints method on the TestUtil class (“TestUtil.cs”).
The test seems to generate random text, and sometimes I got weird behavior with
some special strings that may be invalid.
The error seems to be on these lines:
    case 0:
        builder.Append(char.ToUpper((char)codePoint));
        break;
    case 1:
        builder.Append(char.ToLower((char)codePoint));
        break;
    case 2: // leave intact
        builder.Append((char)codePoint);
        break;
The (char)codePoint cast seems to truncate the integer code point to 16 bits,
so for any code point above U+FFFF you get the wrong character back, and the
test fails because the length of the text is no longer the same.
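To make the truncation concrete, here is a minimal sketch. The class and method names are hypothetical and not part of Lucene.NET, and U+10400 is just an arbitrary code point above U+FFFF chosen for illustration.

```csharp
using System;

public static class TruncationDemo
{
    // Mirrors the cast in the lines quoted above. "unchecked" makes the
    // 16-bit truncation explicit even if the project compiles with /checked.
    public static string AppendWithCast(int codePoint)
    {
        return unchecked((char)codePoint).ToString();
    }

    // Converts the code point to a UTF-16 string (a surrogate pair when
    // the code point is above U+FFFF).
    public static string AppendAsCodePoint(int codePoint)
    {
        return char.ConvertFromUtf32(codePoint);
    }

    public static void Main()
    {
        int codePoint = 0x10400; // illustrative value above U+FFFF

        string truncated = AppendWithCast(codePoint);
        string full = AppendAsCodePoint(codePoint);

        // prints "cast: U+0400, length 1"
        Console.WriteLine("cast: U+{0:X4}, length {1}", (int)truncated[0], truncated.Length);
        // prints "full: U+10400, length 2"
        Console.WriteLine("full: U+{0:X}, length {1}", char.ConvertToUtf32(full, 0), full.Length);
    }
}
```

The cast silently yields a different character (U+0400) and a shorter string, which would be consistent with the length mismatch reported by the test.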
I don’t get this behavior when I run the same text with the Java version of
Lucene (RandomlyRecaseCodePoints).
I made a quick fix, and this code seems to solve the problem, but I haven’t
tested it completely.
    var stringValue = char.ConvertFromUtf32(codePoint);
    switch (NextInt(random, 0, 2))
    {
        case 0:
            var value0 = stringValue.ToUpper();
            builder.Append(value0);
            break;
        case 1:
            var value1 = stringValue.ToUpper().ToLower();
            builder.Append(value1);
            break;
        case 2: // leave intact
            builder.Append(stringValue);
            break;
    }
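For reference, a self-contained sketch of the same idea, including the loop that walks the input one code point at a time. This is not the Lucene.NET implementation: the names RecaseSketch, RecaseCodePoint and Recase are hypothetical, System.Random stands in for the test framework's NextInt helper, and the culture-invariant casing methods are my assumption rather than what the quick fix above uses.

```csharp
using System;
using System.Text;

public static class RecaseSketch
{
    // choice: 0 = upper-case, 1 = lower-case, anything else = leave intact.
    public static string RecaseCodePoint(int codePoint, int choice)
    {
        // Yields one char, or a surrogate pair for code points above U+FFFF.
        string s = char.ConvertFromUtf32(codePoint);
        switch (choice)
        {
            case 0: return s.ToUpperInvariant();
            case 1: return s.ToLowerInvariant();
            default: return s;
        }
    }

    public static string Recase(string input, Random random)
    {
        var builder = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; )
        {
            // Note: ConvertToUtf32 throws ArgumentException on a lone
            // surrogate, so this sketch assumes well-formed UTF-16 input.
            int codePoint = char.ConvertToUtf32(input, i);
            builder.Append(RecaseCodePoint(codePoint, random.Next(3)));
            // Advance by two chars over a surrogate pair, otherwise by one.
            i += char.IsSurrogatePair(input, i) ? 2 : 1;
        }
        return builder.ToString();
    }
}
```

Because each code point is handled as a whole string, nothing above U+FFFF is cut down to a single char before it is appended.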
The text I got when running the test was hex F2 BA 81 B2 20.
I made a bin file and added those hex bytes with a hex editor; that was the
only way to repeatably test the same “incorrect” string.
(I attached the file I used to this mail, “failedString.bin”.)
Then I read the text with File.ReadAllText in LINQPad and tested the
RandomlyRecaseCodePoints method with the string.
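The same input can probably be rebuilt without the attached file. The sketch below assumes the five bytes are UTF-8 (File.ReadAllText falls back to UTF-8 when no byte order mark is present); the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class FailingInput
{
    // Decodes the five bytes listed above, assuming they are UTF-8.
    public static string Build()
    {
        byte[] utf8 = { 0xF2, 0xBA, 0x81, 0xB2, 0x20 };
        return Encoding.UTF8.GetString(utf8);
    }

    public static void Main()
    {
        string text = Build();
        int codePoint = char.ConvertToUtf32(text, 0);
        char truncated = unchecked((char)codePoint);

        Console.WriteLine(text.Length);                  // prints 3 (surrogate pair + space)
        Console.WriteLine("U+{0:X}", codePoint);         // prints U+BA072
        Console.WriteLine("U+{0:X4}", (int)truncated);   // prints U+A072
    }
}
```

Under that assumption the first four bytes decode to the single code point U+BA072, which needs a surrogate pair in UTF-16, and the cast to char reduces it to U+A072.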
Has anyone else noticed this problem?
Juan Orellana
System developer
Gustavslundsvägen 12
+46 (0)8 566 229 942
NORDIC NETPRODUCTS AB
Box 14113, SE-167 14 Bromma
+46 (0)8 566 229 00
www.nordicnet.se | www.largestcompanies.se