GitHub user laimis opened a pull request:
https://github.com/apache/lucenenet/pull/31
different approach to recasing test strings
This fixes a bunch of test failures with tests in Lucene.Net.Analysis that
look something like this:
Lengths are not the same: mixedUp = ä§¢àµ, length = 3, sb = ñ¤§¢àµ,
length = 4
Expected: True
But was: False
Lengths are not the same: mixedUp = Ûè¿¥îï¥È, length = 6, sb =
Ûñ¿¥îï¥È, length = 7
Expected: True
But was: False
The issue is with a code path that is invoked only in certain conditions
from here:
https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.TestFramework/Util/TestUtil.cs#L1415
When debugging the tests, it became apparent that the chars were "dropped"
in the re-casing function
(https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.TestFramework/Util/TestUtil.cs#L707).
If the randomly generated string contained Unicode surrogate pairs, the
resulting string was shorter and did not correctly match the source string
(besides the casing part).
The drop happens because casts of the code point that represents a
surrogate pairs to char (eg. (char)codePoint) is inaccurate, the value is too
large for 16 bits. The conversion ends up changing the re-cased char with a
completely different character. And then only one character gets appended to
the destination string builder when the source used two chars (high and low
surrogate), thus the resulting string is shorter by one or more chars
(depending on how many surrogate pairs were present in the source string).
I looked at the Java implementation
(https://github.com/apache/lucene-solr/blob/lucene_solr_4_8_0/lucene/test-framework/src/java/org/apache/lucene/util/TestUtil.java#L543)
and it makes sense why it works there. Java's Character toUpper / toLower
handle code points appropriately, without casting them to chars. It then also
uses StringBuilder's appendCodePoint and not append to make sure that code
point for surrogate pairs results in two characters being appended.
I considered two options for the implementation. One which contained
extension method for StringBuilder for AppendCodePoint which would add two
appropriate chars for surrogate pair code point but then still had issues with
how to appropriately deal with to Lower / Upper methods for all the possible
code points. Java's Character version has implementation for properly selecting
unicode planes that the code point belongs to and each plane has an appropriate
implementation of those methods. Such functionality does not appear to be
exposed in .NET, at least publicly.
So I settled on the implementation that you can see in this branch.
Basically it detects if the next char pair is a surrogate pair and then takes
them both for possible casing function and appending to the string builder. All
tests in Analysis pass after this change.
I renamed the function to appropriately reflect that code points are not
used anymore.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/laimis/lucenenet recase_fix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/lucenenet/pull/31.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #31
----
commit 6cfc4fe83ef1a1958e311b34fb86a159782ef2e5
Author: Laimonas Simutis <[email protected]>
Date: 2014-12-29T02:24:12Z
different approach to recasing test strings
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---