GitHub user laimis opened a pull request:

    https://github.com/apache/lucenenet/pull/31

    different approach to recasing test strings

    This fixes a bunch of test failures with tests in Lucene.Net.Analysis that 
look something like this:
    
    Lengths are not the same: mixedUp = 䧢ൄ, length = 3, sb = 񤧢ൄ, 
length = 4
      Expected: True
      But was:  False
    
    Lengths are not the same: mixedUp = ۀ迥ȟ, length = 6, sb = 
ۀ񘿥Ȟ, length = 7
      Expected: True
      But was:  False
    
    The issue is with a code path that is invoked only in certain conditions 
from here: 
https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.TestFramework/Util/TestUtil.cs#L1415
    
    When debugging the tests, it became apparent that the chars were "dropped" 
in the re-casing function 
(https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.TestFramework/Util/TestUtil.cs#L707).
 If the randomly generated string contained Unicode surrogate pairs, the 
resulting string was shorter and did not correctly match the source string 
(besides the casing part).
    
    The drop happens because casts of the code point that represents a 
surrogate pairs to char (eg. (char)codePoint) is inaccurate, the value is too 
large for 16 bits. The conversion ends up changing the re-cased char with a 
completely different character. And then only one character gets appended to 
the destination string builder when the source used two chars (high and low 
surrogate), thus the resulting string is shorter by one or more chars 
(depending on how many surrogate pairs were present in the source string).
    
    I looked at the Java implementation 
(https://github.com/apache/lucene-solr/blob/lucene_solr_4_8_0/lucene/test-framework/src/java/org/apache/lucene/util/TestUtil.java#L543)
 and it makes sense why it works there. Java's Character toUpper / toLower 
handle code points appropriately, without casting them to chars. It then also 
uses StringBuilder's appendCodePoint and not append to make sure that code 
point for surrogate pairs results in two characters being appended.
    
    I considered two options for the implementation. One which contained 
extension method for StringBuilder for AppendCodePoint which would add two 
appropriate chars for surrogate pair code point but then still had issues with 
how to appropriately deal with to Lower / Upper methods for all the possible 
code points. Java's Character version has implementation for properly selecting 
unicode planes that the code point belongs to and each plane has an appropriate 
implementation of those methods. Such functionality does not appear to be 
exposed in .NET, at least publicly.
    
    So I settled on the implementation that you can see in this branch. 
Basically it detects if the next char pair is a surrogate pair and then takes 
them both for possible casing function and appending to the string builder. All 
tests in Analysis pass after this change.
    
    I renamed the function to appropriately reflect that code points are not 
used anymore.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/laimis/lucenenet recase_fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/lucenenet/pull/31.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #31
    
----
commit 6cfc4fe83ef1a1958e311b34fb86a159782ef2e5
Author: Laimonas Simutis <[email protected]>
Date:   2014-12-29T02:24:12Z

    different approach to recasing test strings

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

Reply via email to