I don't think that a per-character call to char.ToLowerInvariant is identical to a call to string.ToLowerInvariant() on a string built from the character array to be converted. The two agree only when every char represents a complete code point.
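For instance (a sketch of my own; the code point U+10400, DESERET CAPITAL LETTER LONG I, is just a convenient example of a cased letter outside the BMP):

```csharp
using System;
using System.Linq;

// U+10400 (DESERET CAPITAL LETTER LONG I) lies outside the BMP, so .NET
// stores it as a surrogate pair: two chars encoding one code point.
string s = char.ConvertFromUtf32(0x10400);
Console.WriteLine(s.Length); // 2

// Per-char conversion sees two surrogate halves. Neither half has a
// lowercase mapping, so char.ToLowerInvariant returns each one unchanged.
string perChar = new string(s.Select(char.ToLowerInvariant).ToArray());
Console.WriteLine(perChar == s); // True: the conversion was silently skipped

// Whole-string conversion sees the full code point and can map it to
// U+10428 (its lowercase form) on runtimes with full Unicode case data.
string wholeString = s.ToLowerInvariant();
Console.WriteLine(wholeString == s); // False on such runtimes
```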
A single char corresponds to exactly one code point only for characters in the BMP (Basic Multilingual Plane). Otherwise, this code is invalid: a character may be the first half of a surrogate pair, in which case the character following it belongs to the same code point. Lowercasing character by character may convert such cases incorrectly. See
https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069(v=vs.85).aspx
and
https://msdn.microsoft.com/en-us/library/system.char.convertfromutf32(v=vs.110).aspx
for some background on the matter. .NET calls two distinct APIs for converting strings and characters: one is not implemented in terms of the other.

You indirectly bring up an excellent point: my code assumes that the string constructed from the char[] is in canonical composition. I think this is the case, but I'm not familiar enough with Lucene's handling of characters to be sure. Otherwise, one needs to insert Normalize(NormalizationForm.FormC) before the call to ToLowerInvariant(). But it may be overkill.

Vincent

-----Original Message-----
From: Oren Eini (Ayende Rahien) [mailto:[email protected]]
Sent: Wednesday, December 28, 2016 10:27 AM
To: [email protected]
Subject: Re: Proposal to speed up implementation of LowercaseFilter/charUtils.ToLower

Why not?

public void ToLower(char[] buffer, int offset, int limit)
{
    Debug.Assert(buffer.Length >= limit);
    Debug.Assert(offset >= 0 && offset <= buffer.Length);

    for (int i = offset; i < limit; i++)
    {
        buffer[i] = char.ToLowerInvariant(buffer[i]);
    }
}

This does zero allocations altogether.

Better yet, see:
https://github.com/ravendb/ravendb/blob/v4.0/src/Raven.Server/Documents/DocumentKeyWorker.cs#L35

for (var i = 0; i < key.Length; i++)
{
    var ch = key[i];
    if (ch > 127) // not ASCII, use slower mode
        goto UnlikelyUnicode;

    if ((ch >= 65) && (ch <= 90))
        buffer[i] = (byte)(ch | 0x20);
    else
        buffer[i] = (byte)ch;
}

For the common case of ASCII chars, this is much more efficient.
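For what it's worth, the two quoted snippets can be combined. Here is a sketch of my own (neither RavenDB's nor Lucene.NET's code) that lowercases ASCII in place with zero allocations and falls back to string.ToLowerInvariant() only from the first non-ASCII char onward, so surrogate pairs are still handled correctly:

```csharp
using System;

// Hypothetical hybrid: ASCII fast path, allocation only on the slow path.
// Follows Lucene's convention that `limit` is an end index (loop runs i < limit).
static void ToLowerHybrid(char[] buffer, int offset, int limit)
{
    for (int i = offset; i < limit; i++)
    {
        char ch = buffer[i];
        if (ch > 127)
        {
            // Non-ASCII: let the runtime lowercase the rest of the buffer,
            // surrogate pairs included.
            string lowered = new string(buffer, i, limit - i).ToLowerInvariant();
            lowered.CopyTo(0, buffer, i, Math.Min(lowered.Length, limit - i));
            return;
        }
        if (ch >= 'A' && ch <= 'Z')
            buffer[i] = (char)(ch | 0x20); // ASCII fast path, no allocation
    }
}

char[] ascii = "FAST PATH".ToCharArray();
ToLowerHybrid(ascii, 0, ascii.Length);
Console.WriteLine(new string(ascii)); // fast path

char[] mixed = "CAFÉ".ToCharArray();
ToLowerHybrid(mixed, 0, mixed.Length);
Console.WriteLine(new string(mixed)); // café
```

The Math.Min guard is there because, for a few code points, case conversion can change the string length; truncating is a simplification, not what a production filter should do.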
*Hibernating Rhinos Ltd*
Oren Eini l CEO l Mobile: +972-52-548-6969 Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

On Wed, Dec 28, 2016 at 10:46 AM, Van Den Berghe, Vincent <
[email protected]> wrote:

> Hi,
>
> I've been doing performance measurements using the latest Lucene.net,
> and profiling with the standard English analyzer (and all analyzers
> with a lowercase filter) indicates that a LOT of time is spent in the
> LowerCaseFilter.IncrementToken() method, doing this:
>
>     charUtils.ToLower(termAtt.Buffer(), 0, termAtt.Length);
>
> In my test cases, this dominates the execution time.
> The performance is horrible: inside charUtils.ToLower, for every code
> point in the buffer, a one-element int array and a new string holding
> the string representation of that code point are created; the string
> is then lowercased and converted back:
>
>     public static int ToLowerCase(int codePoint)
>     {
>         return Character.CodePointAt(UnicodeUtil.NewString(new int[1]
>         {
>             codePoint
>         }, 0, 1).ToLowerInvariant(), 0);
>     }
>
> This creates heap pressure (due to the huge number of temporary int[1]
> and string objects that fill up Gen0) and is highly inefficient
> because of the inner loops for which the C# compiler isn't able to
> eliminate the bounds checks.
> Yes, this is indeed what the Java code does, but in .NET the
> ToLowerInvariant method already takes care of correct Unicode
> code point parsing, so I think we can replace the charUtils.ToLower
> method with the following implementation:
>
>     public void ToLower(char[] buffer, int offset, int limit)
>     {
>         Debug.Assert(buffer.Length >= limit);
>         Debug.Assert(offset >= 0 && offset <= buffer.Length);
>
>         new string(buffer, offset, limit).ToLowerInvariant()
>             .CopyTo(0, buffer, offset, limit);
>     }
>
> This appears to do exactly the same thing, but much more efficiently.
> Internally, ToLowerInvariant ultimately delegates to a native call
> (COMNlsInfo::InternalChangeCaseString), which uses Windows's
> LCMapStringEx Win32 API and is orders of magnitude faster than
> anything we can write in managed code, even taking the P/Invoke
> overhead and call setup costs into account.
> After this change, the path through charUtils.ToLower no longer
> dominates the execution time.
>
> Just sayin' <g>
>
> Vincent Van Den Berghe
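As a side note on the quoted proposal: here is a self-contained sketch of the new-string approach (my code, not the patch itself), assuming, as at the quoted call site ToLower(termAtt.Buffer(), 0, termAtt.Length), that offset is 0 so that limit equals the number of chars to convert:

```csharp
using System;
using System.Diagnostics;

// Sketch of the proposed replacement: one string allocation per token
// instead of one int[1] + string per code point.
static void ToLowerViaString(char[] buffer, int offset, int limit)
{
    Debug.Assert(buffer.Length >= offset + limit);
    Debug.Assert(offset >= 0 && offset <= buffer.Length);

    string lowered = new string(buffer, offset, limit).ToLowerInvariant();

    // For a few code points, case conversion can change the string length;
    // copy only what fits so we never overrun the caller's buffer.
    lowered.CopyTo(0, buffer, offset, Math.Min(lowered.Length, limit));
}

char[] buffer = "HELLO, WORLD".ToCharArray();
ToLowerViaString(buffer, 0, buffer.Length);
Console.WriteLine(new string(buffer)); // hello, world
```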
