Why not do this instead?
public void ToLower(char[] buffer, int offset, int limit)
{
    Debug.Assert(buffer.Length >= limit);
    Debug.Assert(offset >= 0 && offset <= buffer.Length);
    for (int i = offset; i < limit; i++)
    {
        // Lowercase in place, one UTF-16 char at a time: no temporaries.
        buffer[i] = char.ToLowerInvariant(buffer[i]);
    }
}
This does zero allocations altogether.
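A quick sanity check (assuming the ToLower method above is in scope):

var buffer = "Hello WORLD".ToCharArray();
ToLower(buffer, 0, buffer.Length);
Console.WriteLine(new string(buffer)); // prints "hello world"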
Better yet, see:
https://github.com/ravendb/ravendb/blob/v4.0/src/Raven.Server/Documents/DocumentKeyWorker.cs#L35
for (var i = 0; i < key.Length; i++)
{
    var ch = key[i];
    if (ch > 127) // not ASCII, use slower mode
        goto UnlikelyUnicode;
    if ((ch >= 65) && (ch <= 90)) // 'A'..'Z'
        buffer[i] = (byte)(ch | 0x20); // setting bit 0x20 lowercases ASCII
    else
        buffer[i] = (byte)ch;
}
For the common case of ASCII chars, this is much more efficient.
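For illustration, here is a self-contained sketch of the same idea; the UnlikelyUnicode fallback below is my assumption for completeness, not the actual RavenDB code behind the link:

using System.Text;

static class KeyLower
{
    // ASCII fast path: lowercase 'A'..'Z' by setting bit 0x20;
    // anything non-ASCII bails out to a general-purpose slow path.
    public static byte[] ToLowerKey(string key)
    {
        var buffer = new byte[key.Length];
        for (var i = 0; i < key.Length; i++)
        {
            var ch = key[i];
            if (ch > 127) // not ASCII, use slower mode
                goto UnlikelyUnicode;
            if ((ch >= 65) && (ch <= 90)) // 'A'..'Z'
                buffer[i] = (byte)(ch | 0x20);
            else
                buffer[i] = (byte)ch;
        }
        return buffer;

    UnlikelyUnicode:
        // Rare path: fall back to the (allocating) general routine.
        return Encoding.UTF8.GetBytes(key.ToLowerInvariant());
    }
}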
Hibernating Rhinos Ltd
Oren Eini | CEO | Mobile: +972-52-548-6969
Office: +972-4-622-7811 | Fax: +972-153-4-622-7811
On Wed, Dec 28, 2016 at 10:46 AM, Van Den Berghe, Vincent <
[email protected]> wrote:
> Hi,
>
> I've been doing performance measurements using the latest Lucene.net, and
> profiling with the standard English analyzer (and all analyzers with a
> lower-case filter) indicates that a LOT of time is spent in the
> LowerCaseFilter.IncrementToken() method, doing this:
>
> charUtils.ToLower(termAtt.Buffer(), 0, termAtt.Length);
>
> In my test cases, this dominates the execution time.
> The performance is horrible: inside charUtils.ToLower, for every code
> point in the buffer a one-element int array and a new string containing
> the string representation of that code point are created; the string is
> then lowercased and converted back:
>
> public static int ToLowerCase(int codePoint)
> {
>     return Character.CodePointAt(UnicodeUtil.NewString(new int[1]
>     {
>         codePoint
>     }, 0, 1).ToLowerInvariant(), 0);
> }
>
> This creates heap pressure (the huge number of temporary int[1] arrays
> and string objects fills up Gen0) and is highly inefficient because the
> C# compiler isn't able to eliminate the bounds checks in the inner loops.
> Yes, this is indeed what the Java code does, but in .NET the
> ToLowerInvariant method already takes care of the correct Unicode
> codepoints parsing, so I think we can replace the charUtils.ToLower
> method with the following implementation:
>
> public void ToLower(char[] buffer, int offset, int limit)
> {
>     Debug.Assert(buffer.Length >= limit);
>     Debug.Assert(offset >= 0 && offset <= buffer.Length);
>     new string(buffer, offset, limit - offset).ToLowerInvariant()
>         .CopyTo(0, buffer, offset, limit - offset);
> }
>
> This appears to do exactly the same thing, but much more efficiently.
> Internally, ToLowerInvariant ultimately delegates to a native call
> (COMNlsInfo::InternalChangeCaseString), which uses Windows's LCMapStringEx
> Win32 API and is orders of magnitude faster than anything we can write in
> managed code, even taking the P/Invoke overhead and call setup costs into
> account.
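> (For illustration, a rough micro-benchmark sketch of the difference; the
> loop count and measurement approach are my assumptions, and absolute
> timings will vary by machine:)
>
> using System;
> using System.Diagnostics;
>
> class Bench
> {
>     static void Main()
>     {
>         var text = "The Quick Brown Fox Jumps Over The Lazy Dog".ToCharArray();
>         const int N = 1000000;
>
>         // Old style: a temporary string per code point, as in ToLowerCase above.
>         var a = (char[])text.Clone();
>         var sw = Stopwatch.StartNew();
>         int gen0 = GC.CollectionCount(0);
>         for (int n = 0; n < N; n++)
>             for (int i = 0; i < a.Length; i++)
>                 a[i] = char.ConvertFromUtf32(a[i]).ToLowerInvariant()[0];
>         Console.WriteLine($"per code point: {sw.Elapsed}, Gen0 collections: {GC.CollectionCount(0) - gen0}");
>
>         // Proposed: one string per whole-buffer call.
>         var b = (char[])text.Clone();
>         sw.Restart();
>         gen0 = GC.CollectionCount(0);
>         for (int n = 0; n < N; n++)
>             new string(b, 0, b.Length).ToLowerInvariant().CopyTo(0, b, 0, b.Length);
>         Console.WriteLine($"whole buffer:   {sw.Elapsed}, Gen0 collections: {GC.CollectionCount(0) - gen0}");
>     }
> }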
> After this change, the path through charUtils.ToLower no longer dominates
> the execution time.
>
> Just sayin' <g>
>
>
> Vincent Van Den Berghe
>