I don't think that a per-character call to char.ToLowerInvariant is identical to a call to string.ToLowerInvariant() on a string built from the character array to be converted. The two agree only when every char represents a complete code point.
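For instance (a sketch of my own; the code point U+10400, DESERET CAPITAL LETTER LONG I, is just a convenient example of a cased letter outside the BMP):

```csharp
using System;
using System.Linq;

// U+10400 (DESERET CAPITAL LETTER LONG I) lies outside the BMP, so .NET
// stores it as a surrogate pair: two chars encoding one code point.
string s = char.ConvertFromUtf32(0x10400);
Console.WriteLine(s.Length); // 2

// Per-char conversion sees two surrogate halves. Neither half has a
// lowercase mapping, so char.ToLowerInvariant returns each one unchanged.
string perChar = new string(s.Select(char.ToLowerInvariant).ToArray());
Console.WriteLine(perChar == s); // True: the conversion was silently skipped

// Whole-string conversion sees the full code point and can map it to
// U+10428 (its lowercase form) on runtimes with full Unicode case data.
string wholeString = s.ToLowerInvariant();
Console.WriteLine(wholeString == s); // False on such runtimes
```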
A single char corresponds to exactly one code point only for characters in the BMP (Basic Multilingual Plane). Otherwise, this code is invalid: a character may be the first half of a surrogate pair, in which case the character following it belongs to the same code point. Lowercasing character by character may convert such cases incorrectly. See
https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069(v=vs.85).aspx
and
https://msdn.microsoft.com/en-us/library/system.char.convertfromutf32(v=vs.110).aspx
for some background on the matter. .NET calls two distinct APIs for converting strings and characters: one is not implemented in terms of the other.

You indirectly bring up an excellent point: my code assumes that the string constructed from the char[] is in canonical composition. I think this is the case, but I'm not familiar enough with Lucene's handling of characters to be sure. Otherwise, one needs to insert Normalize(NormalizationForm.FormC) before the call to ToLowerInvariant(). But it may be overkill.

Vincent

-----Original Message-----
From: Oren Eini (Ayende Rahien) [mailto:[email protected]]
Sent: Wednesday, December 28, 2016 10:27 AM
To: [email protected]
Subject: Re: Proposal to speed up implementation of LowercaseFilter/charUtils.ToLower

Why not?

public void ToLower(char[] buffer, int offset, int limit)
{
    Debug.Assert(buffer.Length >= limit);
    Debug.Assert(offset >= 0 && offset <= buffer.Length);

    for (int i = offset; i < limit; i++)
    {
        buffer[i] = char.ToLowerInvariant(buffer[i]);
    }
}

This does zero allocations altogether.

Better yet, see:
https://github.com/ravendb/ravendb/blob/v4.0/src/Raven.Server/Documents/DocumentKeyWorker.cs#L35

for (var i = 0; i < key.Length; i++)
{
    var ch = key[i];
    if (ch > 127) // not ASCII, use slower mode
        goto UnlikelyUnicode;

    if ((ch >= 65) && (ch <= 90))
        buffer[i] = (byte)(ch | 0x20);
    else
        buffer[i] = (byte)ch;
}

For the common case of ASCII chars, this is much more efficient.
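For what it's worth, the two quoted snippets can be combined. Here is a sketch of my own (neither RavenDB's nor Lucene.NET's code) that lowercases ASCII in place with zero allocations and falls back to string.ToLowerInvariant() only from the first non-ASCII char onward, so surrogate pairs are still handled correctly:

```csharp
using System;

// Hypothetical hybrid: ASCII fast path, allocation only on the slow path.
// Follows Lucene's convention that `limit` is an end index (loop runs i < limit).
static void ToLowerHybrid(char[] buffer, int offset, int limit)
{
    for (int i = offset; i < limit; i++)
    {
        char ch = buffer[i];
        if (ch > 127)
        {
            // Non-ASCII: let the runtime lowercase the rest of the buffer,
            // surrogate pairs included.
            string lowered = new string(buffer, i, limit - i).ToLowerInvariant();
            lowered.CopyTo(0, buffer, i, Math.Min(lowered.Length, limit - i));
            return;
        }
        if (ch >= 'A' && ch <= 'Z')
            buffer[i] = (char)(ch | 0x20); // ASCII fast path, no allocation
    }
}

char[] ascii = "FAST PATH".ToCharArray();
ToLowerHybrid(ascii, 0, ascii.Length);
Console.WriteLine(new string(ascii)); // fast path

char[] mixed = "CAFÉ".ToCharArray();
ToLowerHybrid(mixed, 0, mixed.Length);
Console.WriteLine(new string(mixed)); // café
```

The Math.Min guard is there because, for a few code points, case conversion can change the string length; truncating is a simplification, not what a production filter should do.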
*Hibernating Rhinos Ltd*
Oren Eini l CEO l Mobile: +972-52-548-6969 Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

On Wed, Dec 28, 2016 at 10:46 AM, Van Den Berghe, Vincent <
[email protected]> wrote:

> Hi,
>
> I've been doing performance measurements using the latest Lucene.net,
> and profiling with the standard English analyzer (and all analyzers
> with a lowercase filter) indicates that a LOT of time is spent in the
> LowerCaseFilter.IncrementToken() method, doing this:
>
>     charUtils.ToLower(termAtt.Buffer(), 0, termAtt.Length);
>
> In my test cases, this dominates the execution time.
> The performance is horrible: inside charUtils.ToLower, for every code
> point in the buffer, a one-element int array and a new string holding
> the string representation of that code point are created; the string
> is then lowercased and converted back:
>
>     public static int ToLowerCase(int codePoint)
>     {
>         return Character.CodePointAt(UnicodeUtil.NewString(new int[1]
>         {
>             codePoint
>         }, 0, 1).ToLowerInvariant(), 0);
>     }
>
> This creates heap pressure (due to the huge number of temporary int[1]
> and string objects that fill up Gen0) and is highly inefficient
> because of the inner loops for which the C# compiler isn't able to
> eliminate the bounds checks.
> Yes, this is indeed what the Java code does, but in .NET the
> ToLowerInvariant method already takes care of correct Unicode
> code point parsing, so I think we can replace the charUtils.ToLower
> method with the following implementation:
>
>     public void ToLower(char[] buffer, int offset, int limit)
>     {
>         Debug.Assert(buffer.Length >= limit);
>         Debug.Assert(offset >= 0 && offset <= buffer.Length);
>
>         new string(buffer, offset, limit).ToLowerInvariant()
>             .CopyTo(0, buffer, offset, limit);
>     }
>
> This appears to do exactly the same thing, but much more efficiently.
> Internally, ToLowerInvariant ultimately delegates to a native call
> (COMNlsInfo::InternalChangeCaseString), which uses Windows's
> LCMapStringEx Win32 API and is orders of magnitude faster than
> anything we can write in managed code, even taking the P/Invoke
> overhead and call setup costs into account.
> After this change, the path through charUtils.ToLower no longer
> dominates the execution time.
>
> Just sayin' <g>
>
> Vincent Van Den Berghe
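As a side note on the quoted proposal: here is a self-contained sketch of the new-string approach (my code, not the patch itself), assuming, as at the quoted call site ToLower(termAtt.Buffer(), 0, termAtt.Length), that offset is 0 so that limit equals the number of chars to convert:

```csharp
using System;
using System.Diagnostics;

// Sketch of the proposed replacement: one string allocation per token
// instead of one int[1] + string per code point.
static void ToLowerViaString(char[] buffer, int offset, int limit)
{
    Debug.Assert(buffer.Length >= offset + limit);
    Debug.Assert(offset >= 0 && offset <= buffer.Length);

    string lowered = new string(buffer, offset, limit).ToLowerInvariant();

    // For a few code points, case conversion can change the string length;
    // copy only what fits so we never overrun the caller's buffer.
    lowered.CopyTo(0, buffer, offset, Math.Min(lowered.Length, limit));
}

char[] buffer = "HELLO, WORLD".ToCharArray();
ToLowerViaString(buffer, 0, buffer.Length);
Console.WriteLine(new string(buffer)); // hello, world
```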
