Great catch! This is similar to the char sequence, which can sometimes be replaced by StringBuilder. The .NET platform does many things better natively, so whenever it's possible to take advantage of that, it would be great to do so.

Sent from my iPhone

> On 28 Dec 2016, at 10:46, Van Den Berghe, Vincent
> <[email protected]> wrote:
>
> Hi,
>
> I've been doing performance measurements using the latest Lucene.net, and
> profiling with the standard English analyzer (and all analyzers with a lower
> case filter) indicates that a LOT of time is spent in the
> LowerCaseFilter.IncrementToken() method, doing this:
>
>     charUtils.ToLower(termAtt.Buffer(), 0, termAtt.Length);
>
> In my test cases, this dominates the execution time.
> The performance is horrible since inside charUtils.ToLower, for every code
> point in the buffer a 1-integer array and a new string containing the string
> representation of that code point are created, which is subsequently
> lowercased and converted back:
>
>     public static int ToLowerCase(int codePoint)
>     {
>         return Character.CodePointAt(UnicodeUtil.NewString(new int[1]
>         {
>             codePoint
>         }, 0, 1).ToLowerInvariant(), 0);
>     }
>
> This creates heap pressure (due to the huge number of temporary int[1] and
> string objects that fill up Gen0) and is highly inefficient because of the
> inner loops for which the C# compiler isn't able to eliminate the bounds
> checks.
> Yes, this is indeed what the Java code does, but in .NET the ToLowerInvariant
> method already takes care of the correct Unicode code point parsing, so I
> think we can replace the charUtils.ToLower method with the following
> implementation:
>
>     public void ToLower(char[] buffer, int offset, int limit)
>     {
>         Debug.Assert(buffer.Length >= limit);
>         Debug.Assert(offset >= 0 && offset <= buffer.Length);
>         new string(buffer, offset, limit).ToLowerInvariant().CopyTo(0,
>             buffer, offset, limit);
>     }
>
> This appears to do exactly the same thing, but much more efficiently.
> Internally, the ToLowerInvariant ultimately delegates to a native call
> (COMNlsInfo::InternalChangeCaseString) which uses Windows's LCMapStringEx
> Win32 API and is orders of magnitude faster than anything we can write in
> managed code, even taking the P/Invoke overhead and call setup costs into
> account.
> After this change, the path through charUtils.ToLower no longer dominates the
> execution time.
>
> Just sayin' <g>
>
>
> Vincent Van Den Berghe
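
For reference, here is a minimal sketch of the bulk-lowercasing idea from the quoted message, written so that it also behaves when offset is not zero. It is not the actual Lucene.NET code: the class name LowerCaseSketch is hypothetical, and it assumes that limit is an exclusive end index (as in the Java original) rather than a character count, and that ToLowerInvariant never changes the number of UTF-16 code units.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for illustration only; not part of Lucene.NET.
public static class LowerCaseSketch
{
    // Lowercases buffer[offset..limit) in place.
    // Assumption: 'limit' is an exclusive end index, not a length.
    public static void ToLower(char[] buffer, int offset, int limit)
    {
        Debug.Assert(buffer != null);
        Debug.Assert(offset >= 0 && offset <= limit);
        Debug.Assert(limit <= buffer.Length);

        int length = limit - offset;

        // One bulk call replaces the per-code-point int[1] and string allocations.
        string lowered = new string(buffer, offset, length).ToLowerInvariant();

        // Assumption: invariant lowercasing keeps the UTF-16 length unchanged,
        // so the result fits back into the same slice.
        Debug.Assert(lowered.Length == length);
        lowered.CopyTo(0, buffer, offset, lowered.Length);
    }
}

public static class Program
{
    public static void Main()
    {
        char[] buf = "XXHELLO WorldYY".ToCharArray();
        LowerCaseSketch.ToLower(buf, 2, 13);   // only indices 2..12 are touched
        Console.WriteLine(new string(buf));    // prints "XXhello worldYY"
    }
}
```

This still allocates two temporary strings per call, but that is per token rather than per code point, which is where the Gen0 pressure described above came from.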
