Hi,

I've been doing performance measurements using the latest Lucene.net, and 
profiling with the standard English analyzer (or any analyzer that includes a 
lowercase filter) indicates that a LOT of time is spent in the 
LowerCaseFilter.IncrementToken() method, doing this:

charUtils.ToLower(termAtt.Buffer(), 0, termAtt.Length);

In my test cases, this dominates the execution time.
The performance is horrible because, inside charUtils.ToLower, every code 
point in the buffer goes through the method below, which allocates a 
one-element int array and a new string holding that code point, lowercases 
the string, and converts the result back to a code point:

    public static int ToLowerCase(int codePoint)
    {
        return Character.CodePointAt(
            UnicodeUtil.NewString(new int[1] { codePoint }, 0, 1).ToLowerInvariant(), 0);
    }

This creates heap pressure (the huge number of temporary int[1] and string 
objects fills up Gen0) and is highly inefficient because of the inner loops, 
for which the JIT compiler isn't able to eliminate the bounds checks.
Yes, this is indeed what the Java code does, but in .NET the ToLowerInvariant 
method already takes care of parsing Unicode code points correctly, so I think 
we can replace the charUtils.ToLower method with the following implementation:

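        public void ToLower(char[] buffer, int offset, int limit)
        {
            Debug.Assert(buffer.Length >= limit);
            Debug.Assert(offset >= 0 && offset <= buffer.Length);
            // limit is taken to be an exclusive end index, as the first
            // assertion suggests; with offset == 0 it equals the length.
            int length = limit - offset;
            new string(buffer, offset, length).ToLowerInvariant()
                .CopyTo(0, buffer, offset, length);
        }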

This appears to do exactly the same thing, but much more efficiently. 
Internally, ToLowerInvariant ultimately delegates to a native call 
(COMNlsInfo::InternalChangeCaseString), which uses the Windows LCMapStringEx 
Win32 API and is orders of magnitude faster than anything we can write in 
managed code, even taking the P/Invoke overhead and call setup costs into 
account.
After this change, the path through charUtils.ToLower no longer dominates the 
execution time.
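
For anyone who wants to reproduce the comparison outside Lucene.net, here is a 
minimal sketch of a timing harness. All names in it (LowerCaseBench, 
ToLowerPerCodePoint, ToLowerWholeBuffer, Measure) are hypothetical. The 
per-code-point method is only a stand-in that allocates temporary strings per 
code point; it is not the actual CharacterUtils code, and limit is assumed to 
be an exclusive end index.

```csharp
using System;
using System.Diagnostics;

public static class LowerCaseBench
{
    // Stand-in for the old path (assumption, not the Lucene.net code):
    // one temporary string is created for every code point in the range.
    public static void ToLowerPerCodePoint(char[] buffer, int offset, int limit)
    {
        for (int i = offset; i < limit; )
        {
            bool isPair = char.IsHighSurrogate(buffer[i])
                && i + 1 < limit
                && char.IsLowSurrogate(buffer[i + 1]);
            int width = isPair ? 2 : 1;
            new string(buffer, i, width).ToLowerInvariant().CopyTo(0, buffer, i, width);
            i += width;
        }
    }

    // Whole-buffer approach: at most two temporary strings per call,
    // however many code points the range contains.
    public static void ToLowerWholeBuffer(char[] buffer, int offset, int limit)
    {
        int length = limit - offset;
        new string(buffer, offset, length).ToLowerInvariant().CopyTo(0, buffer, offset, length);
    }

    // Runs 'toLower' on a freshly refilled copy of 'term', 'iterations' times,
    // and returns the total elapsed wall-clock time.
    public static TimeSpan Measure(Action<char[], int, int> toLower, string term, int iterations)
    {
        char[] buffer = new char[term.Length];
        Stopwatch sw = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
        {
            term.CopyTo(0, buffer, 0, term.Length);
            toLower(buffer, 0, term.Length);
        }
        sw.Stop();
        return sw.Elapsed;
    }
}
```

Calling LowerCaseBench.Measure(LowerCaseBench.ToLowerPerCodePoint, term, 1000000) 
and the same with ToLowerWholeBuffer on a typical term gives two timings to 
compare. The numbers will depend on the runtime and the input, so treat them as 
a rough indication rather than a proper benchmark.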

Just sayin' <g>


Vincent Van Den Berghe
