Hi,
I've been doing performance measurements with the latest Lucene.net, and
profiling with the standard English analyzer (and, in fact, with any analyzer
that uses a lower case filter) indicates that a LOT of time is spent in the
LowerCaseFilter.IncrementToken() method, doing this:
charUtils.ToLower(termAtt.Buffer(), 0, termAtt.Length);
In my test cases, this dominates the execution time.
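For reference, a minimal driver along these lines is enough to hit that path
repeatedly (just a sketch: the LuceneVersion value, field name and sample text
are arbitrary and may need adjusting for your setup):

using System;
using System.Diagnostics;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

static class LowerCaseBench
{
    static void Main()
    {
        var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
        const string text = "The Quick Brown Fox Jumps Over The Lazy Dog";
        var sw = Stopwatch.StartNew();
        for (int n = 0; n < 100000; n++)
        {
            using (TokenStream ts = analyzer.GetTokenStream("f", new StringReader(text)))
            {
                ts.Reset();
                while (ts.IncrementToken())
                {
                    // LowerCaseFilter.IncrementToken() runs in here and calls
                    // charUtils.ToLower(termAtt.Buffer(), 0, termAtt.Length) for each token.
                }
                ts.End();
            }
        }
        Console.WriteLine(sw.Elapsed);
    }
}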
The performance is horrible because inside charUtils.ToLower, for every code
point in the buffer a 1-element int array and a new string containing just that
code point are allocated; the string is then lowercased and converted back to a
code point:
public static int ToLowerCase(int codePoint)
{
    return Character.CodePointAt(
        UnicodeUtil.NewString(new int[1] { codePoint }, 0, 1).ToLowerInvariant(), 0);
}
This creates heap pressure (the huge number of temporary int[1] and string
objects fills up Gen0) and is highly inefficient because the JIT compiler
cannot eliminate the bounds checks in the inner loops.
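To see the cost of that allocation pattern in isolation, here is a small
self-contained sketch contrasting one temporary string per character (a rough
stand-in for the current per-code-point path, minus the int[1] allocation and
surrogate handling) with a single temporary string for the whole buffer; it is
only an approximation of the real code paths, but running it under a profiler
or with GC counters shows the Gen0 churn clearly:

using System;
using System.Diagnostics;

static class CaseMappingBench
{
    static void Main()
    {
        char[] buffer = "The Quick Brown Fox Jumps Over The Lazy Dog".ToCharArray();
        const int iterations = 1000000;

        var sw = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
            PerCharToLower(buffer, 0, buffer.Length);
        Console.WriteLine("per character: {0}", sw.Elapsed);

        sw.Restart();
        for (int n = 0; n < iterations; n++)
            WholeBufferToLower(buffer, 0, buffer.Length);
        Console.WriteLine("whole buffer:  {0}", sw.Elapsed);
    }

    // One temporary string per character, mimicking the allocation pattern of the
    // current per-code-point path (the int[1] allocation and surrogate-pair
    // handling of the real code are omitted for brevity).
    static void PerCharToLower(char[] buffer, int offset, int limit)
    {
        for (int i = offset; i < limit; i++)
            buffer[i] = new string(buffer[i], 1).ToLowerInvariant()[0];
    }

    // One temporary string for the whole range, as in the replacement proposed below.
    static void WholeBufferToLower(char[] buffer, int offset, int limit)
    {
        new string(buffer, offset, limit).ToLowerInvariant().CopyTo(0, buffer, offset, limit);
    }
}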
Yes, this is indeed what the Java code does, but in .NET the ToLowerInvariant
method already takes care of parsing Unicode code points correctly, so I think
we can replace the charUtils.ToLower method with the following implementation:
public void ToLower(char[] buffer, int offset, int limit)
{
    Debug.Assert(buffer.Length >= limit);
    Debug.Assert(offset >= 0 && offset <= buffer.Length);
    new string(buffer, offset, limit).ToLowerInvariant()
        .CopyTo(0, buffer, offset, limit);
}
This appears to do exactly the same thing, but much more efficiently.
Internally, ToLowerInvariant ultimately delegates to a native call
(COMNlsInfo::InternalChangeCaseString), which uses the Win32 LCMapStringEx API
and is orders of magnitude faster than anything we can write in managed code,
even taking the P/Invoke overhead and call setup costs into account.
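Here is a quick self-contained way to sanity-check the "does exactly the same
thing" part on a few non-ASCII samples, comparing the proposed whole-buffer
lowering against a per-code-point reference (char.ConvertToUtf32 and
char.ConvertFromUtf32 stand in for Character.CodePointAt and
UnicodeUtil.NewString, so this approximates the current path rather than
quoting the actual Lucene code):

using System;

static class ToLowerEquivalenceCheck
{
    static void Main()
    {
        // Samples include accented characters and a surrogate pair
        // (U+10400, DESERET CAPITAL LETTER LONG I).
        string[] samples = { "HeLLo WoRLD", "ÉCOLE Straße", "\U00010400 DESERET" };
        foreach (string s in samples)
        {
            char[] a = s.ToCharArray();
            char[] b = s.ToCharArray();
            ProposedToLower(a, 0, a.Length);
            PerCodePointToLower(b, 0, b.Length);
            Console.WriteLine("{0} -> {1} (match: {2})",
                s, new string(a), new string(a) == new string(b));
        }
    }

    // The proposed replacement: lowercase the whole range in one shot.
    static void ProposedToLower(char[] buffer, int offset, int limit)
    {
        new string(buffer, offset, limit).ToLowerInvariant()
            .CopyTo(0, buffer, offset, limit);
    }

    // Reference: lowercase one code point at a time, which is what the current
    // path effectively computes.
    static void PerCodePointToLower(char[] buffer, int offset, int limit)
    {
        for (int i = offset; i < limit; )
        {
            int width = (char.IsHighSurrogate(buffer[i]) && i + 1 < limit) ? 2 : 1;
            int codePoint = char.ConvertToUtf32(new string(buffer, i, width), 0);
            // Simple case mapping keeps the UTF-16 width of a single code point,
            // so copying 'width' chars back is safe for these samples.
            char.ConvertFromUtf32(codePoint).ToLowerInvariant().CopyTo(0, buffer, i, width);
            i += width;
        }
    }
}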
After this change, the path through charUtils.ToLower no longer dominates the
execution time.
Just sayin' <g>
Vincent Van Den Berghe