Karl von Randow created LUCENE-7525:
---------------------------------------
Summary: ASCIIFoldingFilter.foldToASCII performance issue due to
large compiled method size
Key: LUCENE-7525
URL: https://issues.apache.org/jira/browse/LUCENE-7525
Project: Lucene - Core
Issue Type: Improvement
Components: modules/analysis
Affects Versions: 6.2.1
Reporter: Karl von Randow
The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch statement
and is too large for the HotSpot compiler to compile, which causes a
performance problem.
The method is about 13 KB of bytecode when compiled, versus the 8 KB HotSpot
limit, so splitting the method in half works around the problem.
In my tests splitting the method in half resulted in a 5X performance increase.
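As a rough illustration of the split (not the actual patch), the sketch below keeps the ASCII shortcut in a small entry method and dispatches everything else to two smaller switch methods by character range. The class name {{SplitFoldSketch}}, the helper names {{foldLatin1}} and {{foldOther}}, the split point at 0x100 and the handful of mappings are all assumptions made for the example; the real method covers far more characters.

```java
// Illustrative sketch only: a tiny folding routine split across two
// smaller methods so that each one stays well below the JIT size limit.
// All names and the split point are hypothetical, not Lucene code.
final class SplitFoldSketch {

    // Entry point: same parameter order as foldToASCII, returns the new output length.
    static int fold(char[] input, int inputPos, char[] output, int outputPos, int length) {
        final int end = inputPos + length;
        for (int pos = inputPos; pos < end; ++pos) {
            final char c = input[pos];
            if (c < '\u0080') {
                // ASCII shortcut: copy the character unchanged.
                output[outputPos++] = c;
            } else if (c < '\u0100') {
                outputPos = foldLatin1(c, output, outputPos);
            } else {
                outputPos = foldOther(c, output, outputPos);
            }
        }
        return outputPos;
    }

    // First half of the switch: characters in 0x80..0xFF.
    private static int foldLatin1(char c, char[] output, int outputPos) {
        switch (c) {
            case '\u00E9': output[outputPos++] = 'e'; break; // e with acute
            case '\u00FA':                                   // u with acute
            case '\u00FC': output[outputPos++] = 'u'; break; // u with diaeresis
            case '\u00E4': output[outputPos++] = 'a'; break; // a with diaeresis
            case '\u00F8': output[outputPos++] = 'o'; break; // o with stroke
            case '\u00DF':                                   // sharp s becomes "ss"
                output[outputPos++] = 's';
                output[outputPos++] = 's';
                break;
            default: output[outputPos++] = c;                // no mapping: copy as-is
        }
        return outputPos;
    }

    // Second half of the switch: characters at 0x100 and above.
    private static int foldOther(char c, char[] output, int outputPos) {
        switch (c) {
            case '\u0153':                                   // oe ligature becomes "oe"
                output[outputPos++] = 'o';
                output[outputPos++] = 'e';
                break;
            case '\u0142': output[outputPos++] = 'l'; break; // l with stroke
            default: output[outputPos++] = c;                // no mapping: copy as-is
        }
        return outputPos;
    }
}
```

The output array needs the same headroom as in {{foldToASCII}}, since a single input character can expand to several output characters.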
In the test code below you can see how slow the fold method is, even when it
takes the shortcut for characters below 0x80, compared to an inline
implementation of the same shortcut.
So a workaround is to split the method. I'm happy to provide a patch. It's a
hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input
file as per SOLR-2013 would be a better replacement for this method in this
class?
{code:java}
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.junit.Assert;
import org.junit.Test;

public class ASCIIFoldingFilterPerformanceTest {

    private static final int ITERATIONS = 1_000_000;

    // Pure ASCII input: every character takes the < 0x80 shortcut.
    @Test
    public void testFoldShortString() {
        char[] input = "testing".toCharArray();
        char[] output = new char[input.length * 4];
        for (int i = 0; i < ITERATIONS; i++) {
            ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, input.length);
        }
    }

    // Accented input: every character goes through the switch statement.
    @Test
    public void testFoldShortAccentedString() {
        char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
        char[] output = new char[input.length * 4];
        for (int i = 0; i < ITERATIONS; i++) {
            ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, input.length);
        }
    }

    // Inline copy of the < 0x80 shortcut, for comparison.
    @Test
    public void testManualFoldTinyString() {
        char[] input = "t".toCharArray();
        char[] output = new char[input.length * 4];
        for (int i = 0; i < ITERATIONS; i++) {
            int k = 0;
            for (int j = 0; j < 1; ++j) {
                final char c = input[j];
                if (c < '\u0080') {
                    output[k++] = c;
                } else {
                    Assert.fail();
                }
            }
        }
    }
}
{code}