Just for completeness, here is one more approach (I expect the speed-up to be, at best, barely measurable in the big picture).
I had a very good experience with a simple Bloom filter that "approximately hashes" the characters handled by the switch statement. If the filter contains the current char, we go on and execute the switch; if not, we simply move on. Even with a larger number of false positives, it works faster in the average case. This depends heavily on the number of chars in the switch() statement, but if that number is large we can widen the filter from int to long (64 bits instead of 32) to reduce the number of false positives. I have not tried this approach on this concrete example, but on a very similar situation.

Something along these lines:

    // Sets one bit per character, chosen by the low 5 bits of the char.
    private static int buildFilter(final char[] s, final int len) {
        int i = len, bFilter = 0;
        while (i-- != 0) {
            bFilter |= 1 << (s[i] & 0x1f);
        }
        return bFilter;
    }

and then you need to check:

    char c = ...
    // A clear bit means c is definitely not one of the switch characters.
    if ((bFilter & (1 << (c & 0x1f))) == 0)

----- Original Message ----
From: Dawid Weiss (JIRA) <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Tuesday, 21 August, 2007 10:51:31 AM
Subject: [jira] Commented: (LUCENE-871) ISOLatin1AccentFilter a bit slow

    [ https://issues.apache.org/jira/browse/LUCENE-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521361 ]

Dawid Weiss commented on LUCENE-871:
------------------------------------

I was a bit curious about it, so I decided to write a table-lookup version. It does come out faster, but only by a small margin (especially on "server", hotspot JVMs). Timings are in milliseconds; each round consisted of 100000 repetitions of parsing the test string "Des mot clés À LA CHAÎNE À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Þ Ù Ú Û Ü Ý à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ß þ ù ú û ü ý ÿ". Note that it is biased, since most characters do have accents, which will not be the case in real life, I guess...
but still:

// SUN JVM build 1.6.0-b105, -server mode
Round (old): 1922 1688 1656 1687 1641 1703 1672 1672 1687 1719
Round (new): 1719 1609 1609 1594 1625 1578 1625 1594 1625 1656

// SUN JVM, 1.6.0, interpreted (-client)
Round (old): 2391 2453 2359 2375 2391 2359 2156 2532 2422 2359
Round (new): 1969 1906 1922 1937 1985 1922 1906 1937 1985 1922

// IBM JVM 1.5.0 (don't know why it's so sluggish, really).
Round (old): 7906 7188 7625 7312 7266 7141 7015 5641 5578 5672
Round (new): 4656 4406 4516 4516 4375 4375 4343 4297 4344 4266

// IBM 1.5.0, -server (note the speed improvement when the old version is hotspot-optimized).
Round (old): 5922 5078 5078 5062 4985 4875 4953 4641 3640 3735
Round (new): 3750 3781 3656 3516 3515 3594 3547 3562 3532 3531

So... it does come out a bit faster. Whether it makes sense to waste 130 kb of memory for this improvement... don't know, really. I'll upload the table-lookup version for your reference.
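
For readers without the attachment at hand, a table-lookup filter of the kind described above might look roughly like the sketch below. This is not the uploaded patch: the class name `AccentTableSketch`, the `map` helper and the short list of replacements are all hypothetical, and any replacement that expands to more than one character is left out. The table has one slot per UTF-16 code unit, so it occupies 65536 * 2 bytes = 128 KB, which is in the same range as the figure quoted above; whether the patch is laid out this way is an assumption.

```java
final class AccentTableSketch {
    // One slot per UTF-16 code unit: 65536 entries * 2 bytes = 128 KB.
    private static final char[] TABLE = new char[Character.MAX_VALUE + 1];

    static {
        // Start from the identity mapping so unlisted characters pass through.
        for (int i = 0; i < TABLE.length; i++) {
            TABLE[i] = (char) i;
        }
        // Hypothetical subset of single-character replacements (assumption,
        // not the full mapping of the real filter).
        map("\u00C0\u00C1\u00C2\u00C3\u00C4\u00C5", 'A');
        map("\u00C7", 'C');
        map("\u00C8\u00C9\u00CA\u00CB", 'E');
        map("\u00CC\u00CD\u00CE\u00CF", 'I');
        map("\u00E0\u00E1\u00E2\u00E3\u00E4\u00E5", 'a');
        map("\u00E7", 'c');
        map("\u00E8\u00E9\u00EA\u00EB", 'e');
        map("\u00EC\u00ED\u00EE\u00EF", 'i');
    }

    // Registers the same plain replacement for every char in 'accented'.
    private static void map(String accented, char plain) {
        for (int i = 0; i < accented.length(); i++) {
            TABLE[accented.charAt(i)] = plain;
        }
    }

    // Replaces each char by its table entry: one array read, no branching.
    static String removeAccents(String input) {
        char[] out = input.toCharArray();
        for (int i = 0; i < out.length; i++) {
            out[i] = TABLE[out[i]];
        }
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(removeAccents("Des mots cl\u00E9s \u00C0 LA CHA\u00CENE"));
    }
}
```

The trade-off is the one raised in the comment: every character costs a single array read regardless of whether it is accented, at the price of keeping the full table in memory.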
> ISOLatin1AccentFilter a bit slow
> --------------------------------
>
> Key: LUCENE-871
> URL: https://issues.apache.org/jira/browse/LUCENE-871
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 1.9, 2.0.0, 2.0.1, 2.1, 2.2
> Reporter: Ian Boston
> Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: fasterisoremove1.patch, fasterisoremove2.patch, ISOLatin1AccentFilter.java.patch, LUCENE-871.take4.patch
>
>
> The ISOLatin1AccentFilter is a bit slow giving 300+ ms responses when used in a highlighter for output responses.
> Patch to follow

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]