Re: [jira] Commented: (LUCENE-871) ISOLatin1AccentFilter a bit slow

eks dev Tue, 21 Aug 2007 02:49:30 -0700

Just for completeness of the approaches (I think the speed-up to expect will 
be, in the best case, barely measurable considering the big picture):


I had a very nice experience with a simple Bloom filter that "approximately 
hashes" the characters that appear as cases in the switch statement.
If the Bloom filter contains the current char, we go and execute the switch; 
if not, we simply move on. Even with a fairly large number of false positives, 
it works faster in the average case. This depends heavily on the number of 
chars in the switch() statement, but if that number is large we can widen the 
filter from int to long (64 bits) in order to reduce the number of false 
positives. 

I have not tried this approach on this concrete example, but on a very similar 
situation.


something along the lines:

private static int buildFilter( final char[] s, final int len ) {
    int i = len, bFilter = 0;
    // set one bit per character, indexed by its low 5 bits
    while ( i-- != 0 ) bFilter |= 1 << ( s[ i ] & 0x1f );
    return bFilter;
}


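and then you need to check:

char c = ...; // the char to check
if ( ( bFilter & ( 1 << ( c & 0x1f ) ) ) == 0 ) {
    // c is definitely not one of the switch cases, skip the switch
} else {
    // c may be one of them (or a false positive), run the switch
}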
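To make the idea concrete, here is a minimal, self-contained sketch of the filter guarding a switch. It is not taken from any Lucene patch: the class name, the five-character subset in SPECIAL and the removeAccents/mightContain helpers are made up for illustration, and the long variant simply uses 6 bits (0x3f) instead of 5.

```java
final class BloomGuardSketch {
    // Hypothetical subset of accented chars handled by the switch below:
    // U+00C0, U+00C9, U+00E0, U+00E9, U+00E7.
    private static final char[] SPECIAL =
        { '\u00C0', '\u00C9', '\u00E0', '\u00E9', '\u00E7' };

    static final int FILTER = buildFilter(SPECIAL, SPECIAL.length);
    static final long FILTER64 = buildFilter64(SPECIAL, SPECIAL.length);

    // 32-bit filter: one bit per char, indexed by its low 5 bits.
    static int buildFilter(final char[] s, final int len) {
        int bFilter = 0;
        for (int i = 0; i < len; i++) {
            bFilter |= 1 << (s[i] & 0x1f);
        }
        return bFilter;
    }

    // 64-bit variant: low 6 bits, so fewer chars collide on the same bit.
    static long buildFilter64(final char[] s, final int len) {
        long bFilter = 0L;
        for (int i = 0; i < len; i++) {
            bFilter |= 1L << (s[i] & 0x3f);
        }
        return bFilter;
    }

    // false means "definitely not in the set"; true means "maybe".
    static boolean mightContain(final int bFilter, final char c) {
        return (bFilter & (1 << (c & 0x1f))) != 0;
    }

    static boolean mightContain64(final long bFilter, final char c) {
        return (bFilter & (1L << (c & 0x3f))) != 0;
    }

    static String removeAccents(final String in) {
        final StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            final char c = in.charAt(i);
            if (!mightContain(FILTER, c)) {
                out.append(c);      // fast path: the switch is skipped
                continue;
            }
            switch (c) {            // slow path: real hit or false positive
                case '\u00C0': out.append('A'); break;
                case '\u00C9': out.append('E'); break;
                case '\u00E0': out.append('a'); break;
                case '\u00E9': out.append('e'); break;
                case '\u00E7': out.append('c'); break;
                default:       out.append(c);   // false positive, unchanged
            }
        }
        return out.toString();
    }
}
```

A false positive only costs one trip through the switch, never a wrong result, because the default branch copies the char unchanged. With this particular set, 'G' passes the int filter (it shares its low 5 bits with U+00E7) but is rejected by the long one.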


----- Original Message ----
From: Dawid Weiss (JIRA) <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Tuesday, 21 August, 2007 10:51:31 AM
Subject: [jira] Commented: (LUCENE-871) ISOLatin1AccentFilter a bit slow


    [ 
https://issues.apache.org/jira/browse/LUCENE-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521361
 ] 

Dawid Weiss commented on LUCENE-871:
------------------------------------

I was a bit curious about it, so I decided to write a table-lookup version. It 
does come out faster, but only by a small margin (especially on "server", 
hotspot JVMs). 

Timings are in milliseconds, the round consisted of 100000 repetitions of 
parsing the test string "Des mot clés À LA CHAÎNE À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î 
Ï Ð Ñ Ò Ó Ô Õ Ö Ø  Þ Ù Ú Û Ü Ý  à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö 
ø  ß þ ù ú û ü ý ÿ". Note it is biased since most characters do have accents, 
which will not be the case in real life I guess... but still:

// SUN JVM build 1.6.0-b105, -server mode
Round (old): 1922
Round (old): 1688
Round (old): 1656
Round (old): 1687
Round (old): 1641
Round (old): 1703
Round (old): 1672
Round (old): 1672
Round (old): 1687
Round (old): 1719
Round (new): 1719
Round (new): 1609
Round (new): 1609
Round (new): 1594
Round (new): 1625
Round (new): 1578
Round (new): 1625
Round (new): 1594
Round (new): 1625
Round (new): 1656

// SUN JVM, 1.6.0, interpreted (-client)

Round (old): 2391
Round (old): 2453
Round (old): 2359
Round (old): 2375
Round (old): 2391
Round (old): 2359
Round (old): 2156
Round (old): 2532
Round (old): 2422
Round (old): 2359
Round (new): 1969
Round (new): 1906
Round (new): 1922
Round (new): 1937
Round (new): 1985
Round (new): 1922
Round (new): 1906
Round (new): 1937
Round (new): 1985
Round (new): 1922

// IBM JVM 1.5.0 (don't know why it's so sluggish, really).

Round (old): 7906
Round (old): 7188
Round (old): 7625
Round (old): 7312
Round (old): 7266
Round (old): 7141
Round (old): 7015
Round (old): 5641
Round (old): 5578
Round (old): 5672
Round (new): 4656
Round (new): 4406
Round (new): 4516
Round (new): 4516
Round (new): 4375
Round (new): 4375
Round (new): 4343
Round (new): 4297
Round (new): 4344
Round (new): 4266

// IBM 1.5.0, -server (note the speed improvement when the old version is 
hotspot-optimized).

Round (old): 5922
Round (old): 5078
Round (old): 5078
Round (old): 5062
Round (old): 4985
Round (old): 4875
Round (old): 4953
Round (old): 4641
Round (old): 3640
Round (old): 3735
Round (new): 3750
Round (new): 3781
Round (new): 3656
Round (new): 3516
Round (new): 3515
Round (new): 3594
Round (new): 3547
Round (new): 3562
Round (new): 3532
Round (new): 3531

So... it does come out a bit faster. Whether it makes sense to waste 130 kb of 
memory for this improvement.... don't know, really. I'll upload the 
table-lookup version for your reference.
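For readers who do not have the attachment at hand, a table-lookup filter could look roughly like the sketch below. This is only a guess at the general shape, not the uploaded patch: the 130 kb figure suggests one char slot per UTF-16 code unit (65536 * 2 bytes), which is what is assumed here. The class name and the handful of mappings are made up, and replacements that expand to two chars (such as U+00C6 to "AE") are deliberately left out.

```java
final class AccentTableSketch {
    // One slot per UTF-16 code unit: 65536 chars * 2 bytes, roughly 128 KB.
    private static final char[] TABLE = new char[Character.MAX_VALUE + 1];

    static {
        // Start with the identity mapping, then override the accented chars.
        for (int i = 0; i < TABLE.length; i++) {
            TABLE[i] = (char) i;
        }
        // Hypothetical subset of single-char replacements.
        TABLE['\u00C0'] = 'A';
        TABLE['\u00C9'] = 'E';
        TABLE['\u00E0'] = 'a';
        TABLE['\u00E9'] = 'e';
        TABLE['\u00E7'] = 'c';
    }

    // Every char costs exactly one array read, accented or not.
    static String removeAccents(final String in) {
        final char[] out = new char[in.length()];
        for (int i = 0; i < out.length; i++) {
            out[i] = TABLE[in.charAt(i)];
        }
        return new String(out);
    }
}
```

The trade-off is the one discussed above: no branching per char, at the price of keeping the whole table resident in memory.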

> ISOLatin1AccentFilter a bit slow
> --------------------------------
>
>                 Key: LUCENE-871
>                 URL: https://issues.apache.org/jira/browse/LUCENE-871
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.0.1, 2.1, 2.2
>            Reporter: Ian Boston
>            Assignee: Michael McCandless
>             Fix For: 2.3
>
>         Attachments: fasterisoremove1.patch, fasterisoremove2.patch, 
> ISOLatin1AccentFilter.java.patch, LUCENE-871.take4.patch
>
>
> The ISOLatin1AccentFilter is a bit slow giving 300+ ms responses when used in 
> a highlighter for output responses.
> Patch to follow

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





