[ 
https://issues.apache.org/jira/browse/LUCENE-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507233
 ] 

Mark Miller commented on LUCENE-937:
------------------------------------

I am testing on Java 1.5

I tested with LinkedList using get, LinkedList using iterator, ArrayList() 
(defaults to 10), ArrayList(16), ArrayList(30), ArrayList(60)

For a handful of tokens, all methods are about the same speed. (3-10 tokens) 
LinkedList, ArrayList(16) and ArrayList are a smidgen faster than 
ArrayList(30-60) though.

At 15-40 tokens all of the methods are still about the same speed, though 
ArrayList(30) may be a hair faster (than even ArrayList(10))

At 50-300 tokens (weighted towards 50), you see a 47% increase in speed using 
ArrayList(16 or 30) and ArrayList(60), the other methods are about the same 
speed.

At 150-900 tokens (weighted towards 150), you see a 100% increase in speed 
using ArrayList(30) -- ArrayList(60) is no better,
the other methods take twice as long (even ArrayList(10)).


I used Reuters data, shortening it and stiching it together for the various 
sizes.

The first test was the average speed of about 21,000 runs on 21,000 different 
docs.

The second test was the average speed of about 21,000 runs on 21,000 different 
docs.

The third test was the average speed of about 15,000 runs on 15,000 different 
docs.

The fourth test was the average speed of about 5,000 runs on 5,000 different 
docs.


I don't completely understand why, but ArrayList(30) or ArrayList(16) blow 
everything else out of the water. ArrayList(16) is a smidgen faster at very low
token counts, while ArrayList(30) is a smidgen faster above 50 or so tokens. 
The differences are almost negligible though.

ArrayList(30) or ArrayList(16) are approx the same speed as LinkedList (get and 
iterator are always approx the same speed)
 at low numbers and get better and better VERY quickly as the number of tokens 
goes up.

I'd recommend ArrayList(16) as the best choice for this class. It is no worse 
than LinkedList on very small documents, and much better than LinkedList on 
small to large documents.

A friend was recently telling me that ArrayList defaulted to 16, but it does 
not -- it defaults to 10. He must have been confused and known that it *should* 
default to 16. 16 is a much better default number than 10 due to the 
exponential growth when doubling the array on resize. It may take a bit more 
memory, but it gets a LOT more speed.

- Mark

> Make CachingTokenFilter faster
> ------------------------------
>
>                 Key: LUCENE-937
>                 URL: https://issues.apache.org/jira/browse/LUCENE-937
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: CachingTokenFilter.patch
>
>
> The wrong data structure was used for the CachingTokenFilter. It should be an 
> ArrayList rather than a LinkedList. There is a noticeable difference in speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to