OK, I ran some benchmarks here. The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all Wikipedia content using LowerCaseTokenizer + StopFilter + PorterStemFilter. I think it's worth pursuing!
Here are the optimizations I tested:

* Change the core analyzers to re-use a single Token instance and re-use its char[] termBuffer (via a new method "boolean next(Token t)", so it's backwards compatible). A rough sketch of this idea is at the end of this note.

* For StopFilter, a new helper class (CharArraySet): a hash set that can key off of char[]'s directly, without having to new a String per token. There's a sketch of this at the end as well.

* Fix the analyzer to re-use the same tokenizer across documents & fields (rather than new'ing one every time).

I ran the tests with "java -server -Xmx1024M" on an Intel Core 2 Duo box running Debian Linux (2.6.18 kernel) with a RAID 5 IO system. I index all text (every single term) in Wikipedia, pulling from a single line file (I'm using the patch from LUCENE-947 that adds line-file creation & indexing to contrib/benchmark).

First I create a single large file that has one doc per line of Wikipedia content, using this alg:

  docs.dir=enwiki
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.DirDocMaker
  line.file.out=/lucene/wikifull.txt
  doc.maker.forever=false

  {WriteLineDoc()}: *

The resulting file is 8.4 GB and 3.2 million docs.

Then I indexed it using this alg:

  analyzer=org.apache.lucene.analysis.LowercaseStopPorterAnalyzer
  directory=FSDirectory
  ram.flush.mb=64
  max.field.length=2147483647
  compound=false
  max.buffered=70000
  doc.add.log.step=5000
  docs.file=/lucene/wikifull.txt
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
  doc.tokenized=true
  doc.maker.forever=false

  ResetSystemErase
  CreateIndex
  { "All"
    {AddDoc}: *
  }
  CloseIndex

  RepSumByPref All
  RepSumByPref AddDoc

The resulting index is 2.2 GB. The LowercaseStopPorterAnalyzer just looks like this:

  import java.io.Reader;
  import org.apache.lucene.analysis.*;

  public class LowercaseStopPorterAnalyzer extends Analyzer {
    private Tokenizer tokenizer;
    private TokenStream stream;

    // Build the tokenizer/filter chain once, then re-use it across
    // documents & fields by resetting the tokenizer onto each new Reader.
    public final TokenStream tokenStream(String fieldName, Reader reader) {
      if (tokenizer == null) {
        tokenizer = new LowerCaseTokenizer(reader);
        stream = new PorterStemFilter(new StopFilter(tokenizer, StopAnalyzer.ENGLISH_STOP_WORDS));
      } else
        tokenizer.reset(reader);
      return stream;
    }
  }

I then record the elapsed time reported by the "All" task. I ran each test twice and took the faster time:

  JDK 5 Trunk  21 min 41 sec
  JDK 5 New    18 min 54 sec -> 12.8% faster
  JDK 6 Trunk  21 min 43 sec
  JDK 6 New    17 min 59 sec -> 17.2% faster

It's rather odd that we see better gains in JDK 6 ... I had expected the reverse (assuming GC performance is better in JDK 6 than in JDK 5). I also think it's quite cool that we can index all of Wikipedia in 18 minutes :) That works out to ~8 MB/sec (8.4 GB in just under 18 minutes).

I will open an issue...

Mike
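
PS: here is a rough sketch of the Token re-use idea from the first bullet above. This is illustrative only, not the actual patch: the filter name is made up, and it assumes the patched Token exposes its termBuffer through termBuffer()/termLength() accessors and that the proposed "boolean next(Token t)" is added up the TokenStream chain:

  import java.io.IOException;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  public class ReusableLowerCaseFilter extends TokenFilter {
    public ReusableLowerCaseFilter(TokenStream input) {
      super(input);
    }

    // Proposed API: the caller allocates a single Token up front and
    // each stream in the chain refills it in place, so no Token (and
    // no String) is new'd per term.
    public boolean next(Token t) throws IOException {
      if (!input.next(t))      // hypothetical: upstream fills t's termBuffer
        return false;
      char[] buffer = t.termBuffer();
      int length = t.termLength();
      for (int i = 0; i < length; i++)
        buffer[i] = Character.toLowerCase(buffer[i]);
      return true;             // t was lowercased in place
    }
  }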
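
And a sketch of the CharArraySet idea from the second bullet (again just illustrative; the names and details here are made up): a small open-addressed hash set that hashes and compares char[] contents directly, so StopFilter can test a token's termBuffer without building a String:

  public class CharArraySet {

    private char[][] entries;
    private int count;

    public CharArraySet(int startSize) {
      // Power-of-two capacity, kept at most half full, so linear
      // probing always terminates at an empty slot.
      int size = 8;
      while (size < (startSize << 1))
        size <<= 1;
      entries = new char[size][];
    }

    public boolean contains(char[] text, int len) {
      return entries[getSlot(text, len)] != null;
    }

    public void add(char[] text) {
      if (((count + 1) << 1) > entries.length)
        rehash();
      int slot = getSlot(text, text.length);
      if (entries[slot] == null) {
        entries[slot] = text;
        count++;
      }
    }

    // Linear probing: walk from the hash slot until we find the entry
    // or hit an empty slot.
    private int getSlot(char[] text, int len) {
      int slot = getHashCode(text, len) & (entries.length - 1);
      while (entries[slot] != null && !equals(text, len, entries[slot]))
        slot = (slot + 1) & (entries.length - 1);
      return slot;
    }

    // Compare contents directly against the stored char[]; no String.
    private static boolean equals(char[] text, int len, char[] entry) {
      if (len != entry.length)
        return false;
      for (int i = 0; i < len; i++)
        if (text[i] != entry[i])
          return false;
      return true;
    }

    // Same recurrence as String.hashCode, but over a char[] slice.
    private static int getHashCode(char[] text, int len) {
      int code = 0;
      for (int i = 0; i < len; i++)
        code = code * 31 + text[i];
      return code;
    }

    private void rehash() {
      char[][] old = entries;
      entries = new char[old.length << 1][];
      count = 0;
      for (int i = 0; i < old.length; i++)
        if (old[i] != null)
          add(old[i]);
    }
  }

StopFilter would then test each token with something like stopWords.contains(t.termBuffer(), t.termLength()). Note the set keeps the char[]'s it is given by reference, so the stop words would be loaded once up front via add(word.toCharArray()).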