OK, I ran some benchmarks here.

The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
17.2% speedup using Sun's JDK 6, on Linux.  This is indexing all
Wikipedia content using LowerCaseTokenizer + StopFilter +
PorterStemFilter.  I think it's worth pursuing!

Here are the optimizations I tested:

  * Change the core analyzers to re-use a single Token instance and
    re-use the char[] termBuffer (via a new method "boolean next(Token t)",
    so it's backwards compatible); see the first sketch after this list.

  * For the StopFilter I created a new helper class (CharArraySet) to
    build a hash set that can key off of char[]'s without having to
    allocate a new String; see the second sketch after this list.

  * Fix the analyzer to re-use the same tokenizer across documents &
    fields (rather than new'ing one every time); the
    LowercaseStopPorterAnalyzer shown further below does exactly this.
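
To illustrate the first item, here's a rough sketch of what a filter
using the proposed reuse API could look like.  This is illustrative
only, not the actual patch: the class name and the
termBuffer()/termLength() accessors are assumptions about how the
shared char[] might be exposed.

  import java.io.IOException;
  import org.apache.lucene.analysis.*;

  // Hypothetical example only -- names and signatures are assumptions,
  // not the actual patch.  The caller hands in one Token and the filter
  // overwrites its fields in place instead of allocating a new Token
  // (and a new String) per term.
  public class LowerCaseReuseFilter extends TokenFilter {

    public LowerCaseReuseFilter(TokenStream in) {
      super(in);
    }

    // Proposed reuse API: return false when the stream is exhausted,
    // otherwise fill in the caller-supplied Token.
    public boolean next(Token t) throws IOException {
      if (!input.next(t))      // assumes upstream also supports next(Token)
        return false;
      char[] buf = t.termBuffer();   // assumed accessor for the shared char[]
      int len = t.termLength();      // assumed accessor for the term length
      for (int i = 0; i < len; i++)
        buf[i] = Character.toLowerCase(buf[i]);
      return true;
    }
  }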
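
And here's a minimal sketch of the CharArraySet idea (again
illustrative only; the real class in the patch may differ): an
open-addressing hash set that checks membership of a char[] slice
directly, so the stop-word test never has to materialize a String per
token.

  // Illustrative sketch only -- not the actual CharArraySet from the
  // patch.  Open addressing over a power-of-two table; lookups hash the
  // char[] contents directly, so no String is created per token.
  public class SimpleCharArraySet {

    private char[][] entries = new char[16][];
    private int count;

    public void add(String word) {
      if (count + 1 > entries.length / 2)
        rehash();
      char[] w = word.toCharArray();
      int pos = slot(w, w.length);
      if (entries[pos] == null) {
        entries[pos] = w;
        count++;
      }
    }

    // The check a stop filter would make against each token's buffer.
    public boolean contains(char[] buf, int len) {
      return entries[slot(buf, len)] != null;
    }

    private int slot(char[] buf, int len) {
      int code = 0;
      for (int i = 0; i < len; i++)
        code = 31 * code + buf[i];
      int pos = code & (entries.length - 1);
      while (entries[pos] != null && !sameChars(entries[pos], buf, len))
        pos = (pos + 1) & (entries.length - 1);
      return pos;
    }

    private boolean sameChars(char[] stored, char[] buf, int len) {
      if (stored.length != len)
        return false;
      for (int i = 0; i < len; i++)
        if (stored[i] != buf[i])
          return false;
      return true;
    }

    private void rehash() {
      char[][] old = entries;
      entries = new char[2 * old.length][];
      for (int i = 0; i < old.length; i++)
        if (old[i] != null)
          entries[slot(old[i], old[i].length)] = old[i];
    }
  }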

I ran tests with "java -server -Xmx1024M", running on an Intel Core 2
Duo box with Debian Linux (2.6.18 kernel) and a RAID 5 IO system.

I index all text (every single term) in Wikipedia, pulling from a
single line file (I'm using the patch from LUCENE-947 that adds
line-file creation & indexing to contrib/benchmark).

First I create a single large file that has one doc per line from
Wikipedia content, using this alg:

  docs.dir=enwiki
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.DirDocMaker
  line.file.out=/lucene/wikifull.txt
  doc.maker.forever=false
  {WriteLineDoc()}: * 

Resulting file is 8.4 GB and 3.2 million docs.  Then I indexed it
using this alg:

  analyzer=org.apache.lucene.analysis.LowercaseStopPorterAnalyzer
  directory=FSDirectory
  ram.flush.mb=64
  max.field.length=2147483647
  compound=false
  max.buffered=70000
  doc.add.log.step=5000
  docs.file=/lucene/wikifull.txt
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
  doc.tokenized=true
  doc.maker.forever=false

  ResetSystemErase
  CreateIndex
  { "All"
    {AddDoc}: *
  }

  CloseIndex

  RepSumByPref All
  RepSumByPref AddDoc

Resulting index is 2.2 GB.

The LowercaseStopPorterAnalyzer just looks like this:

  import java.io.Reader;
  import org.apache.lucene.analysis.*;

  public class LowercaseStopPorterAnalyzer extends Analyzer {

    Tokenizer tokenizer;
    TokenStream stream;

    public final TokenStream tokenStream(String fieldName, Reader reader) {
      if (tokenizer == null) {
        // Build the chain once; re-use it for every document & field.
        tokenizer = new LowerCaseTokenizer(reader);
        stream = new PorterStemFilter(new StopFilter(tokenizer,
                     StopAnalyzer.ENGLISH_STOP_WORDS));
      } else {
        // Point the cached tokenizer at the new Reader instead of
        // allocating a new chain.  This makes the analyzer stateful, so
        // it assumes one Analyzer instance per thread (fine for this
        // single-threaded benchmark run).
        tokenizer.reset(reader);
      }
      return stream;
    }
  }

I then record the elapsed time reported by the "All" task.  I ran each
test twice and took the faster time:

  JDK 5 Trunk  21 min 41 sec
  JDK 5   New  18 min 54 sec
    -> 12.8% faster

  JDK 6 Trunk  21 min 43 sec
  JDK 6   New  17 min 59 sec
    -> 17.2% faster

It's rather odd that we see better gains in JDK 6 ... I had expected
the reverse (assuming GC performance is better in JDK 6 than JDK 5).

I also think it's quite cool that we can index all of Wikipedia in 18
minutes :)  That works out to ~8 MB/sec (8.4 GB / ~18 minutes).

I will open an issue...

Mike
