Hi,

I'm trying to test LUCENE-843 (IndexWriter speedups) on Wikipedia using the benchmark contrib framework plus the patch from LUCENE-848.
I downloaded an older Wikipedia export (the "latest" doesn't seem to exist) and got it un-tar'd. The test I'd like to run uses 4 threads to index all documents (exhaust), using the alg below.

One problem I hit is that DirDocMaker uses a single SimpleDateFormat instance to parse the dates at the top of each file. SimpleDateFormat is not thread-safe, so I hit exceptions from there when running with multiple threads. I think we just need to make that instance thread-local (I will open an issue).

The question I have is: is this alg going to do what I want? I'd like each doc in Wikipedia to be indexed only once, with 4 threads running. I *think*, but I'm not sure, that the alg below actually indexes the Wikipedia content 4 times over instead.

Here's the alg:

max.field.length=2147483647
compound=false

analyzer=org.apache.lucene.analysis.SimpleAnalyzer
directory=FSDirectory
# ram.flush.mb=32
max.buffered=10000
doc.stored=true
doc.tokenized=true
doc.term.vector=true
doc.add.log.step=500

docs.dir=enwiki

doc.maker=org.apache.lucene.benchmark.byTask.feeds.DirDocMaker

# task at this depth or less would print when they start
task.max.depth.log=1

doc.maker.forever=false

# -------------------------------------------------------------------------------------

ResetSystemErase
CreateIndex
{[AddDoc(4000)]: 4} : *
CloseIndex

RepSumByPref AddDoc

Mike
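
For reference, the thread-local idea for the SimpleDateFormat problem described above could be sketched roughly as follows. This is only an illustration, not DirDocMaker's actual code: the class name ThreadLocalDateParser and the date pattern are hypothetical placeholders, and the real pattern would be whatever DirDocMaker already uses. It also assumes Java 8 or later for ThreadLocal.withInitial.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper, not part of the benchmark contrib package.
final class ThreadLocalDateParser {
    // Placeholder pattern (an assumption); substitute the pattern the doc maker really uses.
    static final String PATTERN = "yyyy-MM-dd HH:mm:ss";

    // Each thread lazily gets its own SimpleDateFormat, so no instance is ever shared.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
            ThreadLocal.withInitial(() -> new SimpleDateFormat(PATTERN, Locale.US));

    // Parses with the calling thread's private formatter.
    static Date parse(String text) {
        try {
            return FORMAT.get().parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Four worker threads parse concurrently, mirroring the 4-thread indexing run.
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < 1000; n++) {
                    parse("2007-03-20 12:34:56");
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("parsed without exceptions");
    }
}
```

The trade-off versus synchronizing on one shared formatter is one extra SimpleDateFormat per thread in exchange for no lock contention between the indexing threads.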