Hi,

I'm trying to test LUCENE-843 (IndexWriter speedups) on Wikipedia
using the benchmark contrib framework plus the patch from
LUCENE-848.

I downloaded an older wikipedia export (the "latest" doesn't seem to
exist) and got it un-tar'd.  The test I'd like to run is to use 4
threads to index all (exhaust) documents.  I'm using the alg below.

One problem I hit is that DirDocMaker uses a single SimpleDateFormat
instance for parsing the dates at the top of each file, but
SimpleDateFormat is not thread-safe, so I hit exceptions from there.
I think we just need to make that instance thread-local (I will open
an issue).
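
For reference, here is a minimal sketch of the thread-local approach.
It is not the actual DirDocMaker code: the class name, the method name
and the date pattern are all made up for illustration, and the real
pattern used by DirDocMaker may differ.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper, not part of the benchmark package.
final class ThreadLocalDateParser {
    // Assumed pattern for illustration only.
    private static final String PATTERN = "dd-MMM-yyyy HH:mm:ss.SSS";

    // Each thread lazily gets its own SimpleDateFormat, so no instance
    // is ever shared between threads.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
        ThreadLocal.withInitial(() -> new SimpleDateFormat(PATTERN, Locale.US));

    // Parses a date using the calling thread's private formatter.
    static Date parse(String text) {
        try {
            return FORMAT.get().parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }
}
```

With this, 4 indexing threads calling ThreadLocalDateParser.parse(...)
would each use their own formatter, at the cost of one
SimpleDateFormat per thread.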

The question I have is: is this alg going to do what I want?  I'd like
each doc in Wikipedia to be indexed exactly once, with 4 threads
running.  I *think*, but I'm not sure, that the alg below actually
indexes the Wikipedia content 4 times over instead.
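
To make the two possible outcomes concrete, here is a toy model in
plain Java.  It does not use the benchmark framework and says nothing
about which behavior the alg really has; the class and method names
are made up.  With one shared feed each doc is handled once in total;
with one feed per thread each doc is handled once per thread.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical toy model, unrelated to the real benchmark classes.
final class SharedFeedDemo {
    // All threads pull from ONE iterator: every doc is handled once in total.
    static int indexShared(List<String> docs, int threads) {
        final Iterator<String> feed = docs.iterator();
        final AtomicInteger handled = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                while (true) {
                    synchronized (feed) {      // guard the shared iterator
                        if (!feed.hasNext()) return;
                        feed.next();
                    }
                    handled.incrementAndGet(); // stands in for AddDoc
                }
            });
            workers[i].start();
        }
        joinAll(workers);
        return handled.get();
    }

    // Each thread walks its OWN iterator: every doc is handled once per thread.
    static int indexPerThread(List<String> docs, int threads) {
        final AtomicInteger handled = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (String doc : docs) {
                    handled.incrementAndGet(); // stands in for AddDoc
                }
            });
            workers[i].start();
        }
        joinAll(workers);
        return handled.get();
    }

    private static void joinAll(Thread[] workers) {
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }
}
```

So a quick sanity check on the real run would be to compare the number
of docs in the finished index against the number of files under
docs.dir.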

Here's the alg:

max.field.length=2147483647
compound=false

analyzer=org.apache.lucene.analysis.SimpleAnalyzer
directory=FSDirectory
# ram.flush.mb=32
max.buffered=10000
doc.stored=true
doc.tokenized=true
doc.term.vector=true
doc.add.log.step=500

docs.dir=enwiki

doc.maker=org.apache.lucene.benchmark.byTask.feeds.DirDocMaker

# tasks at this depth or less print a message when they start
task.max.depth.log=1
doc.maker.forever=false

# -------------------------------------------------------------------------------------

ResetSystemErase
CreateIndex
{[AddDoc(4000)]: 4} : *
CloseIndex

RepSumByPref AddDoc

Mike
