Hi,
I'm trying to test LUCENE-843 (IndexWriter speedups) on Wikipedia
using the benchmark contrib framework plus the patch from
LUCENE-848.
I downloaded an older wikipedia export (the "latest" doesn't seem to
exist) and got it un-tar'd. The test I'd like to run is to use 4
threads to index all (exhaust) documents. I'm using the alg below.
One problem I hit is that DirDocMaker uses a single SimpleDateFormat
instance for parsing the dates at the top of each file.
SimpleDateFormat is not thread-safe, so I hit exceptions from there.
I think we just need to make that instance thread-local (I will open
an issue).
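Roughly what I have in mind for the fix is below. This is only a
sketch, not the actual DirDocMaker code: the class name, the parse()
helper and the date pattern are all assumptions for illustration, and
the real pattern should be whatever DirDocMaker already declares.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class ThreadLocalDateParser {
    // ASSUMED pattern, for illustration only; substitute DirDocMaker's real one.
    private static final String PATTERN = "dd-MMM-yyyy HH:mm:ss.SSS";

    // Each thread lazily gets its own SimpleDateFormat, so no instance is
    // ever shared between the indexing threads.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
        new ThreadLocal<SimpleDateFormat>() {
            @Override
            protected SimpleDateFormat initialValue() {
                return new SimpleDateFormat(PATTERN, Locale.US);
            }
        };

    // Hypothetical helper: parses with the calling thread's private formatter.
    public static Date parse(String text) {
        try {
            return FORMAT.get().parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }
}
```

The cost is one formatter per thread instead of one overall, which
seems cheaper than synchronizing every parse call.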
The question I have is: is this alg going to do what I want? I'd like
each doc in Wikipedia to be indexed only once, with 4 threads running.
I *think*, but I'm not sure, that the alg below actually indexes the
Wikipedia content 4 times over instead.
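To be concrete about the behavior I'm after: 4 threads pulling from
one shared doc source, so each doc is added exactly once. The plain
Java sketch below only illustrates those semantics. It is not
benchmark framework code, it says nothing about how DirDocMaker
behaves, and the class and method names are made up.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedSourceSketch {
    // Hypothetical helper: drains ONE shared queue of docs with numThreads
    // threads and returns the total number of "add" calls made.
    public static int drain(List<String> docs, int numThreads) {
        final Queue<String> source = new ConcurrentLinkedQueue<String>(docs);
        final AtomicInteger added = new AtomicInteger();
        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    // poll() hands each doc to exactly one thread;
                    // null means the source is exhausted.
                    while (source.poll() != null) {
                        added.incrementAndGet(); // stand-in for adding the doc to the index
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return added.get();
    }
}
```

With N docs and 4 threads, the total number of adds should be N, not
4*N. That is the outcome I want from the alg.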
Here's the alg:
max.field.length=2147483647
compound=false
analyzer=org.apache.lucene.analysis.SimpleAnalyzer
directory=FSDirectory
# ram.flush.mb=32
max.buffered=10000
doc.stored=true
doc.tokenized=true
doc.term.vector=true
doc.add.log.step=500
docs.dir=enwiki
doc.maker=org.apache.lucene.benchmark.byTask.feeds.DirDocMaker
# task at this depth or less would print when they start
task.max.depth.log=1
doc.maker.forever=false
# -------------------------------------------------------------------------------------
ResetSystemErase
CreateIndex
{[AddDoc(4000)]: 4} : *
CloseIndex
RepSumByPref AddDoc
Mike