Mike, I didn't anticipate this use case, and I don't think it
works correctly. I'll look into it. In any case, it would not
do what you expect.
It seems what you want is to have 4 threads, adding docs in
parallel, until the doc maker is exhausted.
But this line:
  {[AddDoc(4000)]: 4} : *
reads as:
  Repeatedly, until exhausted:
    Create & start 4 threads (in parallel),
    each adding 1 doc of size 4000;
    wait for those 4 threads to complete.
Now, this is not what you are after, is it? I think
you would like just 4 threads to do all the work.
It seems what you are really after is this:
  [ { AddDoc } : * ] : 4
This reads as:
Create 4 threads, each adding docs until exhaustion.
Since there is a single, benchmark-wide doc maker, all 4
threads share it, and when it is exhausted, all 4 will be done.
I tried it this way and it works as I expected (except for
that DateFormat bug, see below). Can you try it like this
and let me know if it works for you?
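For concreteness, here is a rough sketch of how the task section
of your alg would look with just that one line swapped (everything
else taken unchanged from your alg quoted below; I have not re-run
it with your exact settings):

  ResetSystemErase
  CreateIndex
  [ { AddDoc } : * ] : 4
  CloseIndex

  RepSumByPref AddDoc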
I think your variation exposes a bug in the benchmark - it
will just loop forever, because the parallel sequence masks
the exhaustion from the outer sequential sequence. I opened
LUCENE-941 for this and am looking into it.
Doron
"Michael McCandless" <[EMAIL PROTECTED]> wrote on 22/06/2007
13:18:10:
>
> Hi,
>
> I'm trying to test LUCENE-843 (IndexWriter speedups) on Wikipedia
> using the benchmark contrib framework plus the patch from
> LUCENE-848.
>
> I downloaded an older wikipedia export (the "latest" doesn't seem to
> exist) and got it un-tar'd. The test I'd like to run is to use 4
> threads to index all (exhaust) documents. I'm using the alg below.
>
> One problem I hit is that DirDocMaker uses a SimpleDateFormat instance
> for parsing the dates at the top of each file, but this is not
> threadsafe, so I hit exceptions from there. I think we just need
> to make that instance thread local (I will open an issue).
Yes, that's a bug... It is also in some already-committed parts
of the benchmark. I opened LUCENE-940 for this.
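The fix I have in mind is roughly along these lines - just a sketch,
not the actual patch (the class name and the date pattern here are
made up for illustration; the real pattern is whatever DirDocMaker
actually uses):

  import java.text.ParseException;
  import java.text.SimpleDateFormat;
  import java.util.Date;
  import java.util.Locale;

  // Sketch: give each thread its own SimpleDateFormat instance,
  // since SimpleDateFormat is not thread safe.
  public class ThreadLocalDateParser {
    private static final ThreadLocal<SimpleDateFormat> DATE_FORMAT =
        new ThreadLocal<SimpleDateFormat>() {
          protected SimpleDateFormat initialValue() {
            // illustrative pattern only
            return new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss", Locale.US);
          }
        };

    public static Date parseDate(String s) throws ParseException {
      // each calling thread gets (and reuses) its own format instance
      return DATE_FORMAT.get().parse(s);
    }
  }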
>
> The question I have is: is this alg going to do what I want? I'd like
> each doc in Wikipedia to be indexed only once, with 4 threads running.
> I *think*, but I'm not sure, that the alg below actually indexes the
> Wikipedia content 4 times over instead.
>
> Here's the alg:
>
> max.field.length=2147483647
> compound=false
>
> analyzer=org.apache.lucene.analysis.SimpleAnalyzer
> directory=FSDirectory
> # ram.flush.mb=32
> max.buffered=10000
> doc.stored=true
> doc.tokenized=true
> doc.term.vector=true
> doc.add.log.step=500
>
> docs.dir=enwiki
>
> doc.maker=org.apache.lucene.benchmark.byTask.feeds.DirDocMaker
>
> # task at this depth or less would print when they start
> task.max.depth.log=1
> doc.maker.forever=false
>
> # -------------------------------------------------------------------------------------
>
> ResetSystemErase
> CreateIndex
> {[AddDoc(4000)]: 4} : *
> CloseIndex
>
> RepSumByPref AddDoc
>
> Mike
>