Mike, I didn't anticipate this use case and I think it
would not work correctly. I'll look into this.

Anyhow, I think it would not work as you expect.

It seems what you want is to have 4 threads, adding docs in
parallel, until the doc maker is exhausted.

But this line:
  {[AddDoc(4000)]: 4} : *

Reads as -
  Repeatedly until exhausted:
     Create & Start 4 threads (in parallel),
       each adding 1 doc of size 4000;
     Wait for those 4 threads to complete.
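In plain Java, the semantics are roughly this (a hypothetical sketch, not
the actual benchmark TaskSequence code; the 10-doc counter just stands in
for the shared doc maker):

```java
// Hypothetical sketch of what {[AddDoc(4000)]: 4} : * does --
// not the actual benchmark TaskSequence code.
import java.util.concurrent.atomic.AtomicInteger;

public class ForkJoinLoop {

  // Stand-in for the single shared doc maker; here it holds 10 docs.
  static final AtomicInteger remainingDocs = new AtomicInteger(10);

  public static void main(String[] args) throws InterruptedException {
    int rounds = 0;
    while (remainingDocs.get() > 0) {        // the outer ": *" loop
      Thread[] threads = new Thread[4];      // "{...}: 4" -- 4 parallel tasks
      for (int i = 0; i < 4; i++) {
        threads[i] = new Thread(new Runnable() {
          public void run() {
            remainingDocs.getAndDecrement(); // each thread adds one doc
          }
        });
        threads[i].start();
      }
      for (int i = 0; i < 4; i++) {
        threads[i].join();                   // wait for the 4 to complete
      }
      rounds++;
    }
    System.out.println("rounds=" + rounds);  // prints rounds=3 for 10 docs
  }
}
```

Note the repeated thread creation and joining on every round - that is the
overhead (and the semantic difference) compared to what you want.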

Now, this is not what you are after, is it? I think
you would like just 4 threads to do all the work.

It seems what you are really after is this:
   [ { AddDoc } : * ] : 4

This reads as:
  Create 4 threads, each adding docs until exhaustion.

Since there is a single system-benchmark-wide doc-maker, all 4
threads use it, and when it is exhausted, all 4 will be done.
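Again as a rough Java sketch (hypothetical names, with a counter standing
in for the shared doc maker):

```java
// Hypothetical sketch of what [ { AddDoc } : * ] : 4 does --
// 4 threads, each pulling from the single shared doc maker until empty.
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelExhaust {

  static final AtomicInteger remainingDocs = new AtomicInteger(10);
  static final AtomicInteger docsAdded = new AtomicInteger(0);

  public static void main(String[] args) throws InterruptedException {
    Thread[] threads = new Thread[4];        // "[...] : 4" -- created once
    for (int i = 0; i < 4; i++) {
      threads[i] = new Thread(new Runnable() {
        public void run() {
          // "{ AddDoc } : *" -- loop until the doc maker is exhausted
          while (remainingDocs.getAndDecrement() > 0) {
            docsAdded.incrementAndGet();     // AddDoc stand-in
          }
        }
      });
      threads[i].start();
    }
    for (int i = 0; i < 4; i++) {
      threads[i].join();
    }
    // Every doc is indexed exactly once, by whichever thread claimed it.
    System.out.println("docsAdded=" + docsAdded.get()); // prints docsAdded=10
  }
}
```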

I tried it this way and it works as I expected (except
for that DateFormat bug, see below). Can you try it like this
and let me know if it works for you?

I think your variation of this exposes a bug in the
benchmark - it will just loop forever because the parallel
sequence would mask the exhaustion from the outer sequential
sequence. I opened LUCENE-941 for this and am looking into it.

Doron

"Michael McCandless" <[EMAIL PROTECTED]> wrote on 22/06/2007
13:18:10:
>
> Hi,
>
> I'm trying to test LUCENE-843 (IndexWriter speedups) on Wikipedia
> using the benchmark contrib framework plus the patch from
> LUCENE-848.
>
> I downloaded an older wikipedia export (the "latest" doesn't seem to
> exist) and got it un-tar'd.  The test I'd like to run is to use 4
> threads to index all (exhaust) documents.  I'm using the alg below.
>
> One problem I hit is the DirDocMaker uses a SimpleDateFormat instance
> for parsing the dates at the top of each file, but, this is not
> thread-safe and so I hit exceptions from there.  I think we just need
> to make that instance thread-local (I will open an issue).

Yes, that's a bug...  It is also in some already committed parts
of the benchmark. I opened LUCENE-940 for this.
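For the record, the fix I have in mind is along these lines (illustrative
only - the field name and date pattern here are not the actual DirDocMaker
code):

```java
// Sketch of the thread-local fix for the shared SimpleDateFormat
// (illustrative only -- not the actual DirDocMaker field or pattern).
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class DateParsing {

  // SimpleDateFormat is not thread-safe, so give each thread its own copy.
  private static final ThreadLocal<SimpleDateFormat> DATE_FORMAT =
      new ThreadLocal<SimpleDateFormat>() {
        protected SimpleDateFormat initialValue() {
          return new SimpleDateFormat("dd-MMM-yyyy HH:mm:ss", Locale.US);
        }
      };

  static Date parseDate(String s) throws ParseException {
    return DATE_FORMAT.get().parse(s);
  }
}
```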

>
> The question I have is: is this alg going to do what I want?  I'd like
> each doc in Wikipedia to be indexed only once, with 4 threads running.
> I *think* but I'm not sure that the alg below actually indexes the
> Wikipedia content 4 times over instead?
>
> Here's the alg:
>
> max.field.length=2147483647
> compound=false
>
> analyzer=org.apache.lucene.analysis.SimpleAnalyzer
> directory=FSDirectory
> # ram.flush.mb=32
> max.buffered=10000
> doc.stored=true
> doc.tokenized=true
> doc.term.vector=true
> doc.add.log.step=500
>
> docs.dir=enwiki
>
> doc.maker=org.apache.lucene.benchmark.byTask.feeds.DirDocMaker
>
> # task at this depth or less would print when they start
> task.max.depth.log=1
> doc.maker.forever=false
>
> # -------------------------------------------------------------------------------------
>
> ResetSystemErase
> CreateIndex
> {[AddDoc(4000)]: 4} : *
> CloseIndex
>
> RepSumByPref AddDoc
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

