Grant Ingersoll <[EMAIL PROTECTED]> wrote on 18/03/2007 10:16:14:

> I'm using contrib/benchmark to do some tests for my ApacheCon talk
> and have some questions.
>
> 1. In looking at micro-standard.alg, it seems like not all braces are
> closed. Is a line ending a separator too?
'>' can be used as a closing character in place of either '}' or ']', with the
semantics: "do not collect/report separate statistics for the contained tasks".
See "Statistic recording elimination" in
http://lucene.apache.org/java/docs/api/org/apache/lucene/benchmark/byTask/package-summary.html

> 2. Is there any way to dump out what params are supported by the
> various tasks? I am esp. uncertain on the Search related tasks.

Search related tasks do not take args. Perhaps a task should throw an
exception if a param is set but not supported - I think I'll add that.
Currently only AddDoc, DeleteDoc and SetProp take args. The section
"Command parameter" in
http://lucene.apache.org/java/docs/api/org/apache/lucene/benchmark/byTask/package-summary.html
which describes this is incomplete - I will fix it to reflect that.
Which query arguments do you have in mind?

> 3. Is there any way to dump out the stats as a CSV file or something?
> Would I implement a Task for this? Ultimately, I want to be able to
> create a graph in Excel that shows tradeoffs between speed and memory.

Yes, implementing a report task would be the way... but when I look at how I
implemented these reports, all the work is done in the class Points. Seems it
should be modified a little, with more thought given to making it easier to
extend reports.

> 4. Is there a way to set how many tabs occur between columns in the
> final report? The merge and buffer factors get hard to read for
> larger values.

There's no general tabbing control - it can be added if required - but for the
automatically added columns it is not needed: just modify the name of the
column and it would fit, e.g. use "merge:10:100" to get a 5 character wide
column, or "merging:10:100" for a 7 character one, etc. (Also see "Index work
parameters" under "Benchmark properties" in
http://lucene.apache.org/java/docs/api/org/apache/lucene/benchmark/byTask/package-summary.html )

> 5. Below is my "alg" file, any tips? What I am trying to do is show
> the tradeoffs of merge factor and max buffered and how it relates to
> memory and indexing time. I want to process all the documents in the
> Reuters benchmark collection, not the 2000 in the micro-standard. I
> don't want any pauses and for now I am happy doing things in serial.
> I think it is doing what I want, but am not 100% certain.

Yes, it seems correct to me. What I usually do to verify a new alg is to run
it first with very small numbers - e.g. 10 instead of 22000, etc. - and
examine the log. A few comments:

- You can specify a larger number than 22000 and the DocMaker will iterate
  and create new docs from the same input again.

- Since you are interested in memory stats: the fact that all the rounds run
  in a single program, i.e. the same JVM run, usually means that what you see
  is very much dependent on the GC behavior of the specific VM you are using.
  If it does not release memory back to the OS (most likely), you would not
  be able to notice that round i+1 used less memory than round i. For
  something like this it would probably be better to put the "round" logic in
  an ant script, invoking each round in a separate new exec. But then things
  get more complicated for producing a final stats report containing all
  rounds. What do you think about this?

- It seems you are only interested in the indexing performance, so you can
  remove (or comment out) the search part.

- If you are also interested in the search part, note that as written, the
  four last search related tasks always use a new reader (opening/closing 950
  readers in this test); see the sketch below.
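For example, here is a rough sketch (the block names "WarmSameRdr" and
"SrchTrvSameRdr" and the repetition counts are made up for illustration) of
wrapping those tasks between OpenReader / CloseReader, the same way your
"SearchSameRdr" block does, so that they all share a single reader per round:

    OpenReader
    { "WarmSameRdr" Warm > : 50
    { "SrchTrvSameRdr" SearchTrav > : 300
    CloseReader

Of course, if the point of those tasks is to measure the cost of opening a
new reader each time, leaving them as they are is the right thing to do.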
> ----------- alg file --------
>
> #last value is more than all the docs in reuters
> merge.factor=mrg:10:100:1000:5000:10:10:10:10:100:1000
> max.buffered=buf:10:10:10:10:100:1000:10000:21580:21580:21580
> compound=true
>
> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
> directory=FSDirectory
> #directory=RamDirectory
>
> doc.stored=true
> doc.tokenized=true
> doc.term.vector=false
> doc.add.log.step=1000
>
> docs.dir=reuters-out
> #docs.dir=reuters-111
>
> #doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
> doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>
> #query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker
> query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker
>
> # task at this depth or less would print when they start
> task.max.depth.log=2
>
> log.queries=true
> # -------------------------------------------------------------------------------------
>
> { "Rounds"
>
>     ResetSystemErase
>
>     { "Populate"
>         CreateIndex
>         { "MAddDocs" AddDoc > : 22000
>         Optimize
>         CloseIndex
>     }
>
>     OpenReader
>     { "SearchSameRdr" Search > : 5000
>     CloseReader
>
>     { "WarmNewRdr" Warm > : 50
>
>     { "SrchNewRdr" Search > : 500
>
>     { "SrchTrvNewRdr" SearchTrav > : 300
>
>     { "SrchTrvRetNewRdr" SearchTravRet > : 100
>
>     NewRound
>
> } : 10
>
> RepSumByName
> RepSumByPrefRound MAddDocs
>
>
> Thanks,
> Grant
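P.S. - to make (1) concrete with a line from your own alg file:

    { "MAddDocs" AddDoc > : 22000

Here the '>' ends the "MAddDocs" sequence exactly where a '}' would, except
that separate statistics are not collected/reported for the contained AddDoc
tasks - only for the sequence as a whole. That is also why some of the
"braces" in micro-standard.alg appear unclosed.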