Grant Ingersoll <[EMAIL PROTECTED]> wrote on 18/03/2007 10:16:14:

> I'm using contrib/benchmark to do some tests for my ApacheCon talk
> and have some questions.
>
> 1. In looking at micro-standard.alg, it seems like not all braces are
> closed. Is a line ending a separator too?
'>' can be used as a closing character in place of either '}' or ']', with the
semantics: "do not collect/report separate statistics for the contained tasks".
See "Statistic recording elimination" in
http://lucene.apache.org/java/docs/api/org/apache/lucene/benchmark/byTask/package-summary.html

> 2. Is there any way to dump out what params are supported by the
> various tasks? I am esp. uncertain on the Search related tasks.

Search related tasks do not take args. Perhaps a task should throw an
exception if a param is set but not supported - I think I'll add that.
Currently only AddDoc, DeleteDoc and SetProp take args. The section
"Command parameter" in
http://lucene.apache.org/java/docs/api/org/apache/lucene/benchmark/byTask/package-summary.html
which describes this is incomplete - I will fix it to reflect that.
Which query arguments do you have in mind?

> 3. Is there any way to dump out the stats as a CSV file or something?
> Would I implement a Task for this? Ultimately, I want to be able to
> create a graph in Excel that shows tradeoffs between speed and memory.

Yes, implementing a report task would be the way... but when I look at how I
implemented these reports, all the work is done in the class Points. Seems it
should be modified a little, with more thought given to making it easier to
extend reports.

> 4. Is there a way to set how many tabs occur between columns in the
> final report? The merge and buffer factors get hard to read for
> larger values.

There's no general tabbing control - it can be added if required - but for the
automatically added columns it is not needed: just modify the name of the
column and it would fit, e.g. use "merge:10:100" to get a 5 character wide
column, or "merging:10:100" for a 7 character one, etc. (Also see "Index work
parameters" under "Benchmark properties" in
http://lucene.apache.org/java/docs/api/org/apache/lucene/benchmark/byTask/package-summary.html )

> 5. Below is my "alg" file, any tips? What I am trying to do is show
> the tradeoffs of merge factor and max buffered and how it relates to
> memory and indexing time. I want to process all the documents in the
> Reuters benchmark collection, not the 2000 in the micro-standard. I
> don't want any pauses and for now I am happy doing things in serial.
> I think it is doing what I want, but am not 100% certain.

Yes, it seems correct to me. What I usually do to verify a new alg is to run
it first with very small numbers - e.g. 10 instead of 22000, etc. - and
examine the log. A few comments:

- You can specify a larger number than 22000 and the DocMaker will iterate
  and create new docs from the same input again.

- Since you are interested in memory stats: the fact that all the rounds run
  in a single program, i.e. the same JVM run, usually means that what you see
  is very much dependent on the GC behavior of the specific VM you are using.
  If it does not release memory back to the OS (most likely), you would not
  be able to notice that round i+1 used less memory than round i. For
  something like this it would probably be better to put the "round" logic in
  an ant script, invoking each round in a separate new exec. But then things
  get more complicated for producing a final stats report containing all
  rounds. What do you think about this?

- It seems you are only interested in the indexing performance, so you can
  remove (or comment out) the search part.

- If you are also interested in the search part, note that as written, the
  four last search related tasks always use a new reader (opening/closing 950
  readers in this test); see the sketch below.
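For example, here is a rough sketch (the block names "WarmSameRdr" and
"SrchTrvSameRdr" and the repetition counts are made up for illustration) of
wrapping those tasks between OpenReader / CloseReader, the same way your
"SearchSameRdr" block does, so that they all share a single reader per round:

    OpenReader
    { "WarmSameRdr" Warm > : 50
    { "SrchTrvSameRdr" SearchTrav > : 300
    CloseReader

Of course, if the point of those tasks is to measure the cost of opening a
new reader each time, leaving them as they are is the right thing to do.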
> ----------- alg file --------
>
> #last value is more than all the docs in reuters
> merge.factor=mrg:10:100:1000:5000:10:10:10:10:100:1000
> max.buffered=buf:10:10:10:10:100:1000:10000:21580:21580:21580
> compound=true
>
> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
> directory=FSDirectory
> #directory=RamDirectory
>
> doc.stored=true
> doc.tokenized=true
> doc.term.vector=false
> doc.add.log.step=1000
>
> docs.dir=reuters-out
> #docs.dir=reuters-111
>
> #doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
> doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>
> #query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker
> query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker
>
> # task at this depth or less would print when they start
> task.max.depth.log=2
>
> log.queries=true
> # -------------------------------------------------------------------------------------
>
> { "Rounds"
>
>     ResetSystemErase
>
>     { "Populate"
>         CreateIndex
>         { "MAddDocs" AddDoc > : 22000
>         Optimize
>         CloseIndex
>     }
>
>     OpenReader
>     { "SearchSameRdr" Search > : 5000
>     CloseReader
>
>     { "WarmNewRdr" Warm > : 50
>
>     { "SrchNewRdr" Search > : 500
>
>     { "SrchTrvNewRdr" SearchTrav > : 300
>
>     { "SrchTrvRetNewRdr" SearchTravRet > : 100
>
>     NewRound
>
> } : 10
>
> RepSumByName
> RepSumByPrefRound MAddDocs
>
>
> Thanks,
> Grant
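P.S. - to make (1) concrete with a line from your own alg file:

    { "MAddDocs" AddDoc > : 22000

Here the '>' ends the "MAddDocs" sequence exactly where a '}' would, except
that separate statistics are not collected/reported for the contained AddDoc
tasks - only for the sequence as a whole. That is also why some of the
"braces" in micro-standard.alg appear unclosed.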