[ 
https://issues.apache.org/jira/browse/LUCENE-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-790:
-------------------------------

    Attachment: TrecDocMaker.patch

The attached TrecDocMaker.patch also contains the changes from the current
patch in LUCENE-788, because both patches modify ReutersDocMaker, so it is
sufficient to apply this patch only. I will add a comment to that effect in
788. Once this is committed, I will mark 788 as a duplicate of this issue.

Some TODO items are in byTask/Benchmark.java's javadocs - comments are welcome. 

> contrib/benchmark - few improvements and a bug fix
> --------------------------------------------------
>
>                 Key: LUCENE-790
>                 URL: https://issues.apache.org/jira/browse/LUCENE-790
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Other
>    Affects Versions: 2.1
>            Reporter: Doron Cohen
>         Assigned To: Doron Cohen
>            Priority: Minor
>             Fix For: 2.1
>
>         Attachments: TrecDocMaker.patch
>
>
> The byTask benchmark was slightly improved:
> 1. Fixed a bug in the "child-should-not-report" mechanism. If a task sequence
> contained only simple tasks it worked as expected (i.e. child tasks did not
> report times/memory), but if a child was itself a task sequence, its children
> would report, which they should not. This is fixed, so the property is now
> "penetrating" (inherited) all the way down.
> 2. Doc size control is now also possible for the Reuters doc maker (allowing
> N docs of C characters each to be indexed).
> 3. TrecDocMaker was added. It reads as input the .gz files used in TREC,
> e.g. the .gov data, which can be handy for benchmarking Lucene on these large
> collections. As with the Reuters collection, the doc maker scans the
> input directory for all the files and extracts documents from them.
> Here each input file contains multiple documents. Unlike the Reuters
> collection, we cannot provide a 'loader' for these collections; they are
> available from http://trec.nist.gov for research purposes.
> 4. A new BasicDocMaker abstract class handles most doc-maker tasks,
> including creating docs of a specific size, so adding new doc makers for
> other data is now much simpler.
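
The inheritance fix described in item 1 can be sketched roughly as follows. This is a minimal illustration only, not the code in the patch: the Task class, its reports flag, and the silenceChildren and reporters helpers are hypothetical names invented for this example, and the real byTask classes may be structured quite differently. The point it shows is that the "do not report" setting has to recurse through nested sequences rather than stop at direct children.

```java
import java.util.ArrayList;
import java.util.List;

public class ReportInheritanceSketch {

    /** Hypothetical stand-in for a benchmark task; NOT the real byTask API. */
    static class Task {
        final String name;
        final List<Task> children = new ArrayList<>();
        boolean reports = true; // every task reports unless silenced

        Task(String name, Task... kids) {
            this.name = name;
            for (Task k : kids) {
                children.add(k);
            }
        }

        /** Silence every descendant, recursing into nested sequences. */
        void silenceChildren() {
            for (Task child : children) {
                child.reports = false;
                child.silenceChildren(); // the step a non-inherited flag would miss
            }
        }
    }

    /** Collect the names of all tasks under root that would still report. */
    static List<String> reporters(Task root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Task t, List<String> out) {
        if (t.reports) {
            out.add(t.name);
        }
        for (Task c : t.children) {
            collect(c, out);
        }
    }

    public static void main(String[] args) {
        Task inner = new Task("innerSeq", new Task("addDoc"), new Task("search"));
        Task outer = new Task("outerSeq", new Task("createIndex"), inner);
        outer.silenceChildren();
        System.out.println(reporters(outer)); // prints [outerSeq]
    }
}
```

Without the recursive call, addDoc and search would still report in this sketch, which matches the behavior the issue describes as the bug.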

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


