[jira] Created: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

Tim Armstrong (JIRA) Tue, 10 Feb 2009 15:58:32 -0800

Improvements to contrib.benchmark for TREC collections
------------------------------------------------------


                 Key: LUCENE-1540
                 URL: https://issues.apache.org/jira/browse/LUCENE-1540
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/benchmark
    Affects Versions: 2.4
            Reporter: Tim Armstrong
            Priority: Minor


The benchmarking utilities for TREC test collections (http://trec.nist.gov) 
are quite limited and do not support some of the format variations found in 
older TREC collections.

I have been doing some benchmarking work with Lucene and have had to modify the 
package to support:
* Older TREC document formats, which the current parser fails on because of 
missing document headers.
* Variations in query format: a newline after the <title> tag confuses the 
query parser.
* Detecting and reading uncompressed text collections.
* Storing document numbers by default without storing the full text.
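
To illustrate the third item, here is a minimal sketch of one way to detect 
whether a collection file is gzip-compressed or plain text. It assumes files 
are either gzip or uncompressed, and it checks the two gzip magic bytes 
(0x1f 0x8b) at the start of the stream. The class and method names 
(TrecInputSketch, openMaybeGzipped, readAll) are hypothetical and are not part 
of contrib/benchmark or of any patch.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class, for illustration only.
class TrecInputSketch {

    /**
     * Peeks at the first two bytes of the stream. If they are the gzip
     * magic number (0x1f 0x8b), the stream is wrapped in a GZIPInputStream;
     * otherwise it is returned as-is (buffered), with no bytes consumed.
     */
    static InputStream openMaybeGzipped(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);              // remember the start of the stream
        int b1 = in.read();
        int b2 = in.read();
        in.reset();              // rewind so the caller sees every byte
        boolean gzipped = (b1 == 0x1f && b2 == 0x8b);
        return gzipped ? new GZIPInputStream(in) : in;
    }

    /** Reads the whole stream and decodes it as UTF-8 (assumed encoding). */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return new String(out.toByteArray(), "UTF-8");
    }
}
```

A caller would pass a FileInputStream for each collection file to 
openMaybeGzipped and hand the result to the document parser, so the parser 
itself does not need to know which kind of file it is reading.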

I can submit a patch if there is interest, although I will probably want to 
write unit tests for the new functionality first.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


