[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

Shai Erera (JIRA) Tue, 11 Jan 2011 04:18:15 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980080#action_12980080
 ]


Shai Erera commented on LUCENE-1540:
------------------------------------

Perhaps instead of separate ContentSource implementations, we can have 
TrecContentSource use a TrecDocParser (new class) or something, for parsing 
different formats. We can then have Gov2Parser, LATimesParser etc. for parsing 
the different formats, and TrecContentSource would use the appropriate parser 
per the path detected, as you suggest.

In addition, we can have it use a specific format through a configuration 
parameter, in which case it will not attempt to auto-detect the right format, 
but always use the specified parser. Though Benchmark (as well as all other 
contrib / modules) does not need to maintain back-compat, I think that if we go 
with this approach, it can default to using the Gov2Parser, and thus you 
achieve backwards compatibility.
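
To make the proposal concrete, here is a rough sketch of how the selection 
could work. Nothing here is existing Benchmark code: the TrecDocParser 
interface, its parse() signature, the selectParser() helper, the format names 
"gov2" and "latimes", and the "latimes" path heuristic are all assumptions 
made for illustration only. A configured format wins over auto-detection, and 
Gov2Parser is the fallback when nothing else matches.

```java
import java.util.Locale;

/** Hypothetical sketch only; these are not the actual Lucene benchmark classes. */
public class TrecParserSketch {

  /** Assumed interface: turns the raw text of one TREC document into a parsed form. */
  interface TrecDocParser {
    String parse(String rawDoc);
  }

  /** Placeholder parser: real GOV2 parsing logic would live here. */
  static class Gov2Parser implements TrecDocParser {
    public String parse(String rawDoc) {
      return "gov2:" + rawDoc.trim();
    }
  }

  /** Placeholder parser: real LA Times parsing logic would live here. */
  static class LATimesParser implements TrecDocParser {
    public String parse(String rawDoc) {
      return "latimes:" + rawDoc.trim();
    }
  }

  /** Maps an explicitly configured format name to a parser. */
  static TrecDocParser byName(String name) {
    switch (name.toLowerCase(Locale.ROOT)) {
      case "gov2":
        return new Gov2Parser();
      case "latimes":
        return new LATimesParser();
      default:
        throw new IllegalArgumentException("Unknown TREC format: " + name);
    }
  }

  /**
   * Picks a parser. An explicit configuration value disables auto-detection;
   * otherwise the path is inspected, and Gov2Parser is the default.
   */
  static TrecDocParser selectParser(String configuredFormat, String path) {
    if (configuredFormat != null) {
      return byName(configuredFormat);
    }
    String p = (path == null) ? "" : path.toLowerCase(Locale.ROOT);
    if (p.contains("latimes")) { // assumed naming convention, not verified
      return new LATimesParser();
    }
    return new Gov2Parser(); // default keeps the current behaviour
  }

  public static void main(String[] args) {
    TrecDocParser detected = selectParser(null, "/data/trec/latimes/file1.gz");
    TrecDocParser forced = selectParser("gov2", "/data/trec/latimes/file1.gz");
    System.out.println(detected.getClass().getSimpleName());
    System.out.println(forced.getClass().getSimpleName());
  }
}
```

With this shape, TrecContentSource would only need to call something like 
selectParser() once per input file and delegate to the returned parser, so 
supporting another collection means adding one more parser class.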

> Improvements to contrib.benchmark for TREC collections
> ------------------------------------------------------
>
>                 Key: LUCENE-1540
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1540
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Tim Armstrong
>            Assignee: Doron Cohen
>            Priority: Minor
>
> The benchmarking utilities for  TREC test collections (http://trec.nist.gov) 
> are quite limited and do not support some of the variations in format of 
> older TREC collections.  
> I have been doing some benchmarking work with Lucene and have had to modify 
> the package to support:
> * Older TREC document formats, which the current parser fails on due to 
> missing document headers.
> * Variations in query format - newlines after <title> tag causing the query 
> parser to get confused.
> * Ability to detect and read in uncompressed text collections
> * Storage of document numbers by default without storing full text.
> I can submit a patch if there is interest, although I will probably want to 
> write unit tests for the new functionality first.
