[
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980022#action_12980022
]
Doron Cohen commented on LUCENE-1540:
-------------------------------------
Indeed, TrecContentSource is inadequate for the Trec-Disks-4+5-minus-CR
collection (FBIS, FR94, FT, LATimes), so I am writing something to process this
collection. Interestingly, each sub-collection's format differs slightly.
(I will use this with the Robust 2004 queries.) If there are ready-to-use
building blocks for this, that would be helpful.
I am thinking of writing a separate content source implementation for each
format (the current one being the gov2 format) and, in the method
openNextFile(), identifying the correct TREC format from the file path: for
example, a file under LATimes would use the content source for that format.
The default will remain as it is today, for back-compat, and will be used if
the path does not match any of the defined patterns. It should also be
possible to specify the default TREC format, perhaps in a property.
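
To make the path-based selection concrete, here is a rough sketch of how the
format could be picked from a file path. Everything in it is an assumption for
illustration only: the class TrecFormatSelector, the enum TrecFormat, and the
directory names used as patterns are hypothetical and are not part of the
existing benchmark code.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: picks a TREC format from the directories in a path. */
public class TrecFormatSelector {

    /** Hypothetical format ids; GOV2 stands for the format handled today. */
    public enum TrecFormat { GOV2, FBIS, FR94, FT, LATIMES }

    // Assumed mapping from a lower-cased directory name to a format.
    private static final Map<String, TrecFormat> BY_DIR = new LinkedHashMap<>();
    static {
        BY_DIR.put("fbis", TrecFormat.FBIS);
        BY_DIR.put("fr94", TrecFormat.FR94);
        BY_DIR.put("ft", TrecFormat.FT);
        BY_DIR.put("latimes", TrecFormat.LATIMES);
    }

    /**
     * Returns the format of the first directory component that matches a
     * known name, or defaultFormat if none matches (the back-compat case).
     */
    public static TrecFormat select(String path, TrecFormat defaultFormat) {
        String normalized = path.replace('\\', '/').toLowerCase(Locale.ROOT);
        String[] parts = normalized.split("/");
        // The last component is the file name, so only directories are checked.
        for (int i = 0; i < parts.length - 1; i++) {
            TrecFormat format = BY_DIR.get(parts[i]);
            if (format != null) {
                return format;
            }
        }
        return defaultFormat;
    }

    public static void main(String[] args) {
        System.out.println(select("/data/trec/LATimes/la010189", TrecFormat.GOV2));
        System.out.println(select("/data/gov2/GX000/00.gz", TrecFormat.GOV2));
    }
}
```

Matching whole directory components, rather than substrings of the path, keeps
a short name such as FT from matching unrelated directories. The defaultFormat
argument is where a value read from a property could be passed in.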
> Improvements to contrib.benchmark for TREC collections
> ------------------------------------------------------
>
> Key: LUCENE-1540
> URL: https://issues.apache.org/jira/browse/LUCENE-1540
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/benchmark
> Affects Versions: 2.4
> Reporter: Tim Armstrong
> Assignee: Doron Cohen
> Priority: Minor
>
> The benchmarking utilities for TREC test collections (http://trec.nist.gov)
> are quite limited and do not support some of the variations in format of
> older TREC collections.
> I have been doing some benchmarking work with Lucene and have had to modify
> the package to support:
> * Older TREC document formats, which the current parser fails on due to
> missing document headers.
> * Variations in query format - newlines after <title> tag causing the query
> parser to get confused.
> * Ability to detect and read in uncompressed text collections
> * Storage of document numbers by default without storing full text.
> I can submit a patch if there is interest, although I will probably want to
> write unit tests for the new functionality first.