[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985506#action_12985506
 ] 

Shai Erera commented on LUCENE-1540:
------------------------------------

OK, though I really think the cost of 3 passes vs. 2 is negligible. The extra
pass we add is very simple: it's the only one that does IO, and even then it
just reads lines and compares them to <DOC> or </DOC> (a very cheap
comparison). From then on, the actual TREC document is parsed in memory.
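
To make the cost of that extra pass concrete, here is a minimal sketch of what 
it could look like. The class and method names (RawTrecDocReader, readRawDoc, 
readAll) are made up for illustration and are not taken from the patch or from 
TrecContentSource; the sketch also assumes the <DOC> and </DOC> markers sit on 
lines of their own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: isolates the IO-only pass.
public class RawTrecDocReader {

    /**
     * Reads lines until one complete <DOC>...</DOC> block has been collected.
     * Returns the text between the markers, or null when no further complete
     * block exists (end of input, or a final block with no closing marker).
     */
    public static String readRawDoc(BufferedReader in) throws IOException {
        String line;
        // Skip everything up to and including the opening marker.
        while ((line = in.readLine()) != null && !line.trim().equals("<DOC>")) {
            // nothing to do: one string comparison per line
        }
        if (line == null) {
            return null;
        }
        StringBuilder body = new StringBuilder();
        while ((line = in.readLine()) != null) {
            if (line.trim().equals("</DOC>")) {
                return body.toString();
            }
            body.append(line).append('\n');
        }
        return null; // input ended before </DOC> was seen
    }

    /** Convenience wrapper: collects every complete raw document from a source. */
    public static List<String> readAll(Reader source) {
        List<String> docs = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            for (String doc = readRawDoc(in); doc != null; doc = readRawDoc(in)) {
                docs.add(doc);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return docs;
    }
}
```

Everything after this step works on the returned String, so no further IO is 
needed to parse the document's structure.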

I think this could even improve the current multi-threading support in 
TrecContentSource. Today the threads synchronize on reading each TREC 
document, which includes parsing its structure, so the only thing done in 
parallel is parsing the HTML content. It'd be interesting to benchmark the 
3-pass method, where each thread would sync only on reading the section from 
<DOC> to </DOC> and then proceed to parse the structure on its own.
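
A rough sketch of that locking scheme, assuming a shared reader whose only 
synchronized method is the raw <DOC>..</DOC> read. SyncedRawReader, 
nextRawDoc, parseDocNo and parseAll are hypothetical names, and extracting 
<DOCNO> merely stands in for the real structure parsing; this is not how 
TrecContentSource is written today.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical illustration: threads contend only on the raw read.
public class SyncedRawReader {
    private final BufferedReader in;

    public SyncedRawReader(BufferedReader in) {
        this.in = in;
    }

    /**
     * The only code that runs under the lock: IO plus one string comparison
     * per line. Returns null at end of input. A final block with no </DOC>
     * is returned as-is in this sketch.
     */
    public synchronized String nextRawDoc() {
        try {
            String line;
            while ((line = in.readLine()) != null && !line.trim().equals("<DOC>")) {
                // skip lines outside a document
            }
            if (line == null) {
                return null;
            }
            StringBuilder body = new StringBuilder();
            while ((line = in.readLine()) != null && !line.trim().equals("</DOC>")) {
                body.append(line).append('\n');
            }
            return body.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs outside the lock; placeholder for real structure parsing. */
    static String parseDocNo(String raw) {
        int start = raw.indexOf("<DOCNO>");
        int end = raw.indexOf("</DOCNO>");
        if (start < 0 || end < start) {
            return null;
        }
        return raw.substring(start + "<DOCNO>".length(), end).trim();
    }

    /** Starts nThreads workers that share one reader and parse in parallel. */
    public static Set<String> parseAll(String text, int nThreads) {
        SyncedRawReader reader =
            new SyncedRawReader(new BufferedReader(new StringReader(text)));
        Set<String> docNos = new ConcurrentSkipListSet<>();
        Thread[] workers = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            workers[i] = new Thread(() -> {
                String raw;
                while ((raw = reader.nextRawDoc()) != null) {
                    String docNo = parseDocNo(raw); // no lock held here
                    if (docNo != null) {
                        docNos.add(docNo);
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return docNos;
    }
}
```

Whether this actually beats the current behaviour is exactly what the 
benchmark would have to show.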

It sounds like TrecContentSource could act like a SAX parser, reading TrecDoc 
objects and emitting them to a BlockingQueue, while worker threads read from 
the queue and proceed on their own.
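
The queue-based variant could be sketched roughly as follows. This is only an 
illustration of the producer/consumer idea: QueuedDocPipeline and run are 
invented names, plain Strings stand in for TrecDoc objects, and trimming the 
text is a placeholder for real parsing. The queue capacity of 16 is arbitrary.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration: one producer emits raw docs, N consumers parse.
public class QueuedDocPipeline {
    // End-of-input marker, compared by identity so no real document matches it.
    private static final String POISON = new String("<<EOF>>");

    public static List<String> run(List<String> rawDocs, int nConsumers) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        List<String> parsed = Collections.synchronizedList(new ArrayList<>());
        List<Thread> threads = new ArrayList<>();

        // Producer: the only thread that would touch the underlying input.
        threads.add(new Thread(() -> {
            try {
                for (String raw : rawDocs) {
                    queue.put(raw); // blocks while the queue is full
                }
                for (int i = 0; i < nConsumers; i++) {
                    queue.put(POISON); // one marker per consumer
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));

        // Consumers: take a raw doc and parse it without any shared lock.
        for (int i = 0; i < nConsumers; i++) {
            threads.add(new Thread(() -> {
                try {
                    for (String raw = queue.take(); raw != POISON; raw = queue.take()) {
                        parsed.add(raw.trim()); // placeholder for real parsing
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }

        for (Thread t : threads) {
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return parsed; // order depends on thread scheduling
    }
}
```

The bounded queue keeps the reader from running far ahead of the parsers, so 
memory use stays limited even on large collections.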

What I do agree on is that 3 passes are unnecessarily expensive for 
single-threaded benchmarks.

> Improvements to contrib.benchmark for TREC collections
> ------------------------------------------------------
>
>                 Key: LUCENE-1540
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1540
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Tim Armstrong
>            Assignee: Doron Cohen
>            Priority: Minor
>         Attachments: LUCENE-1540.patch
>
>
> The benchmarking utilities for TREC test collections (http://trec.nist.gov) 
> are quite limited and do not support some of the variations in format of 
> older TREC collections.  
> I have been doing some benchmarking work with Lucene and have had to modify 
> the package to support:
> * Older TREC document formats, which the current parser fails on due to 
> missing document headers.
> * Variations in query format - newlines after <title> tag causing the query 
> parser to get confused.
> * Ability to detect and read in uncompressed text collections
> * Storage of document numbers by default without storing full text.
> I can submit a patch if there is interest, although I will probably want to 
> write unit tests for the new functionality first.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


