TrecDocMaker skips over documents when "Date" is missing from documents
-----------------------------------------------------------------------

                 Key: LUCENE-1479
                 URL: https://issues.apache.org/jira/browse/LUCENE-1479
             Project: Lucene - Java
          Issue Type: Bug
          Components: contrib/benchmark
            Reporter: Shai Erera
             Fix For: 2.4.1


TrecDocMaker skips over TREC documents that do not have a "Date" line. When
such a document is encountered, the code keeps reading until it finds the next
tag it is searching for, which may lie several documents ahead, so those
documents are skipped as well.
As a result, instead of reading ~25M documents from the GOV2 collection, the
code reads only ~23M (approximate figures; I don't remember the actual numbers).

The fix adds a terminatingTag argument to read(), so that the code looks for
prefix only until terminatingTag is found. Corresponding changes were made in
getNextDocData().
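
To illustrate the idea (this is a minimal sketch, not the actual patch), the
snippet below shows a bounded search: readUntil() looks for a line starting
with prefix, but gives up once it sees a line starting with terminatingTag.
The class name, the method names, the sample data and the choice of "</DOC>"
as the terminating tag are all assumptions made for the example; they are not
taken from TrecDocMaker.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not the TrecDocMaker code.
public class TerminatingTagSketch {

    // Three toy documents; the second one has no "Date:" line.
    static final String SAMPLE = String.join("\n",
            "<DOC>", "Date: A", "body 1", "</DOC>",
            "<DOC>", "body 2", "</DOC>",
            "<DOC>", "Date: C", "body 3", "</DOC>");

    // Returns the first line starting with prefix, or null if a line starting
    // with terminatingTag (when non-null) or the end of input comes first.
    static String readUntil(BufferedReader in, String prefix, String terminatingTag)
            throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(prefix)) {
                return line;
            }
            if (terminatingTag != null && line.startsWith(terminatingTag)) {
                return null; // stop here instead of running into the next document
            }
        }
        return null;
    }

    // Counts the documents seen, with or without the bounded "Date:" search.
    static int countDocs(String text, boolean bounded) {
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            int count = 0;
            while (readUntil(in, "<DOC>", null) != null) {
                String date = readUntil(in, "Date:", bounded ? "</DOC>" : null);
                if (date != null) {
                    readUntil(in, "</DOC>", null); // consume the rest of the document
                } else if (!bounded) {
                    break; // unbounded search ran off the end of the input
                }
                // In bounded mode a null date means "</DOC>" was already consumed,
                // so the document is still counted, just without a date.
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("unbounded: " + countDocs(SAMPLE, false));
        System.out.println("bounded:   " + countDocs(SAMPLE, true));
    }
}
```

On this sample the unbounded search counts 2 documents, because the search for
"Date:" in the second document runs on into the third, while the bounded
search counts all 3.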

Patch to follow
