TrecDocMaker skips over documents when "Date" is missing from documents
-----------------------------------------------------------------------
Key: LUCENE-1479
URL: https://issues.apache.org/jira/browse/LUCENE-1479
Project: Lucene - Java
Issue Type: Bug
Components: contrib/benchmark
Reporter: Shai Erera
Fix For: 2.4.1
TrecDocMaker skips TREC documents that do not have a "Date" line. When it
encounters such a document, the code keeps scanning for the tag it is looking
for and may skip several documents before it finds one.

As a result, instead of reading ~25M documents from the GOV2 collection, the
code reads only ~23M (approximate figures; I don't remember the exact numbers).

The fix adds a terminatingTag parameter to read(), so that the code searches
for prefix only until terminatingTag is found. getNextDocData() was changed
accordingly.
Patch to follow.
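
To illustrate the idea of a terminating tag, here is a minimal, self-contained
sketch. It is not the patch and not the actual TrecDocMaker code: the class
name, the parse() helper, the SAMPLE input, and the choice of "</DOCHDR>" as
the terminating tag are all assumptions made for this example. It only shows
how bounding the search for an optional line keeps the reader from running
into the next document.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and tag choices are assumptions, not the patch.
class TerminatingTagSketch {

    // Three tiny TREC-like documents; the second one has no "Date: " line.
    static final String SAMPLE =
        "<DOC>\n<DOCNO>D1</DOCNO>\n<DOCHDR>\nhttp://a\nDate: Mon\n</DOCHDR>\nbody1\n</DOC>\n"
      + "<DOC>\n<DOCNO>D2</DOCNO>\n<DOCHDR>\nhttp://b\n</DOCHDR>\nbody2\n</DOC>\n"
      + "<DOC>\n<DOCNO>D3</DOCNO>\n<DOCHDR>\nDate: Tue\n</DOCHDR>\nbody3\n</DOC>\n";

    /**
     * Returns the first line starting with prefix, or null if a line starting
     * with terminatingTag (or the end of input) is reached first.
     * A null terminatingTag means "search without any bound".
     */
    static String read(BufferedReader in, String prefix, String terminatingTag)
            throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(prefix)) {
                return line;
            }
            if (terminatingTag != null && line.startsWith(terminatingTag)) {
                return null; // optional line is absent; stop inside this document
            }
        }
        return null;
    }

    /**
     * Returns one "docno|date" entry per document found; date is empty when
     * the document has no "Date: " line. With useTerminatingTag == false the
     * search for "Date: " is unbounded, which reproduces the skipping problem.
     */
    static List<String> parse(String input, boolean useTerminatingTag) {
        List<String> docs = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(input))) {
            while (read(in, "<DOC>", null) != null) {
                String docnoLine = read(in, "<DOCNO>", null);
                if (docnoLine == null) {
                    break; // truncated input
                }
                String docno = docnoLine.substring(
                        "<DOCNO>".length(), docnoLine.indexOf("</DOCNO>")).trim();
                String dateLine = read(in, "Date: ",
                        useTerminatingTag ? "</DOCHDR>" : null);
                String date = dateLine == null
                        ? "" : dateLine.substring("Date: ".length());
                docs.add(docno + "|" + date);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return docs;
    }
}
```

Calling TerminatingTagSketch.parse(TerminatingTagSketch.SAMPLE, true) yields
one entry per document, with an empty date for D2. With false, the unbounded
search for "Date: " in D2 runs on into D3, so D3 is never reported and D2 is
given D3's date.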