[ 
https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662452#action_12662452
 ] 

Shai Erera commented on LUCENE-1479:
------------------------------------

This patch does not include a test case because a test would require the 
TREC data set. Is it acceptable to add a test case that fails if the TREC 
data is missing? If not, can you suggest how I can simulate it?
I can create several documents in the TREC format and feed those files to 
TrecDocMaker.
Or ... I'll look into extending TrecDocMaker so that, instead of feeding it 
File(s), I feed it some mock documents (Strings) that reproduce the bug. 
Not sure if that's doable right away - it might require changing a method to 
protected.
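
A rough sketch of the mock-document idea, in case it helps the discussion. 
Everything here is hypothetical: FileBackedMaker is only a stand-in and not 
the real TrecDocMaker, and openInput() and countDocs() are invented names. 
It only shows how a protected method would let a test subclass replace the 
file input with an in-memory String.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class MockSourceSketch {

    /** Hypothetical stand-in for a file-backed doc maker (NOT the real TrecDocMaker). */
    static class FileBackedMaker {
        // Protected (assumption) so that a test subclass can swap the input source.
        protected Reader openInput() throws IOException {
            throw new IOException("TREC data set not available");
        }

        /** Counts the lines that open a document in whatever openInput() supplies. */
        int countDocs() throws IOException {
            BufferedReader in = new BufferedReader(openInput());
            try {
                int count = 0;
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.startsWith("<DOC>")) {
                        count++;
                    }
                }
                return count;
            } finally {
                in.close();
            }
        }
    }

    /** Test double: serves mock documents from a String instead of files. */
    static class StringBackedMaker extends FileBackedMaker {
        private final String content;

        StringBackedMaker(String content) {
            this.content = content;
        }

        @Override
        protected Reader openInput() {
            return new StringReader(content);
        }
    }
}
```

With something like this, a test would not depend on the TREC data being 
present; it would only construct the String-backed subclass with a few 
hand-written documents.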

Also, I'm not near the code right now, so I can't tell whether DocData allows 
a null Date. But I guess it's simpler to just assign the current date (you 
never know if at some point the date becomes a *must* ...).
That is the logic TrecDocMaker had before the patch, so I kept it ...

Shai

> TrecDocMaker skips over documents when "Date" is missing from documents
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-1479
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1479
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/benchmark
>            Reporter: Shai Erera
>            Assignee: Michael McCandless
>             Fix For: 2.4.1, 2.9
>
>         Attachments: LUCENE-1479.patch
>
>
> TrecDocMaker skips over TREC documents if they do not have a "Date" line. 
> When such a document is encountered, the code may skip over several documents 
> until it finds the next tag it is searching for.
> As a result, instead of reading ~25M documents from the GOV2 collection, 
> the code reads only ~23M (I don't remember the exact numbers).
> The fix adds a terminatingTag to read() so that the code looks for prefix, 
> but only until terminatingTag is found. Appropriate changes were made in 
> getNextDocData().
> Patch to follow.
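
To illustrate the terminatingTag idea from the description above, here is a 
minimal, self-contained sketch. It is not the code from the patch: the method 
names readUntil() and datesPerDoc(), their signatures, and the simplified 
document layout are assumptions made for illustration only. The point is that 
the search for a prefix stops at the end of the current document instead of 
running on into the next one.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TerminatingTagSketch {

    /**
     * Scans forward for a line starting with prefix (hypothetical helper).
     * Returns that line, or null if a line starting with terminatingTag is
     * seen first or the input ends. A null terminatingTag means "no limit".
     */
    static String readUntil(BufferedReader in, String prefix, String terminatingTag)
            throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(prefix)) {
                return line;
            }
            if (terminatingTag != null && line.startsWith(terminatingTag)) {
                return null; // hit the end of this document: stop searching
            }
        }
        return null;
    }

    /** Returns one entry per document: its date text, or null if it has none. */
    static List<String> datesPerDoc(String input) throws IOException {
        BufferedReader in = new BufferedReader(new StringReader(input));
        List<String> dates = new ArrayList<String>();
        while (readUntil(in, "<DOC>", null) != null) {
            // Look for the date, but never past the closing tag of this document.
            String dateLine = readUntil(in, "Date: ", "</DOC>");
            if (dateLine != null) {
                dates.add(dateLine.substring("Date: ".length()));
                readUntil(in, "</DOC>", null); // consume the rest of the document
            } else {
                dates.add(null); // closing tag was already consumed by the search
            }
        }
        return dates;
    }
}
```

Given three documents where only the second lacks a "Date: " line, 
datesPerDoc() returns three entries, with null in the second position. 
Without the terminating tag, the search for the second document's date would 
run on into the third document, which is the kind of skipping the issue 
describes.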

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


