[ 
https://issues.apache.org/jira/browse/LUCENE-1156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated LUCENE-1156:
------------------------------------

    Attachment: LUCENE-1156.patch

This patch fixes the redirect problem and adds an option to discard 
image-only documents (the default is to keep them).

It does not strip the template pages, nor does it expand them.

The patch applies from contrib/benchmark.
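
For readers without the attachment, here is a rough sketch of the kind of 
filtering described above. It is not the code in LUCENE-1156.patch. The class 
WikiPageFilter, its methods, and the keepImageOnlyDocs flag are hypothetical 
names, and both the "Image:" title-prefix test and the choice to drop 
redirects outright are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of EnwikiDocMaker or the patch.
final class WikiPageFilter {
    // Case-insensitive, so "#REDIRECT", "#redirect" and "#Redirect" all match.
    private static final Pattern REDIRECT =
        Pattern.compile("^\\s*#redirect\\b", Pattern.CASE_INSENSITIVE);

    // Mirrors the default described above: image-only documents are kept.
    private final boolean keepImageOnlyDocs;

    WikiPageFilter(boolean keepImageOnlyDocs) {
        this.keepImageOnlyDocs = keepImageOnlyDocs;
    }

    // True if the page body starts with a redirect directive in any letter case.
    static boolean isRedirect(String body) {
        return body != null && REDIRECT.matcher(body).find();
    }

    // Assumption: an image-only page is recognized by its "Image:" title prefix.
    static boolean isImageOnly(String title) {
        return title != null && title.startsWith("Image:");
    }

    // Returns true if the page should be turned into a document.
    boolean accept(String title, String body) {
        if (isRedirect(body)) {
            return false; // assumption: redirects are always skipped
        }
        if (!keepImageOnlyDocs && isImageOnly(title)) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        WikiPageFilter keepImages = new WikiPageFilter(true);
        WikiPageFilter dropImages = new WikiPageFilter(false);
        System.out.println(keepImages.accept("Lucene", "#Redirect [[Apache Lucene]]"));
        System.out.println(keepImages.accept("Image:Sciencefieldnewark.jpg.jpg", "photo"));
        System.out.println(dropImages.accept("Image:Sciencefieldnewark.jpg.jpg", "photo"));
    }
}
```

Template pages pass through this sketch untouched, which matches the note 
above that the patch neither strips nor expands them.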

> Wikipedia Document Generation Changes
> -------------------------------------
>
>                 Key: LUCENE-1156
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1156
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/benchmark, contrib/wikipedia
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: LUCENE-1156.patch
>
>
> The EnwikiDocMaker currently produces a fair number of documents that are in 
> the download, but are of dubious use in terms of both benchmarking and 
> indexing.  
> These issues are:
> # Redirects (it currently handles only REDIRECT and redirect, but some 
> documents use Redirect)
> # Template files appear to be useless.  These are marked by the term 
> Template: at the beginning of the body.  See for example: 
> http://en.wikipedia.org/wiki/Template:=)
> # Image-only pages, as in 
> http://en.wikipedia.org/wiki/Image:Sciencefieldnewark.jpg.jpg  These are 
> about as useful as the Redirects and Templates.
> # Files pending deletion:  This one is a bit trickier to handle, but they are 
> generally marked by "Wikipedia:Votes for deletion" or some variation of that, 
> depending on how far along the page is in the deletion process.
> I think I can implement this such that it is backward compatible, if there is 
> such a need when it comes to the contrib/benchmark suite.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


