Wikipedia Document Generation Changes
-------------------------------------

                 Key: LUCENE-1156
                 URL: https://issues.apache.org/jira/browse/LUCENE-1156
             Project: Lucene - Java
          Issue Type: Bug
          Components: contrib/benchmark, contrib/wikipedia
            Reporter: Grant Ingersoll
            Assignee: Grant Ingersoll
            Priority: Minor


The EnwikiDocMaker currently emits a fair number of documents that are present 
in the Wikipedia download but are of dubious use for both benchmarking and 
indexing.

The problem documents are:

# Redirects: it currently only handles REDIRECT and redirect, but there are 
documents marked as Redirect.
# Template files, which appear to be useless.  These are marked by the term 
Template: at the beginning of the body.  See for example: 
http://en.wikipedia.org/wiki/Template:=)
# Image-only pages, as in 
http://en.wikipedia.org/wiki/Image:Sciencefieldnewark.jpg.jpg  These are about 
as useful as the Redirects and Templates.
# Files pending deletion: this one is a bit trickier to handle, but they are 
generally marked by "Wikipedia:Votes for deletion" or some variation of that, 
depending on how far along the page is in the deletion process.
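
As a rough illustration of the four checks listed above, here is a minimal 
sketch of a standalone filter.  It is not the EnwikiDocMaker code: the class 
WikiPageFilter and its method names are hypothetical, the leading "#" on 
redirects is assumed from MediaWiki's "#REDIRECT [[Target]]" syntax, matching 
Image: against the page title is an assumption based on the example URL, and 
only the literal "Wikipedia:Votes for deletion" marker is detected, not its 
variations.

```java
/** Hypothetical helper for illustration; not part of contrib/benchmark. */
final class WikiPageFilter {

    // Markers taken from the issue description. Real dumps may use variations.
    private static final String REDIRECT_PREFIX = "#redirect";   // '#' is an assumption
    private static final String TEMPLATE_PREFIX = "Template:";
    private static final String IMAGE_PREFIX = "Image:";
    private static final String DELETION_MARKER = "wikipedia:votes for deletion";

    private WikiPageFilter() {}

    /** Case-insensitive prefix test, so REDIRECT, redirect and Redirect all match. */
    private static boolean startsWithIgnoreCase(String text, String prefix) {
        return text.regionMatches(true, 0, prefix, 0, prefix.length());
    }

    static boolean isRedirect(String body) {
        return startsWithIgnoreCase(body.trim(), REDIRECT_PREFIX);
    }

    static boolean isTemplate(String body) {
        // The issue says templates carry "Template:" at the beginning of the body.
        return startsWithIgnoreCase(body.trim(), TEMPLATE_PREFIX);
    }

    static boolean isImagePage(String title) {
        // Assumption: image-only pages are identified by their title prefix.
        return startsWithIgnoreCase(title.trim(), IMAGE_PREFIX);
    }

    static boolean isPendingDeletion(String body) {
        // Only the literal marker is checked; variations would need extra patterns.
        return body.toLowerCase(java.util.Locale.ROOT).contains(DELETION_MARKER);
    }

    /** Returns true if the page falls into any of the four categories. */
    static boolean shouldSkip(String title, String body) {
        String t = (title == null) ? "" : title;
        String b = (body == null) ? "" : body;
        return isRedirect(b) || isTemplate(b) || isImagePage(t) || isPendingDeletion(b);
    }
}
```

A doc maker could call WikiPageFilter.shouldSkip(title, body) before building 
a document and move on to the next page when it returns true.  Putting the 
call behind a configuration switch that defaults to off would be one way to 
keep the existing behavior available.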

I think I can implement this such that it is backward compatible, if there is 
such a need when it comes to the contrib/benchmark suite.





-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

