[jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Steven Parkes (JIRA) Mon, 09 Apr 2007 11:29:55 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Steven Parkes updated LUCENE-848:
---------------------------------

    Attachment: LUCENE-848.txt

This patch is a first cut at Wikipedia benchmark support. It downloads the 
current English pages from the Wikipedia download site ... which, of course, is 
actually not there right now. I'm not quite sure what's up, but you can find 
the files at http://download.wikimedia.org/enwiki/20070402/ right now if you 
want to play.

It adds ExtractWikipedia.java, which uses Xerces-J to grab the individual 
articles. It writes the articles in the same format as the Reuters stuff, so a 
generalized ReutersDocMaker, DirDocMaker, works.
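
For illustration only, here is a minimal sketch of pulling articles out of a 
MediaWiki XML dump with a SAX handler. It is not the code in the patch: the 
class and method names are hypothetical, it uses the JDK's bundled SAX parser 
rather than a standalone Xerces-J, and it only collects title/text pairs 
instead of writing Reuters-style files.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; not the ExtractWikipedia class from the patch.
final class WikiPageSketch {

    /** Collects one {title, text} pair per <page> element. */
    static final class PageHandler extends DefaultHandler {
        final List<String[]> pages = new ArrayList<>();
        private final StringBuilder buf = new StringBuilder();
        private String title;
        private String text;

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            buf.setLength(0); // start collecting character data for this element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            buf.append(ch, start, length); // may be called several times per element
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            switch (qName) {
                case "title":
                    title = buf.toString();
                    break;
                case "text":
                    text = buf.toString();
                    break;
                case "page":
                    pages.add(new String[] {title, text});
                    title = null;
                    text = null;
                    break;
                default:
                    break;
            }
        }
    }

    /** Parses an in-memory XML string; a real run would pass a file stream. */
    static List<String[]> parse(String xml) throws Exception {
        PageHandler handler = new PageHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.pages;
    }
}
```

Because SAX streams the input, memory use stays bounded by the largest single 
article rather than by the size of the dump.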

The current size of the download file is 2.1G bzip2'd. It's supposed to contain 
about 1.2M documents but I came out with 2M or 3M, I think, so there may be 
"extra" files in there. (Some entries are links and I tried to get rid of 
those, but I may have missed a particular coding or case.)
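
As a rough sketch of the link filtering mentioned above: MediaWiki redirect 
pages start with a "#REDIRECT" marker, so one plausible check is a 
case-insensitive prefix test on the article text. The class and method names 
are hypothetical, and this is an assumption about how such entries could be 
detected, not a description of what the patch does.

```java
// Hypothetical sketch; the patch's actual filtering logic is not shown here.
final class RedirectFilterSketch {

    private static final String MARKER = "#REDIRECT";

    /** Returns true if the wiki text looks like a redirect page. */
    static boolean isRedirect(String wikiText) {
        if (wikiText == null) {
            return false;
        }
        String trimmed = wikiText.trim(); // tolerate leading whitespace
        // Case-insensitive prefix match, so "#redirect" is caught as well.
        return trimmed.regionMatches(true, 0, MARKER, 0, MARKER.length());
    }
}
```

A check like this would miss any other markup that produces link-only 
entries, which is one way "extra" documents could slip through.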

For the first pass, I copied the Reuters steps of decompressing and parsing. 
This creates big temporary files. Moreover, it creates a big directory tree in 
the end. (The extractor uses a fixed number of documents per directory and 
grows the depth of the tree logarithmically, a lot like Lucene segments).
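
One way to picture such a layout is sketched below. The names, the ".txt" 
suffix, and the exact numbering scheme are assumptions for illustration; the 
patch may lay out its tree differently. Here each directory holds at most 
perDir entries, and the number of directory levels grows with the logarithm 
(base perDir) of the document count.

```java
// Hypothetical sketch of a fixed-fan-out directory tree.
final class DirTreeSketch {

    /** Fewest directory levels such that perDir^(levels + 1) >= totalDocs. */
    static int depthFor(long totalDocs, int perDir) {
        int depth = 0;
        long capacity = perDir; // documents that fit with no subdirectories
        while (capacity < totalDocs) {
            capacity *= perDir; // one more level multiplies capacity by perDir
            depth++;
        }
        return depth;
    }

    /** Relative path for a document: base-perDir digits of docNum / perDir. */
    static String pathFor(long docNum, int perDir, int depth) {
        long dirIndex = docNum / perDir; // which leaf directory holds the doc
        String[] parts = new String[depth];
        for (int level = depth - 1; level >= 0; level--) {
            parts[level] = Long.toString(dirIndex % perDir);
            dirIndex /= perDir;
        }
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part).append('/');
        }
        return sb.append(docNum).append(".txt").toString();
    }
}
```

With 100 documents per directory, 1.2M documents need three levels under this 
scheme, and document 123456 lands at 0/12/34/123456.txt.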

It's not clear how this preprocessing-to-a-directory-tree compares to 
on-the-fly decompression, which would require fewer disk seeks on the input 
during indexing. May try that at some point ...
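
A sketch of what the on-the-fly approach could look like: the decompressing 
stream is handed straight to the SAX parser, so no temporary file is written. 
The JDK has no bzip2 decoder, so this example swaps in GZIPInputStream as a 
stand-in; a bzip2 stream class from a third-party library would take its place 
for the real dump. All names here are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; gzip stands in for bzip2, which the JDK lacks.
final class StreamingParseSketch {

    /** Counts <page> elements while decompressing, with no temporary file. */
    static int countPages(InputStream compressed) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("page".equals(qName)) {
                    count[0]++;
                }
            }
        };
        try (InputStream in = new GZIPInputStream(compressed)) {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        }
        return count[0];
    }

    /** Helper that builds a small compressed input for trying the sketch. */
    static byte[] gzip(String s) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }
}
```

Whether this beats the directory-tree approach would still need measuring, 
since decompression cost is then paid on every indexing run.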

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>         Assigned To: Steven Parkes
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: LUCENE-848.txt, WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


