[ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steven Parkes updated LUCENE-848:
---------------------------------

    Attachment: LUCENE-848.txt

This patch is a first cut at Wikipedia benchmark support. It downloads the current English pages from the Wikipedia download site ... which, of course, is actually not there right now. I'm not quite sure what's up, but you can find the files at http://download.wikimedia.org/enwiki/20070402/ right now if you want to play.

It adds ExtractWikipedia.java, which uses Xerces-J to grab the individual articles. It writes the articles in the same format as the Reuters stuff, so a genericized version of ReutersDocMaker, DirDocMaker, works.

The current size of the download file is 2.1G bzip2'd. It's supposed to contain about 1.2M documents but I came out with 2 or 3M, I think, so there may be "extra" files in there. (Some entries are links and I tried to get rid of those, but I may have missed a particular coding or case.)

For the first pass, I copied the Reuters steps of decompressing and parsing. This creates big temporary files. Moreover, it creates a big directory tree in the end. (The extractor uses a fixed number of documents per directory and grows the depth of the tree logarithmically, a lot like Lucene segments.)

It's not clear how this preprocessing-to-a-directory-tree approach compares to on-the-fly decompression, which would require fewer disk seeks on the input during indexing. May try that at some point ...

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>         Assigned To: Steven Parkes
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: LUCENE-848.txt, WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.
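The directory layout mentioned in the comment (a fixed number of documents per directory, with the depth of the tree growing logarithmically in the number of documents) could look roughly like the sketch below. This is only an illustration under stated assumptions, not the code in the attached patch: the class name DocTreeSketch, the methods levels and pathFor, the fanout of 100, and the ".txt" file naming are all hypothetical.

```java
public class DocTreeSketch {

    // Number of directory levels needed above the files so that a tree
    // whose leaf directories hold at most `fanout` documents can store
    // `totalDocs` documents. Grows like log_fanout(totalDocs).
    public static int levels(long totalDocs, int fanout) {
        int levels = 0;
        long capacity = fanout; // capacity of a single flat directory
        while (capacity < totalDocs) {
            capacity *= fanout; // each extra level multiplies capacity by fanout
            levels++;
        }
        return levels;
    }

    // Relative path for document `docId` in a tree with `levels` directory
    // levels. Directory names are the base-`fanout` digits of the id,
    // most significant first; the file name is the id itself (hypothetical).
    public static String pathFor(long docId, int fanout, int levels) {
        long divisor = 1;
        for (int i = 0; i < levels; i++) {
            divisor *= fanout; // fanout^levels
        }
        StringBuilder path = new StringBuilder();
        long rest = docId;
        for (int i = 0; i < levels; i++) {
            path.append(rest / divisor).append('/');
            rest %= divisor;
            divisor /= fanout;
        }
        return path.append(docId).append(".txt").toString();
    }

    public static void main(String[] args) {
        int fanout = 100;                    // assumed documents per directory
        int depth = levels(1200000L, fanout); // roughly the 1.2M documents mentioned
        System.out.println(depth + " levels: " + pathFor(1234567L, fanout, depth));
        // prints "3 levels: 1/23/45/1234567.txt"
    }
}
```

With a layout like this, each leaf directory holds at most `fanout` files, so directory listings stay small even for millions of articles. The cost is the one the comment points out: the indexer has to walk a deep tree of small files instead of reading one compressed stream.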