Re: Indexing Wikipedia dumps

Matt Kangas Tue, 11 Dec 2007 22:20:21 -0800

Otis, if you're willing to use some non-Java code for your task...

1) Wikipedia uses Lucene for its full-text search, and the module is part of Mediawiki. You could use this as follows:

- Install Mediawiki
- Load your Wikipedia dump into MW (and MySQL)
- Build a search index for the Lucene Search extension:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/lucene-search/README.txt?revision=8535&view=markup

2) Alternatively, use Mediawiki's native import parser (in PHP) and use that to feed Solr, etc. The code is a bit hairy, though.

http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/SpecialImport.php?revision=27686&view=markup
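
For a sense of what any such parser has to do, here is a minimal Python sketch (not the Mediawiki PHP code, and not anything from contrib/benchmark) that streams <page> elements out of a Mediawiki XML export and yields title/text pairs. The names iter_pages, local_name and SAMPLE are made up for illustration, and the sample XML is a hand-written stand-in for a real dump. It assumes one revision per page, as in a current-revisions dump; on a full-history dump it would keep only the last revision seen for each page. Feeding the pairs to Solr or Lucene is left out.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for each <page> in a Mediawiki XML export.

    `source` is a filename or a binary file object. The dump is parsed
    incrementally, so the full text is never loaded at once.
    """
    title, text = "", ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            # Assumption: one <revision> per page; later ones overwrite.
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = "", ""
            elem.clear()  # drop the finished page's children and text


# Hypothetical two-page sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Lucene</title>
    <revision><text>Lucene is a search library.</text></revision>
  </page>
  <page>
    <title>Solr</title>
    <revision><text>Solr is built on Lucene.</text></revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    for page_title, page_text in iter_pages(io.BytesIO(SAMPLE)):
        print(page_title, "->", page_text)
```

Because tag names are matched after stripping the namespace, the same loop should work whichever export schema version the dump declares, and the language of the dump does not matter to the parser.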

--Matt

On Dec 12, 2007, at 12:35 AM, Otis Gospodnetic wrote:

Hi,
I need to index a Wikipedia dump. I know there is code in contrib/benchmark for indexing *English* Wikipedia for benchmarking purposes. However, I'd like to index a non-English dump, and I actually don't need it for benchmarking, I just want to end up with a Lucene index.
Any suggestions where I should start? That is, can anything in contrib/benchmark already do this, or is there anything there that I should use as a starting point? As opposed to writing my own Wikipedia XML dump parser+indexer.
Thanks,
Otis


--
Matt Kangas / [EMAIL PROTECTED]



