Re: Indexing Wikipedia dumps

Karl Wettin Wed, 12 Dec 2007 06:49:49 -0800


On 12 Dec 2007, at 06:35, Otis Gospodnetic wrote:

I need to index a Wikipedia dump. I know there is code in contrib/benchmark for indexing *English* Wikipedia for benchmarking purposes. However, I'd like to index a non-English dump, and I actually don't need it for benchmarking, I just want to end up with a Lucene index.
Any suggestions where I should start? That is, can anything in contrib/benchmark already do this, or is there anything there that I should use as a starting point? As opposed to writing my own Wikipedia XML dump parser+indexer.



Here is one more alternative, the way I did it way back.

Get the tarballs containing rendered HTML. Using NekoHTML (or a similar HTML parser), find the DOM node that contains the text content. And there you go, plain text.
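
A rough sketch of that idea, not the code I actually used: it swaps NekoHTML for Python's standard-library html.parser so it runs without extra dependencies. The element id "content" and the names ContentTextExtractor and extract_plain_text are assumptions made for illustration; check the HTML in your dump for the real container.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Text inside these elements is code or styling, not article content.
SKIP_TAGS = {"script", "style"}


class ContentTextExtractor(HTMLParser):
    """Collects the text found inside the element whose id is container_id.

    Assumes reasonably well-formed HTML: an unclosed non-void tag inside the
    container would throw off the depth counter.
    """

    def __init__(self, container_id="content"):  # "content" is an assumed id
        super().__init__(convert_charrefs=True)
        self.container_id = container_id
        self.depth = 0       # nesting depth inside the container; 0 = outside
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
            if tag in SKIP_TAGS:
                self.skip_depth += 1
        elif dict(attrs).get("id") == self.container_id:
            self.depth = 1   # entered the container element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        self.depth -= 1      # reaches 0 when the container itself closes

    def handle_data(self, data):
        if self.depth > 0 and self.skip_depth == 0:
            self.chunks.append(data)


def extract_plain_text(html, container_id="content"):
    """Returns the container's text with runs of whitespace collapsed."""
    parser = ContentTextExtractor(container_id)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

Text chunks are joined with a space so that words from adjacent elements do not run together; that can leave a stray space before punctuation, which an analyzer will normally ignore at indexing time. The returned string is what you would put into the body field of each document.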



--
karl



