Re: Indexing Wikipedia dumps

Grant Ingersoll Wed, 12 Dec 2007 05:12:07 -0800

Note that the current code doesn't actually do anything with the wiki syntax, but I would think as long as the other language is in the same format you should be fine.
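
To illustrate the "same format" point, here is a minimal Python sketch. It is not the contrib/benchmark code; it only assumes that every language edition's dump uses the usual MediaWiki export layout (page, title, revision, text). The function names, the sample document, and the namespace version in it are made up for illustration. The wiki markup is passed through untouched, as described above.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for each <page> element, streaming the input.

    Nothing here depends on the language of the wiki; only the element
    names matter. The raw wiki markup in <text> is returned unchanged.
    """
    title, text = "", ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = "", ""
            elem.clear()  # drop the finished page's children to limit memory use


# Hypothetical two-page sample standing in for a non-English dump.
# The namespace URI is an assumption; real dumps declare their own version.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xml:lang="de">
  <page>
    <title>Beispiel</title>
    <revision>
      <text>'''Beispiel''' ist ein [[Wort]].</text>
    </revision>
  </page>
  <page>
    <title>Zweite Seite</title>
    <revision>
      <text>Noch ein Text.</text>
    </revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    for page_title, page_text in iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))):
        print(page_title, len(page_text))
```

For a real dump, an opened file (for example from bz2.open on the downloaded archive) would take the place of the in-memory sample, and each (title, text) pair would be handed to whatever indexer is in use.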


-Grant


On Dec 12, 2007, at 5:28 AM, Michael McCandless wrote:

I haven't actually tried it, but I think very likely the current code in contrib/benchmark might be able to extract a non-English Wikipedia dump as well?
Have a look at contrib/benchmark/conf/extractWikipedia.alg: I think if you just change the docs.file to reference your downloaded XML file it could just work?
Mike

Otis Gospodnetic wrote:
Hi,
I need to index a Wikipedia dump. I know there is code in contrib/benchmark for indexing *English* Wikipedia for benchmarking purposes. However, I'd like to index a non-English dump, and I actually don't need it for benchmarking, I just want to end up with a Lucene index.
Any suggestions where I should start? That is, can anything in contrib/benchmark already do this, or is there anything there that I should use as a starting point? As opposed to writing my own Wikipedia XML dump parser+indexer.
Thanks,
Otis



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
