Note that the current code doesn't actually do anything with the wiki syntax, but I would think that as long as the other language's dump is in the same format, you should be fine.
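
To illustrate the "same format" point, here is a minimal sketch, not the contrib/benchmark code, that pulls the title and raw text out of each page of a dump with the JDK's StAX parser. It assumes the usual MediaWiki export layout (a <page> element containing a <title> and a <text> element); the class name WikiDumpSketch and the sample article are made up for the example. Like the current code, it leaves the wiki syntax untouched, and nothing in it depends on the language of the dump.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not part of Lucene.
public class WikiDumpSketch {

    /** Title and raw wiki text of one page; the markup is left unparsed. */
    public static final class Page {
        public final String title;
        public final String text;

        Page(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    /** Streams through the XML and collects one Page per <page> element. */
    public static List<Page> readPages(Reader in) throws XMLStreamException {
        XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<Page> pages = new ArrayList<>();
        String title = null;
        String text = null;
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                // getLocalName() ignores the namespace that real dumps declare.
                String name = xml.getLocalName();
                if (name.equals("page")) {
                    title = null;
                    text = null;
                } else if (name.equals("title")) {
                    title = xml.getElementText();
                } else if (name.equals("text")) {
                    text = xml.getElementText();
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && xml.getLocalName().equals("page")) {
                pages.add(new Page(title, text));
            }
        }
        xml.close();
        return pages;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Tiny made-up sample in the assumed layout, with non-English text.
        String sample = "<mediawiki><page><title>Zagreb</title><revision>"
                + "<text>'''Zagreb''' je glavni grad [[Hrvatska|Hrvatske]].</text>"
                + "</revision></page></mediawiki>";
        for (Page p : readPages(new StringReader(sample))) {
            System.out.println(p.title + " -> " + p.text);
        }
    }
}
```

For a real dump you would pass a reader over the downloaded file instead of the sample string and hand each Page to whatever does the indexing.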

-Grant

On Dec 12, 2007, at 5:28 AM, Michael McCandless wrote:


I haven't actually tried it, but I think it's very likely that the current code in contrib/benchmark can extract a non-English Wikipedia dump as well.

Have a look at contrib/benchmark/conf/extractWikipedia.alg: I think if you just change the docs.file to reference your downloaded XML file it could just work?

Mike

Otis Gospodnetic wrote:

Hi,

I need to index a Wikipedia dump. I know there is code in contrib/benchmark for indexing *English* Wikipedia for benchmarking purposes. However, I'd like to index a non-English dump, and I actually don't need it for benchmarking, I just want to end up with a Lucene index.

Any suggestions where I should start? That is, can anything in contrib/benchmark already do this, or is there anything there that I should use as a starting point? As opposed to writing my own Wikipedia XML dump parser+indexer.

Thanks,
Otis



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


