I haven't actually tried it, but I think the current code in
contrib/benchmark can probably handle a non-English Wikipedia dump
as well.
Have a look at contrib/benchmark/conf/extractWikipedia.alg: if you
change docs.file to point at your downloaded XML file, it may just
work.
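Untested, but the change would be roughly the one line below. The
path is only a placeholder for wherever your dump lives (it is not
a real file name from the distribution), and every other line in
the .alg file stays as shipped:

```
# contrib/benchmark/conf/extractWikipedia.alg
# Placeholder path: replace with the location of your downloaded,
# uncompressed non-English dump.
docs.file=/path/to/your-wikipedia-dump.xml
```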
Mike
Otis Gospodnetic wrote:
Hi,
I need to index a Wikipedia dump. I know there is code in
contrib/benchmark for indexing *English* Wikipedia for benchmarking
purposes. However, I'd like to index a non-English dump, and I
don't actually need it for benchmarking; I just want to end up with
a Lucene index.
Any suggestions on where I should start? That is, can anything in
contrib/benchmark already do this, or is there anything there that
I should use as a starting point, as opposed to writing my own
Wikipedia XML dump parser+indexer?
Thanks,
Otis
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]