You probably want a combination of extractWikipedia.alg and wikipedia.alg. Take the EnwikiDocMaker setup from extractWikipedia.alg, which reads the uncompressed XML dump, but rather than using WriteLineDoc, go ahead and index as wikipedia.alg does. (Ditch the query part.)
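Roughly, the combined .alg might look like the sketch below. This is only a sketch: the property and task names follow the shipped conf files, but double-check them against extractWikipedia.alg and wikipedia.alg in your checkout, and the docs.file path is just a placeholder.

    # doc maker setup, as in extractWikipedia.alg
    analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
    doc.maker=org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker

    # point this at your own (uncompressed) dump
    docs.file=temp/yourwiki-pages-articles.xml
    doc.maker.forever=false

    directory=FSDirectory

    # index everything; no query tasks
    ResetSystemErase
    CreateIndex
    { AddDoc } : *
    Optimize
    CloseIndex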
You'll also need an acceptable analyzer, which StandardAnalyzer might not be.

-----Original Message-----
From: Michael McCandless [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, December 12, 2007 2:29 AM
To: java-user@lucene.apache.org
Subject: Re: Indexing Wikipedia dumps

I haven't actually tried it, but I think it's very likely that the current code in contrib/benchmark can extract a non-English Wikipedia dump as well. Have a look at contrib/benchmark/conf/extractWikipedia.alg: if you just change docs.file to reference your downloaded XML file, it could just work?

Mike

Otis Gospodnetic wrote:
> Hi,
>
> I need to index a Wikipedia dump. I know there is code in
> contrib/benchmark for indexing *English* Wikipedia for benchmarking
> purposes. However, I'd like to index a non-English dump, and I
> actually don't need it for benchmarking; I just want to end up with
> a Lucene index.
>
> Any suggestions where I should start? That is, can anything in
> contrib/benchmark already do this, or is there anything there that
> I should use as a starting point? As opposed to writing my own
> Wikipedia XML dump parser+indexer.
>
> Thanks,
> Otis
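P.S. On the analyzer point: the .alg file takes the analyzer as a fully qualified class name, so swapping in one of the language-specific analyzers from contrib/analyzers should be a one-line change, assuming that jar is on your classpath. FrenchAnalyzer below is purely an example; pick whatever matches your dump's language.

    # illustrative only; any analyzer with a no-arg constructor should work here
    analyzer=org.apache.lucene.analysis.fr.FrenchAnalyzer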