My firm uses a parser based on javax.xml.stream.XMLStreamReader to break (English and non-English) Wikipedia XML dumps into Lucene-style "documents and fields." We use Wikipedia to test our language-specific code, so we've probably indexed 20 Wikipedia dumps.
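For reference, the core of such a StAX loop looks roughly like the sketch below. It assumes the usual <page>/<title>/<revision>/<text> layout of the MediaWiki dumps; the class name, field handling, and the println are illustrative stand-ins (not our actual code) for whatever builds the Lucene Documents and feeds an IndexWriter.

```java
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not production code: stream a MediaWiki dump and pull
// one (title, text) pair out of each <page>; hand each pair to whatever
// builds your Lucene Documents and passes them to an IndexWriter.
public class WikipediaDumpReader {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
            factory.createXMLStreamReader(new FileInputStream(args[0]), "UTF-8");

        StringBuilder title = null;
        StringBuilder text = null;
        String element = null;  // element whose character data we are inside

        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    element = reader.getLocalName();
                    if ("page".equals(element)) {   // a new article begins
                        title = new StringBuilder();
                        text = new StringBuilder();
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    // character data may arrive in several chunks, so append
                    if ("title".equals(element) && title != null) {
                        title.append(reader.getText());
                    } else if ("text".equals(element) && text != null) {
                        text.append(reader.getText());
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if ("page".equals(reader.getLocalName()) && title != null) {
                        // build a Lucene Document with "title" and "body"
                        // fields here and hand it to the IndexWriter
                        System.out.println(title + " (" + text.length() + " chars)");
                    }
                    element = null;
                    break;
            }
        }
        reader.close();
    }
}
```

Because StAX streams the file instead of building a DOM, this handles the multi-gigabyte dumps in constant memory, and nothing in it is language-specific, so the same loop works for any wiki.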
- andy g

On Dec 11, 2007 9:35 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I need to index a Wikipedia dump. I know there is code in contrib/benchmark
> for indexing *English* Wikipedia for benchmarking purposes. However, I'd
> like to index a non-English dump, and I actually don't need it for
> benchmarking, I just want to end up with a Lucene index.
>
> Any suggestions where I should start? That is, can anything in
> contrib/benchmark already do this, or is there anything there that I should
> use as a starting point? As opposed to writing my own Wikipedia XML dump
> parser+indexer.
>
> Thanks,
> Otis