Database? I imagine I can avoid that... Wiki dump.gz -> gunzip -> parse -> index, no?
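
For what it's worth, here is a rough sketch of the gunzip -> parse part of that pipeline, using only the JDK (GZIPInputStream plus the StAX XMLStreamReader). The class name WikiDumpReader, the method readPages, and the sink callback are all made up for illustration; none of them come from Lucene or contrib/benchmark. It also assumes the dump follows the usual MediaWiki export layout, with a <title> and a <text> element inside each <page>. The index step is left to the callback, which is where one would presumably build a Lucene Document and pass it to IndexWriter.addDocument.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.function.BiConsumer;
import java.util.zip.GZIPInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of Lucene or contrib/benchmark.
class WikiDumpReader {

    /**
     * Streams a gzip-compressed MediaWiki XML export and hands each page's
     * title and text to the sink. Returns the number of pages seen.
     * Assumes <title> and <text> elements nested inside each <page>.
     */
    static int readPages(InputStream gzipped, BiConsumer<String, String> sink)
            throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Merge adjacent character events so each element body arrives whole.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);

        int pages = 0;
        try (InputStream in = new GZIPInputStream(gzipped)) {
            XMLStreamReader xml = factory.createXMLStreamReader(in);
            String title = null;
            String text = null;
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = xml.getLocalName();
                    if ("page".equals(name)) {
                        // New page: forget whatever the previous one held.
                        title = null;
                        text = null;
                    } else if ("title".equals(name)) {
                        title = xml.getElementText();
                    } else if ("text".equals(name)) {
                        text = xml.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "page".equals(xml.getLocalName())) {
                    // Page complete: pass it on (the "index" step goes here).
                    sink.accept(title, text == null ? "" : text);
                    pages++;
                }
            }
            xml.close();
        }
        return pages;
    }
}
```

Something like WikiDumpReader.readPages(new FileInputStream(path), (title, text) -> ...) would then drive it, with the lambda doing the indexing. Because it streams, the whole dump never has to fit in memory, and nothing in it is English-specific beyond whatever analyzer the indexing side picks.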

Otis

----- Original Message ----
From: Chris Lu <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, December 12, 2007 1:55:02 AM
Subject: Re: Indexing Wikipedia dumps

For a quick Java approach, give yourself 3 minutes and try using DBSight to access the database. You can simply use "select * from mw_searchindex" as a starting point. It'll build the index for you.

However, you may need to plug in your own custom analyzer for MediaWiki's format (or maybe not).

--
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer (remain anonymous per request) got 2.6 Million Euro funding!

On Dec 11, 2007 9:35 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I need to index a Wikipedia dump. I know there is code in contrib/benchmark for indexing *English* Wikipedia for benchmarking purposes. However, I'd like to index a non-English dump, and I don't actually need it for benchmarking; I just want to end up with a Lucene index.
>
> Any suggestions on where I should start? That is, can anything in contrib/benchmark already do this, or is there anything there that I should use as a starting point, as opposed to writing my own Wikipedia XML dump parser+indexer?
>
> Thanks,
> Otis
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]