You probably want a combination of extractWikipedia.alg and wikipedia.alg.

You want the EnwikiDocMaker from extractWikipedia.alg, which reads the
uncompressed XML file, but rather than writing line docs with WriteLineDoc,
you want to go ahead and index as wikipedia.alg does. (Ditch the query
part.)

You'll also need an analyzer suited to your language, which StandardAnalyzer might not be.
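Roughly, an untested sketch of the combined .alg might look like the
following. The property names mirror the conf files shipped with
contrib/benchmark, but the dump path is a placeholder and you should check
everything against the conf files in your checkout:

  # Untested sketch: feed docs from the Wikipedia XML dump (as
  # extractWikipedia.alg does) and index them (as wikipedia.alg does),
  # skipping the query tasks.
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker

  # docs.file is a placeholder; point it at your downloaded dump.
  docs.file=temp/yourwiki-pages-articles.xml
  doc.maker.forever=false
  keep.image.only.docs=false

  # Swap in an analyzer that handles your language; StandardAnalyzer
  # is only a stand-in here.
  analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer

  directory=FSDirectory
  doc.tokenized=true
  doc.stored=true

  # Index until the doc maker is exhausted, then close the index.
  # ("*" is meant as "exhaust the doc maker"; check the repetition
  # syntax your version supports.)
  ResetSystemErase
  CreateIndex
  { AddDoc } : *
  Optimize
  CloseIndex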

-----Original Message-----
From: Michael McCandless [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, December 12, 2007 2:29 AM
To: java-user@lucene.apache.org
Subject: Re: Indexing Wikipedia dumps


I haven't actually tried it, but I think the current code in
contrib/benchmark can very likely extract a non-English Wikipedia
dump as well.

Have a look at contrib/benchmark/conf/extractWikipedia.alg: I think
if you just change docs.file to reference your downloaded XML file,
it could just work.
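For example (the path here is just a placeholder for whatever dump you
downloaded):

  # in conf/extractWikipedia.alg
  docs.file=temp/frwiki-pages-articles.xml

and then run it from contrib/benchmark, I believe with something like
"ant run-task -Dtask.alg=conf/extractWikipedia.alg".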

Mike

Otis Gospodnetic wrote:

> Hi,
>
> I need to index a Wikipedia dump.  I know there is code in
> contrib/benchmark for indexing the *English* Wikipedia for
> benchmarking purposes.  However, I'd like to index a non-English
> dump, and I don't actually need it for benchmarking; I just want to
> end up with a Lucene index.
>
> Any suggestions where I should start?  That is, can anything in
> contrib/benchmark already do this, or is there anything there that I
> should use as a starting point, as opposed to writing my own
> Wikipedia XML dump parser+indexer?
>
> Thanks,
> Otis

