My firm uses a parser based on javax.xml.stream.XMLStreamReader to break (English and non-English) Wikipedia XML dumps into Lucene-style "documents and fields." We use Wikipedia to test our language-specific code, so we've probably indexed 20 Wikipedia dumps.
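For reference, the core of such a StAX loop looks roughly like the sketch below. It assumes the usual <page>/<title>/<revision>/<text> layout of the MediaWiki dumps; the class name, field handling, and the println are illustrative stand-ins (not our actual code) for whatever builds the Lucene Documents and feeds an IndexWriter.

```java
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not production code: stream a MediaWiki dump and pull
// one (title, text) pair out of each <page>; hand each pair to whatever
// builds your Lucene Documents and passes them to an IndexWriter.
public class WikipediaDumpReader {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
            factory.createXMLStreamReader(new FileInputStream(args[0]), "UTF-8");

        StringBuilder title = null;
        StringBuilder text = null;
        String element = null;  // element whose character data we are inside

        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    element = reader.getLocalName();
                    if ("page".equals(element)) {   // a new article begins
                        title = new StringBuilder();
                        text = new StringBuilder();
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    // character data may arrive in several chunks, so append
                    if ("title".equals(element) && title != null) {
                        title.append(reader.getText());
                    } else if ("text".equals(element) && text != null) {
                        text.append(reader.getText());
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if ("page".equals(reader.getLocalName()) && title != null) {
                        // build a Lucene Document with "title" and "body"
                        // fields here and hand it to the IndexWriter
                        System.out.println(title + " (" + text.length() + " chars)");
                    }
                    element = null;
                    break;
            }
        }
        reader.close();
    }
}
```

Because StAX streams the file instead of building a DOM, this handles the multi-gigabyte dumps in constant memory, and nothing in it is language-specific, so the same loop works for any wiki.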
- andy g

On Dec 11, 2007 9:35 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I need to index a Wikipedia dump. I know there is code in contrib/benchmark
> for indexing *English* Wikipedia for benchmarking purposes. However, I'd
> like to index a non-English dump, and I actually don't need it for
> benchmarking, I just want to end up with a Lucene index.
>
> Any suggestions where I should start? That is, can anything in
> contrib/benchmark already do this, or is there anything there that I should
> use as a starting point? As opposed to writing my own Wikipedia XML dump
> parser+indexer.
>
> Thanks,
> Otis