Re: Wikipedia Index

Greg Bowyer Tue, 19 Jun 2012 17:43:05 -0700

It depends on what you want, but the Wikipedia data dumps can be found here:

http://en.wikipedia.org/wiki/Wikipedia:Database_download

On 19/06/12 17:03, Elshaimaa Ali wrote:

I only have the source text in a MySQL database.
Do you know where I can download it in XML, and is it possible to split the
documents into content and title?

Thanks,
shaimaa

From: luc...@mikemccandless.com
Date: Tue, 19 Jun 2012 19:48:24 -0400
Subject: Re: Wikipedia Index
To: java-user@lucene.apache.org

I have the index locally ... but it's really impractical to send it,
especially if you already have the source text locally.

Maybe index directly from the source text instead of via a database?
Lucene's benchmark contrib/module has code to decode the XML into
documents...
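
To illustrate the title/content split asked about above, here is a minimal
sketch using only the JDK's StAX parser. It is not the benchmark module's
code. It assumes the usual MediaWiki export layout, where each <page> element
holds a <title> and a <text> element; the class and method names
(WikiDumpSplitter, forEachArticle) are made up for this example.

```java
import java.io.Reader;
import java.util.function.BiConsumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams a MediaWiki XML export and hands each
// article's title and raw wiki text to a callback, one page at a time.
final class WikiDumpSplitter {

    static void forEachArticle(Reader in, BiConsumer<String, String> handler)
            throws XMLStreamException {
        XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
        String title = null;
        String text = null;
        try {
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = xml.getLocalName();
                    if (name.equals("page")) {
                        // New article: reset the per-page state.
                        title = null;
                        text = null;
                    } else if (name.equals("title")) {
                        title = xml.getElementText();
                    } else if (name.equals("text")) {
                        // Assumed layout: <text> sits inside <revision>.
                        text = xml.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && xml.getLocalName().equals("page")) {
                    handler.accept(title, text);
                }
            }
        } finally {
            xml.close();
        }
    }
}
```

The callback is where a Lucene Document with "title" and "contents" fields
would be built and added. The text handed over is raw wiki markup, so any
markup stripping would still have to happen before or during analysis.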

Mike McCandless

http://blog.mikemccandless.com

On Tue, Jun 19, 2012 at 6:27 PM, Elshaimaa Ali
<elshaimaa....@hotmail.com>  wrote:

Thanks Mike for the prompt reply. Do you have a fully indexed version of
Wikipedia? I mainly need two fields for each document: the indexed content of
the Wikipedia articles and the title. If there is any place where I can get
the index, that will save me a great deal of time.

Regards,
shaimaa

From: luc...@mikemccandless.com
Date: Tue, 19 Jun 2012 16:29:39 -0400
Subject: Re: Wikipedia Index
To: java-user@lucene.apache.org

Likely the bottleneck is pulling content from the database?  Maybe
test just that and see how long it takes?
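
One way to run that test is to walk the same source with no Lucene calls at
all and time it. The sketch below is only an illustration: FetchTimer and its
time method are hypothetical names, and the extract function stands in for
whatever pulls the article text out of each row or page object.

```java
import java.util.function.Function;

// Hypothetical helper: iterates a source and extracts each item's text,
// doing no indexing, so the elapsed time reflects fetch cost alone.
final class FetchTimer {

    /** Returns {item count, total characters, elapsed nanoseconds}. */
    static <T> long[] time(Iterable<T> source, Function<? super T, String> extract) {
        long start = System.nanoTime();
        long count = 0;
        long chars = 0;
        for (T item : source) {
            // Touch the text so lazily loaded content is really fetched.
            chars += extract.apply(item).length();
            count++;
        }
        long elapsed = System.nanoTime() - start;
        return new long[] {count, chars, elapsed};
    }
}
```

If this loop alone takes most of the 24 hours, the time is going into the
database access rather than into Lucene.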

24 hours is way too long to index all of Wikipedia.  For example, we
index Wikipedia every night for our trunk/4.0 performance tests, here:

     http://people.apache.org/~mikemccand/lucenebench/indexing.html

The export is a bit old now (01/15/2011) but it takes just under 6
minutes to fully index it.  This is on a fairly beefy machine (24
cores)... and trunk/4.0 has substantial concurrency improvements over
3.x.
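
IndexWriter can be shared between threads, so one way to use those cores is
to have a single thread read the source and several workers do the adding.
The sketch below shows just that producer/consumer shape with the JDK's
concurrency classes. ParallelFeeder, feed, and the sink parameter are
hypothetical; in real use the sink would build a Document and call
writer.addDocument on the shared writer. The queue size of 1024 is an
arbitrary choice.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical helper: one producer (the caller's thread) reads the source,
// `threads` workers pass each item to the sink concurrently.
final class ParallelFeeder {

    /** Returns the number of items the sink accepted without throwing. */
    @SuppressWarnings("unchecked")
    static <T> long feed(Iterable<T> source, Consumer<? super T> sink, int threads)
            throws InterruptedException {
        BlockingQueue<Object> queue = new ArrayBlockingQueue<>(1024);
        final Object poison = new Object(); // end-of-input marker, one per worker
        AtomicLong done = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        Object next = queue.take();
                        if (next == poison) {
                            return;
                        }
                        try {
                            sink.accept((T) next);
                            done.incrementAndGet();
                        } catch (RuntimeException e) {
                            // Keep the worker alive so the producer never blocks forever.
                            e.printStackTrace();
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        for (T item : source) {
            queue.put(item); // blocks when workers fall behind
        }
        for (int i = 0; i < threads; i++) {
            queue.put(poison);
        }
        pool.shutdown();
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        return done.get();
    }
}
```

This only helps once fetching the source is no longer the bottleneck, since
a single slow reader would leave the workers idle.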

You can also try the ideas here:

     http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

Mike McCandless

http://blog.mikemccandless.com

On Tue, Jun 19, 2012 at 12:27 PM, Elshaimaa Ali
<elshaimaa....@hotmail.com>  wrote:

Hi everybody,
I'm using Lucene 3.6 to index Wikipedia documents, which is over 3 million
articles. The data is in a MySQL database, and it is taking more than 24 hours
so far. Do you know any tips that can speed up the indexing process?
Here is my code:

public static void main(String[] args) {
    String indexPath = INDEXPATH;
    IndexWriter writer = null;

    DatabaseConfiguration dbConfig = new DatabaseConfiguration();
    dbConfig.setHost(host);
    dbConfig.setDatabase(data);
    dbConfig.setUser(user);
    dbConfig.setPassword(password);
    dbConfig.setLanguage(Language.english);

    try {
        Directory dir = FSDirectory.open(new File(indexPath));
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31, analyzer);
        iwc.setOpenMode(OpenMode.CREATE);
        writer = new IndexWriter(dir, iwc);
    } catch (IOException e) {
        System.out.println(" caught a " + e.getClass() +
            "\n with message: " + e.getMessage());
    }

    try {
        Wikipedia wiki = new Wikipedia(dbConfig);
        Iterable<Page> wikipages = wiki.getPages(); //get wikipedia articles from the database
        Iterator iter = wikipages.iterator();
        while (iter.hasNext()) {
            Page p = (Page) iter.next();
            System.out.println(p.getTitle().getPlainTitle());
            Document doc = new Document();
            Field contentField = new Field("contents", p.getPlainText(),
                Field.Store.NO, Field.Index.ANALYZED);
            Field titleField = new Field("title", p.getTitle().getPlainTitle(),
                Field.Store.YES, Field.Index.NOT_ANALYZED);
            doc.add(contentField); // wiki page text
            doc.add(titleField); // wiki page title
            writer.addDocument(doc);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
