Re: Wikipedia Index

Michael McCandless Tue, 19 Jun 2012 13:30:28 -0700

The bottleneck is likely pulling content from the database.  Maybe
test just that step, with no indexing, and see how long it takes?
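
One way to run that test, as a minimal sketch: iterate over the pages
exactly as the indexing loop does, but leave Lucene out and only time
the pull.  PullTimer and timePull are hypothetical names, not part of
any library; the Iterable stands in for wiki.getPages() and the
extractor for p.getPlainText().

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: measures how long it takes just to pull text
// from a source, with no Lucene calls involved.
class PullTimer {

    /** Totals from one timing run. */
    static final class Result {
        final long items;        // number of items pulled
        final long chars;        // total characters of extracted text
        final long elapsedNanos; // wall-clock time for the whole pull

        Result(long items, long chars, long elapsedNanos) {
            this.items = items;
            this.chars = chars;
            this.elapsedNanos = elapsedNanos;
        }
    }

    /**
     * Iterates the source once, extracts the text of every item, and
     * returns the counts plus the elapsed time. Nothing is indexed.
     */
    static <T> Result timePull(Iterable<T> source,
                               Function<? super T, String> extractText) {
        long items = 0;
        long chars = 0;
        long start = System.nanoTime();
        for (T item : source) {
            String text = extractText.apply(item);
            if (text != null) {
                chars += text.length();
            }
            items++;
        }
        return new Result(items, chars, System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Stand-in data; in the real test this would be wiki.getPages()
        // with p -> p.getPlainText() as the extractor (assumed usage).
        List<String> fakePages = Arrays.asList("alpha", "beta", "gamma");
        Result r = timePull(fakePages, s -> s);
        System.out.println(r.items + " pages, " + r.chars + " chars in "
                + (r.elapsedNanos / 1_000_000) + " ms");
    }
}
```

If this loop alone takes most of the 24 hours, the time is going into
the database and text extraction rather than into Lucene.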


24 hours is way too long to index all of Wikipedia.  For example, we
index Wikipedia every night for our trunk/4.0 performance tests, here:

    http://people.apache.org/~mikemccand/lucenebench/indexing.html

The export is a bit old now (01/15/2011) but it takes just under 6
minutes to fully index it.  This is on a fairly beefy machine (24
cores)... and trunk/4.0 has substantial concurrency improvements over
3.x.

You can also try the ideas here:

    http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
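
One idea that ties in with the concurrency point above is feeding a
single IndexWriter from several threads (IndexWriter is safe to share
across threads).  Below is a rough sketch of that pattern using only
the JDK.  ParallelFeeder and feed are hypothetical names, and the sink
is a stand-in for "build a Document and call writer.addDocument(doc)";
none of this is taken from the benchmark code.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical helper: one thread pulls items from the source while a
// fixed pool of workers passes them to a thread-safe sink.
class ParallelFeeder {

    /**
     * Hands every item in source to sink using the given number of
     * worker threads. Returns how many items the sink completed.
     */
    static <T> long feed(Iterable<T> source, int threads,
                         Consumer<? super T> sink) {
        // Bounded queue plus CallerRunsPolicy: when the workers fall
        // behind, the pulling thread runs the task itself, so pending
        // items cannot pile up in memory.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(threads * 4),
                new ThreadPoolExecutor.CallerRunsPolicy());
        AtomicLong processed = new AtomicLong();
        try {
            for (T item : source) {
                pool.execute(() -> {
                    sink.accept(item); // e.g. build doc, writer.addDocument(doc)
                    processed.incrementAndGet();
                });
            }
        } finally {
            pool.shutdown(); // accept no new tasks, let queued ones finish
        }
        try {
            // Wait until every queued task has run.
            while (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                // still draining
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return processed.get();
    }
}
```

This only helps if analysis and indexing are the slow part.  If the
single thread pulling from the database cannot keep the workers busy,
extra indexing threads will not change the total time.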

Mike McCandless

http://blog.mikemccandless.com

On Tue, Jun 19, 2012 at 12:27 PM, Elshaimaa Ali
<[email protected]> wrote:
>
> Hi everybody,
> I'm using Lucene 3.6 to index Wikipedia documents, which is over 3 million
> articles. The data is in a MySQL database, and indexing has taken more than
> 24 hours so far. Do you know any tips that can speed up the indexing process?
> Here is my code:
> public static void main(String[] args) {
>     String indexPath = INDEXPATH;
>     IndexWriter writer = null;
>     DatabaseConfiguration dbConfig = new DatabaseConfiguration();
>     dbConfig.setHost(host);
>     dbConfig.setDatabase(data);
>     dbConfig.setUser(user);
>     dbConfig.setPassword(password);
>     dbConfig.setLanguage(Language.english);
>
>     try {
>         Directory dir = FSDirectory.open(new File(indexPath));
>         Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
>         IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31, analyzer);
>         iwc.setOpenMode(OpenMode.CREATE);
>         writer = new IndexWriter(dir, iwc);
>     } catch (IOException e) {
>         System.out.println(" caught a " + e.getClass() +
>                 "\n with message: " + e.getMessage());
>     }
>
>     try {
>         Wikipedia wiki = new Wikipedia(dbConfig);
>         Iterable<Page> wikipages = wiki.getPages(); // get wikipedia articles from the database
>         Iterator iter = wikipages.iterator();
>         while (iter.hasNext()) {
>             Page p = (Page) iter.next();
>             System.out.println(p.getTitle().getPlainTitle());
>             Document doc = new Document();
>             Field contentField = new Field("contents", p.getPlainText(),
>                     Field.Store.NO, Field.Index.ANALYZED);
>             Field titleField = new Field("title", p.getTitle().getPlainTitle(),
>                     Field.Store.YES, Field.Index.NOT_ANALYZED);
>             doc.add(contentField); // wiki page text
>             doc.add(titleField); // wiki page title
>             writer.addDocument(doc);
>         }
>     } catch (Exception e) {
>         e.printStackTrace();
>     }
> }
>
