Re: Wikipedia Index

Reyna Melara Tue, 19 Jun 2012 14:27:20 -0700

Would it be possible to index Wikipedia on a 2-core machine with 3 GB of
RAM? I have had the same problem trying to index it.


I've tried with a dump from April 2011.

Thanks
Reyna
CIC-IPN
Mexico

2012/6/19 Michael McCandless <[email protected]>

> Likely the bottleneck is pulling content from the database?  Maybe
> test just that and see how long it takes?
>
> 24 hours is way too long to index all of Wikipedia.  For example, we
> index Wikipedia every night for our trunk/4.0 performance tests, here:
>
>    http://people.apache.org/~mikemccand/lucenebench/indexing.html
>
> The export is a bit old now (01/15/2011) but it takes just under 6
> minutes to fully index it.  This is on a fairly beefy machine (24
> cores)... and trunk/4.0 has substantial concurrency improvements over
> 3.x.
>
> You can also try the ideas here:
>
>    http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Tue, Jun 19, 2012 at 12:27 PM, Elshaimaa Ali
> <[email protected]> wrote:
> >
> > Hi everybody,
> > I'm using Lucene 3.6 to index Wikipedia documents, which is over 3 million
> > articles. The data is in a MySQL database and it is taking more than 24
> > hours so far. Do you know any tips that can speed up the indexing process?
> > Here is my code:
> > public static void main(String[] args) {
> >     String indexPath = INDEXPATH;
> >     IndexWriter writer = null;
> >     DatabaseConfiguration dbConfig = new DatabaseConfiguration();
> >     dbConfig.setHost(host);
> >     dbConfig.setDatabase(data);
> >     dbConfig.setUser(user);
> >     dbConfig.setPassword(password);
> >     dbConfig.setLanguage(Language.english);
> >
> >     try {
> >         Directory dir = FSDirectory.open(new File(indexPath));
> >         Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
> >         IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31, analyzer);
> >         iwc.setOpenMode(OpenMode.CREATE);
> >         writer = new IndexWriter(dir, iwc);
> >     } catch (IOException e) {
> >         System.out.println(" caught a " + e.getClass() +
> >                 "\n with message: " + e.getMessage());
> >     }
> >
> >     try {
> >         Wikipedia wiki = new Wikipedia(dbConfig);
> >         Iterable<Page> wikipages = wiki.getPages(); // get wikipedia articles from the database
> >         Iterator iter = wikipages.iterator();
> >         while (iter.hasNext()) {
> >             Page p = (Page) iter.next();
> >             System.out.println(p.getTitle().getPlainTitle());
> >             Document doc = new Document();
> >             Field contentField = new Field("contents", p.getPlainText(),
> >                     Field.Store.NO, Field.Index.ANALYZED);
> >             Field titleField = new Field("title", p.getTitle().getPlainTitle(),
> >                     Field.Store.YES, Field.Index.NOT_ANALYZED);
> >             doc.add(contentField); // wiki page text
> >             doc.add(titleField); // wiki page title
> >             writer.addDocument(doc);
> >         }
> >     } catch (Exception e) {
> >         e.printStackTrace();
> >     }
> > }
> >

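For anyone following the thread, Mike's suggestion quoted above (time just
the database read and see how long it takes) could be sketched roughly as
follows. This is only a sketch: SourceReadTimer, Result and time are made-up
names, and the list in main is stand-in data. In the posted code the same
loop would presumably run over wiki.getPages() and call p.getPlainText(),
with all the Lucene calls left out.

```java
import java.util.Arrays;

/** Times how long it takes just to pull items from a source, with no indexing work. */
public class SourceReadTimer {

    /** Outcome of one run: items read, characters seen, elapsed nanoseconds. */
    public static final class Result {
        public final long items;
        public final long chars;
        public final long nanos;

        Result(long items, long chars, long nanos) {
            this.items = items;
            this.chars = chars;
            this.nanos = nanos;
        }

        /** Read throughput; returns 0 if the clock did not advance. */
        public double itemsPerSecond() {
            return nanos == 0 ? 0.0 : items * 1e9 / nanos;
        }
    }

    /** Walks the whole source, touching every item, and reports counts and elapsed time. */
    public static Result time(Iterable<String> source) {
        long start = System.nanoTime();
        long items = 0;
        long chars = 0;
        for (String text : source) {
            items++;
            if (text != null) {
                chars += text.length(); // touch the text so it is really fetched
            }
        }
        return new Result(items, chars, System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Stand-in data (assumption); the real test would iterate the page texts
        // coming out of the database instead.
        Iterable<String> pages = Arrays.asList("first page text", "second page text");
        Result r = time(pages);
        System.out.println(r.items + " items, " + r.chars + " chars, "
                + (r.nanos / 1000000L) + " ms, " + r.itemsPerSecond() + " items/s");
    }
}
```

If the read-only pass already takes most of the 24 hours, the time is going
into the database and text extraction rather than into Lucene.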

-- 
Reyna
