Is it possible to index Wikipedia on a 2-core machine with 3 GB of RAM? I have had the same problem trying to index it.
I've tried with a dump from April 2011.

Thanks,
Reyna
CIC-IPN
Mexico

2012/6/19 Michael McCandless <luc...@mikemccandless.com>

> Likely the bottleneck is pulling content from the database? Maybe
> test just that and see how long it takes?
>
> 24 hours is way too long to index all of Wikipedia. For example, we
> index Wikipedia every night for our trunk/4.0 performance tests, here:
>
>     http://people.apache.org/~mikemccand/lucenebench/indexing.html
>
> The export is a bit old now (01/15/2011) but it takes just under 6
> minutes to fully index it. This is on a fairly beefy machine (24
> cores)... and trunk/4.0 has substantial concurrency improvements over
> 3.x.
>
> You can also try the ideas here:
>
>     http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Tue, Jun 19, 2012 at 12:27 PM, Elshaimaa Ali
> <elshaimaa....@hotmail.com> wrote:
> >
> > Hi everybody
> > I'm using Lucene 3.6 to index Wikipedia documents, which is over 3 million
> > articles. The data is in a MySQL database and it is taking more than 24
> > hours so far. Do you know any tips that can speed up the indexing process?
> >
> > Here is my code:
> >
> > public static void main(String[] args) {
> >     String indexPath = INDEXPATH;
> >     IndexWriter writer = null;
> >     DatabaseConfiguration dbConfig = new DatabaseConfiguration();
> >     dbConfig.setHost(host);
> >     dbConfig.setDatabase(data);
> >     dbConfig.setUser(user);
> >     dbConfig.setPassword(password);
> >     dbConfig.setLanguage(Language.english);
> >
> >     try {
> >         Directory dir = FSDirectory.open(new File(indexPath));
> >         Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
> >         IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31, analyzer);
> >         iwc.setOpenMode(OpenMode.CREATE);
> >         writer = new IndexWriter(dir, iwc);
> >     } catch (IOException e) {
> >         System.out.println(" caught a " + e.getClass()
> >                 + "\n with message: " + e.getMessage());
> >     }
> >     try {
> >         Wikipedia wiki = new Wikipedia(dbConfig);
> >         Iterable<Page> wikipages = wiki.getPages();
> >         // get wikipedia articles from the database
> >         Iterator iter = wikipages.iterator();
> >         while (iter.hasNext()) {
> >             Page p = (Page) iter.next();
> >             System.out.println(p.getTitle().getPlainTitle());
> >             Document doc = new Document();
> >             Field contentField = new Field("contents", p.getPlainText(),
> >                     Field.Store.NO, Field.Index.ANALYZED);
> >             Field titleField = new Field("title", p.getTitle().getPlainTitle(),
> >                     Field.Store.YES, Field.Index.NOT_ANALYZED);
> >             doc.add(contentField); // wiki page text
> >             doc.add(titleField);   // wiki page title
> >             writer.addDocument(doc);
> >         }
> >     } catch (Exception e) {
> >         e.printStackTrace();
> >     }
> > }
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>
--
Reyna
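
One way to follow Mike's suggestion of testing just the database read is to time a pass over the source with no Lucene calls at all. The sketch below is only an illustration and uses nothing outside the JDK: the class name SourceTimer, the drain helper, and the Result holder are hypothetical names, and the list of strings in main is a stand-in for the pages iterable and the plain-text call in the quoted code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class SourceTimer {

    /** Outcome of one pass over the source. */
    static final class Result {
        final long items;          // number of items read
        final long chars;          // total length of the extracted text
        final long elapsedMillis;  // wall-clock time for the whole pass

        Result(long items, long chars, long elapsedMillis) {
            this.items = items;
            this.chars = chars;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /**
     * Reads every item and extracts its text, but does no indexing.
     * The elapsed time therefore covers only fetching and text extraction.
     */
    static <T> Result drain(Iterable<T> source, Function<? super T, String> extractText) {
        long start = System.nanoTime();
        long items = 0;
        long chars = 0;
        for (T item : source) {
            String text = extractText.apply(item);
            if (text != null) {
                chars += text.length();
            }
            items++;
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Result(items, chars, elapsedMillis);
    }

    public static void main(String[] args) {
        // Stand-in data; in the quoted code this would be the pages read from MySQL.
        List<String> fakePages = Arrays.asList("first page text", "second page text");
        Result r = drain(fakePages, s -> s);
        System.out.println(r.items + " items, " + r.chars + " chars, "
                + r.elapsedMillis + " ms");
    }
}
```

If a pass like this over the real pages already takes many hours, the time is going into the database read or the text extraction rather than into Lucene, and changing IndexWriter settings would probably not help much.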