Indeed, from the log fragment I can see the merges are just really slow. You had 6 merges run:

IW 0 [Wed Aug 03 22:43:24 CEST 2011; Lucene Merge Thread #0]: merged segment size=1234.550 MB vs estimate=1300.063 MB
IW 0 [Thu Aug 04 00:15:54 CEST 2011; Lucene Merge Thread #4]: merged segment size=740.168 MB vs estimate=780.602 MB
IW 0 [Thu Aug 04 00:29:49 CEST 2011; Lucene Merge Thread #1]: merged segment size=1165.862 MB vs estimate=1224.516 MB
IW 0 [Thu Aug 04 00:39:36 CEST 2011; Lucene Merge Thread #5]: merged segment size=899.690 MB vs estimate=943.422 MB
IW 0 [Thu Aug 04 00:39:52 CEST 2011; Lucene Merge Thread #3]: merged segment size=1046.637 MB vs estimate=1097.111 MB
IW 0 [Thu Aug 04 01:07:04 CEST 2011; Lucene Merge Thread #2]: merged segment size=1281.083 MB vs estimate=1340.087 MB

And the times are long:

IW 0 [Wed Aug 03 22:43:25 CEST 2011; Lucene Merge Thread #0]: merge time 4194615 msec for 744793 docs
IW 0 [Thu Aug 04 00:15:55 CEST 2011; Lucene Merge Thread #4]: merge time 6461433 msec for 1205717 docs
IW 0 [Thu Aug 04 00:29:50 CEST 2011; Lucene Merge Thread #1]: merge time 9783566 msec for 1472419 docs
IW 0 [Thu Aug 04 00:39:38 CEST 2011; Lucene Merge Thread #5]: merge time 7209832 msec for 1468231 docs
IW 0 [Thu Aug 04 00:39:53 CEST 2011; Lucene Merge Thread #3]: merge time 8662995 msec for 1699997 docs
IW 0 [Thu Aug 04 01:07:04 CEST 2011; Lucene Merge Thread #2]: merge time 11197195 msec for 1944231 docs

Though, for all but the first merge, the times include the "paused" time, so they are not a real measure of how long each merge took.

Still, 4195 seconds to produce a ~1300 MB merged segment is really quite long. I think one big reason is that you are allowing too many merge threads at once. I would set CMS.setMaxThreadCount(1) and CMS.setMaxMergeCount(2), and I would lower the number of indexing threads to 2.

I think your IO system is a big bottleneck here, not only because of merging and flushing but also because presumably the source of the docs is on this same single laptop spinning drive, right?

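To put a number on "really quite long", here is a rough sketch that turns the first merge's log values into a throughput figure. Only Merge Thread #0 is used, since the other merge times include paused time. The class and method names are made up for illustration; the inputs are copied from the infoStream lines above.

```java
import java.util.Locale;

// Back-of-the-envelope throughput for one merge, using values from the infoStream output.
// Class and method names are illustrative only; nothing here is part of Lucene.
final class MergeThroughput {

    // Megabytes of merged segment produced per second of wall-clock merge time.
    static double mbPerSec(double mergedSizeMB, long mergeTimeMsec) {
        return mergedSizeMB / (mergeTimeMsec / 1000.0);
    }

    // Documents merged per second of wall-clock merge time.
    static double docsPerSec(long docs, long mergeTimeMsec) {
        return docs / (mergeTimeMsec / 1000.0);
    }

    public static void main(String[] args) {
        // Lucene Merge Thread #0: size=1234.550 MB, merge time 4194615 msec, 744793 docs.
        double sizeMB = 1234.550;
        long timeMsec = 4194615L;
        long docs = 744793L;

        System.out.println(String.format(Locale.ROOT, "%.2f MB/s, %.0f docs/s",
                mbPerSec(sizeMB, timeMsec), docsPerSec(docs, timeMsec)));
    }
}
```

For the first merge this works out to roughly 0.29 MB/s and about 178 docs/s, which is far below what even a single spinning drive can sustain for sequential writes, so the merge is most likely losing its time to seeks from competing threads rather than to raw bandwidth.
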
Mike McCandless

http://blog.mikemccandless.com

On Wed, Aug 3, 2011 at 7:31 PM, Devon H. O'Dell <devon.od...@gmail.com> wrote:
> For what it's worth, I've seen this happen too (using the stock Lucene
> 3.3 Java APIs), but it requires me to index many millions of
> documents, and doesn't start being a really big problem until the
> indexes get to be closer to 250GB in size. When they reach around 1TB,
> it will take around an hour for the merge to complete (which is
> frustrating). Similar to Pierre-Henri, I see virtually no disk I/O
> when it happens and the system in question is one of the Amazon EC2
> "Huge" instances (so, something like 8 cores and 32GB RAM) and disk
> I/O during indexing pushes around 100MB/s.
>
> If it would be useful to see additional reports / information from
> this scenario, I'm sure I can get something put together.
>
> --dho
>
> 2011/8/3 Pierre-Henri Toussaint <pierrehenri.toussa...@gmail.com>:
>> OK so the problem definitely comes from the slow merging.
>> I slightly increased the number merge count and thread to avoid the problem
>> described previously. But as expected, it just delayed it !
>>
>> results : 75 minutes to index the 33GB xml file, and 150 minutes to finish
>> the merge after indexer.close.
>> See uploaded http://lucene.472066.n3.nabble.com/file/n3223874/slowmerge log
>> file containing: logs (timems:numberofdocsindexed/current_title) +
>> infoStream + random threaddump.
>> You can spot "indexer.close (no optimize)" (line 5721) for indexing
>> completion and the beginning of merging nightmare.
>>
>> *conf :
>> */conf.setRAMBufferSizeMB(512);
>> ConcurrentMergeScheduler mergeScheduler = new ConcurrentMergeScheduler();
>> mergeScheduler.setMaxMergeCount(6);
>> mergeScheduler.setMaxThreadCount(4);
>> conf.setMergeScheduler(mergeScheduler);
>> writer = new ThreadedIndexWriter(directory, analyzer, true, 2, 5, conf);/
>>>>everything else default.
no optimize called
>> *documents :
>> */pageDocument.add(new Field("title", page.getTitle(), Field.Store.YES,
>> Field.Index.NO));
>> pageDocument.add(new Field("text", page.getText(), Field.Store.NO,
>> Field.Index.ANALYZED));
>> if (page.getContributorUserName() != null)
>> pageDocument.add(new Field("contributorUserName",
>> page.getContributorUserName(), Field.Store.NO, Field.Index.ANALYZED));/
>> *infoStream info :*
>> setInfoStream
>> deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@2dafae45
>> dir=org.apache.lucene.store.NIOFSDirectory@/Users/ptoussaint/Documents/workspace/wikisearch/index2
>> lockFactory=org.apache.lucene.store.NativeFSLockFactory@39dd3812
>> index=
>> version=4.0-SNAPSHOT
>> matchVersion=LUCENE_40
>> analyzer=org.pache.soundcloud.wikisearch.Indexer$WikiAnalyzer
>> delPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy
>> commit=null
>> openMode=CREATE_OR_APPEND
>> similarityProvider=org.apache.lucene.search.DefaultSimilarityProvider
>> termIndexInterval=32
>> mergeScheduler=org.apache.lucene.index.ConcurrentMergeScheduler
>> default WRITE_LOCK_TIMEOUT=1000
>> writeLockTimeout=1000
>> maxBufferedDeleteTerms=-1
>> ramBufferSizeMB=512.0
>> maxBufferedDocs=-1
>> mergedSegmentWarmer=null
>> codecProvider=org.apache.lucene.index.codecs.CoreCodecProvider@6a8c436b
>> mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10,
>> maxMergeAtOnceExplicit=30, maxMergedSegmentMB=5120.0, floorSegmentMB=2.0,
>> expungeDeletesPctAllowed=10.0, segmentsPerTier=10.0, useCompoundFile=true,
>> noCFSRatio=0.1
>> indexerThreadPool=org.apache.lucene.index.ThreadAffinityDocumentsWriterThreadPool@1e9e5c73
>> readerPooling=false
>> readerTermsIndexDivisor=1
>> flushPolicy=org.apache.lucene.index.FlushByRamOrCountsPolicy@2ec791b9
>> perThreadHardLimitMB=1945
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Thread-locking-while-merging-ConcurrentMergeScheduler-issue-tp3222427p3223874.html
>> Sent from the
Lucene - Java Users mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org