[jira] Commented: (LUCENE-994) Change defaults in IndexWriter to maximize "out of the box" performance

Michael McCandless (JIRA) Wed, 26 Sep 2007 07:51:11 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530458 ]


Michael McCandless commented on LUCENE-994:
-------------------------------------------

Hmmm ... your index seems fairly small, since optimize runs pretty
quickly in both cases.  But that would mean (I think) you're not
actually flushing very many segments, given your high RAM buffer size
(42 MB).  So I'm baffled that the merge policy changes your numbers so
much: your 4000 doc test should not (I think?) be doing much merging
at all.
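
To put rough numbers on that expectation, here is a back-of-the-envelope
sketch in plain Java.  It is not Lucene code: the class and method names
are made up for illustration, the 10 KB average buffered size per
document is purely an assumed figure, and merging is idealized as "every
mergeFactor equal-sized segments at one level merge into one segment at
the next level".

```java
final class FlushEstimate {

    // Rough number of RAM-triggered flushes: one per full buffer.
    // Assumption: buffered bytes grow linearly with the documents added.
    static long estimateFlushes(long totalBufferedBytes, long ramBufferBytes) {
        if (ramBufferBytes <= 0) {
            throw new IllegalArgumentException("ramBufferBytes must be > 0");
        }
        return totalBufferedBytes / ramBufferBytes;
    }

    // Idealized merge count for equal-sized flushed segments: each group of
    // mergeFactor segments at one level becomes one segment at the next.
    static long estimateMerges(long flushedSegments, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        long merges = 0;
        long mergesAtLevel = flushedSegments / mergeFactor;
        while (mergesAtLevel > 0) {
            merges += mergesAtLevel;
            mergesAtLevel /= mergeFactor;
        }
        return merges;
    }

    public static void main(String[] args) {
        long ramBufferBytes = 42L * 1024 * 1024;  // 42 MB, as in the test
        long assumedDocBytes = 10L * 1024;        // hypothetical average
        long totalBytes = 4000L * assumedDocBytes;

        long flushes = estimateFlushes(totalBytes, ramBufferBytes);
        System.out.println("RAM-triggered flushes: " + flushes);
        System.out.println("merges at merge factor 3: "
                + estimateMerges(flushes, 3));
    }
}
```

Under those assumptions the buffer never fills before the index is
closed, so there is essentially nothing for either merge policy to do.
Real documents may of course be much larger than the assumed 10 KB,
which is why the actual index size matters.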

Are you creating the index from scratch in each test?  How large is
the resulting index?  Are you using FSDirectory?

I ran my own test on Wikipedia content.  I ran this alg:

  analyzer=org.apache.lucene.analysis.SimpleAnalyzer
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
  directory=FSDirectory
  docs.file=/lucene/wikifull.txt

  ram.flush.mb=42
  max.field.length=2147483647
  merge.factor=3
  compound=false
  autocommit=false

  doc.maker.forever=false
  doc.add.log.step=5000

  ResetSystemErase
  CreateIndex
  {AddDoc >: *
  Optimize
  CloseIndex

  RepSumByName

to index all of wikipedia with the same params you're using (flush @
42 MB, compound false, merge factor 3).

LogByteSizeMergePolicy (the current default) gives this output (times
are best of 2 runs):

  indexing 1198 sec
  optimize  282 sec

LogDocMergePolicy took this long:

  indexing 1216 sec
  optimize  270 sec

I think those numbers are "within the noise" of each other, i.e. pretty
much the same.  This is what I would expect.  So we need to figure out
why my results differ from yours.
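
For what it's worth, the gap can be quantified with simple arithmetic on
the four timings above.  A small sketch in plain Java (the class and
method names are just for illustration, and what counts as "noise" is a
judgment call, not a fixed threshold):

```java
final class TimingComparison {

    // Relative difference between two timings, as a percentage of the
    // faster (smaller) one.
    static double relativeDiffPercent(double a, double b) {
        return Math.abs(a - b) / Math.min(a, b) * 100.0;
    }

    public static void main(String[] args) {
        // Seconds, best of 2 runs, taken from the results above.
        double byteSizeIndexing = 1198, docIndexing = 1216;
        double byteSizeOptimize = 282, docOptimize = 270;

        System.out.printf("indexing differs by %.1f%%%n",
                relativeDiffPercent(byteSizeIndexing, docIndexing));
        System.out.printf("optimize differs by %.1f%%%n",
                relativeDiffPercent(byteSizeOptimize, docOptimize));
    }
}
```

That works out to about 1.5% for indexing and about 4.4% for optimize,
with each policy ahead on one of the two measurements.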

Can you call writer.setInfoStream(System.out) and attach the resulting
output from each of your 4000 doc runs?  Thanks!


> Change defaults in IndexWriter to maximize "out of the box" performance
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-994
>                 URL: https://issues.apache.org/jira/browse/LUCENE-994
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-994.patch
>
>
> This is follow-through from LUCENE-845, LUCENE-847 and LUCENE-870;
> I'll commit this once those three are committed.
> Out of the box performance of IndexWriter is maximized when flushing
> by RAM instead of a fixed document count (the default today) because
> documents can vary greatly in size.
> Likewise, merging performance should be faster when merging by net
> segment size since, to minimize the net IO cost of merging segments
> over time, you want to merge segments of equal byte size.
> Finally, ConcurrentMergeScheduler improves indexing speed
> substantially (25% in a simple initial test in LUCENE-870) because it
> runs the merges in the background and doesn't block
> add/update/deleteDocument calls.  Most machines have concurrency
> between CPU and IO and so it makes sense to default to this
> MergeScheduler.
> Note that these changes will break users of ParallelReader because the
> parallel indices will no longer have matching docIDs.  Such users need
> to switch IndexWriter back to flushing by doc count, and switch the
> MergePolicy back to LogDocMergePolicy.  It's likely also necessary to
> switch the MergeScheduler back to SerialMergeScheduler to ensure
> deterministic docID assignment.
> I think the combination of these three default changes, plus other
> performance improvements for indexing (LUCENE-966, LUCENE-843,
> LUCENE-963, LUCENE-969, LUCENE-871, etc.) should make for some sizable
> performance gains in Lucene 2.3!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


