[
https://issues.apache.org/jira/browse/LUCENE-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008611#comment-14008611
]
Michael McCandless commented on LUCENE-5705:
--------------------------------------------
CMS has been the default for a long time now (even back when LogMP was the
default). TMP won't change things that much for the append-only case.
I think even on fast disks your merging can fall behind: it's a question of
whether the indexing threads can produce segments faster than merging can
consolidate them. Also, the amount of free RAM that the OS can use for
readahead on the files opened for merging can have a big impact on merge
performance.
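For reference, bumping maxMergeCount (the subject of this issue) is an
IndexWriterConfig-level change; a minimal sketch against the 4.x API (the
analyzer below is just a placeholder):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.ConcurrentMergeScheduler;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.util.Version;

    class HigherMergeBacklog {
      static IndexWriterConfig newConfig() {
        ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
        // Allow more merges to be pending before indexing threads are stalled.
        cms.setMaxMergeCount(6);
        // Keep a single merge thread, the usual choice for spinning disks.
        cms.setMaxThreadCount(1);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48,
            new StandardAnalyzer(Version.LUCENE_48));
        iwc.setMergeScheduler(cms);
        return iwc;
      }
    }
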
If you search the infoStream output for "pausing thread" and "unpausing thread"
you should see CMS pausing the largest merge(s) when more than one merge is
running. Search for "too many merges; stalling..." to see when the harsh
back-pressure kicks in.
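Those messages only appear if the writer's infoStream is enabled; a minimal
sketch (4.x API) of turning it on so you can grep for them:

    import org.apache.lucene.index.IndexWriterConfig;

    class EnableInfoStream {
      // Route IndexWriter's diagnostics to stdout, then grep the output for
      // "pausing thread", "unpausing thread" and "too many merges; stalling".
      static void enable(IndexWriterConfig iwc) {
        iwc.setInfoStream(System.out);
      }
    }
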
Doing all merging at the end is somewhat dangerous; you should only do it if
you know you will do no searching on the index until the merging has completed.
I suspect that, net/net, it will make indexing take longer, because you are not
soaking up concurrency during indexing to get merges done.
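To be concrete about what "all merging at the end" means, here is a rough
sketch (4.x API; NoMergePolicy.COMPOUND_FILES is the 4.x singleton, later
versions use NoMergePolicy.INSTANCE) of the pattern being cautioned against:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.NoMergePolicy;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.Version;

    class MergeOnlyAtTheEnd {
      static void run(Directory dir) throws Exception {
        // Phase 1: bulk-index with merging disabled; segments pile up.
        IndexWriterConfig bulk = new IndexWriterConfig(Version.LUCENE_48,
            new StandardAnalyzer(Version.LUCENE_48));
        bulk.setMergePolicy(NoMergePolicy.COMPOUND_FILES);
        IndexWriter writer = new IndexWriter(dir, bulk);
        // ... addDocument()/updateDocument() calls go here ...
        writer.close();

        // Phase 2: reopen with a normal merge policy and merge down.
        // Searching before this completes would see far too many segments.
        IndexWriterConfig merge = new IndexWriterConfig(Version.LUCENE_48,
            new StandardAnalyzer(Version.LUCENE_48));
        IndexWriter merger = new IndexWriter(dir, merge);
        merger.forceMerge(1);
        merger.close();
      }
    }
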
Net/net it's really important that Lucene doesn't allow too many segments in
the index; the "harsh" back-pressure Lucene applies today (hard stall of ALL
indexing threads) is effective but ... harsh.
If we improved CMS to make this behavior "optional", so that by default it
continued its effective-but-harsh back-pressure but an app (Solr, ES) could
instead do its own thing (ES throttles down to one indexing thread, instead of
the zero indexing threads that Lucene's stall allows), then Solr could do
something similar here.
Maybe open a new issue for that? (Hmm: is Solr using multiple indexing threads
in your case...?).
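To make the "throttle to one indexing thread" idea concrete, here is a minimal
app-side sketch (hypothetical code, not Lucene or Elasticsearch internals; how
the app detects a merge backlog is left out):

    import java.util.concurrent.Semaphore;

    // Hypothetical app-level throttle: instead of Lucene stalling every
    // indexing thread, the app shrinks its own indexing concurrency to a
    // single thread while merges catch up, then restores it.
    final class IndexingThrottle {
      // Semaphore.reducePermits() is protected, so expose it via a subclass.
      private static final class AdjustableSemaphore extends Semaphore {
        AdjustableSemaphore(int permits) { super(permits); }
        void reduce(int n) { reducePermits(n); }
      }

      private final int maxThreads;
      private final AdjustableSemaphore permits;
      private boolean throttled;

      IndexingThrottle(int maxThreads) {
        this.maxThreads = maxThreads;
        this.permits = new AdjustableSemaphore(maxThreads);
      }

      // Indexing threads wrap each document (or batch) in this call.
      void runIndexingWork(Runnable work) throws InterruptedException {
        permits.acquire();
        try {
          work.run();
        } finally {
          permits.release();
        }
      }

      // Called when the app decides merges are falling behind.
      synchronized void throttleToOneThread() {
        if (!throttled) {
          throttled = true;
          permits.reduce(maxThreads - 1);
        }
      }

      // Called once the merge backlog has drained.
      synchronized void unthrottle() {
        if (throttled) {
          throttled = false;
          permits.release(maxThreads - 1);
        }
      }
    }
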
> ConcurrentMergeScheduler/maxMergeCount default is too low
> ---------------------------------------------------------
>
> Key: LUCENE-5705
> URL: https://issues.apache.org/jira/browse/LUCENE-5705
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/other
> Affects Versions: 4.8
> Reporter: Shawn Heisey
> Assignee: Shawn Heisey
> Priority: Minor
> Fix For: 4.9
>
> Attachments: LUCENE-5705.patch, LUCENE-5705.patch, dih-example.patch,
> infostream-s0build-shard.zip
>
>
> The default value for maxMergeCount in ConcurrentMergeScheduler is 2. This
> causes problems for Solr's dataimport handler when very large imports are
> done from a JDBC source.
> What happens is that when three merge tiers are scheduled at the same time,
> the add/update thread will stop for several minutes while the largest merge
> finishes. In the meantime, the dataimporter JDBC connection to the database
> will time out, and when the add/update thread resumes, the import will fail
> because the ResultSet throws an exception. Setting maxMergeCount to 6
> eliminates this issue for virtually any size import -- although it is
> theoretically possible to have that many simultaneous merge tiers, I've never
> seen it.
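> For reference, the Solr-side setting involved is the mergeScheduler block in
> solrconfig.xml's <indexConfig> (roughly as below; shown only as an
> illustration, not taken verbatim from the attached patches):
>
>     <indexConfig>
>       <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
>         <int name="maxMergeCount">6</int>
>         <int name="maxThreadCount">1</int>
>       </mergeScheduler>
>     </indexConfig>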
> As long as maxThreads is properly set (the default value of 1 is appropriate
> for most installations), I cannot think of a really good reason that the
> default for maxMergeCount should be so low. If someone does need to strictly
> control the number of threads that get created, they can reduce the number.
> Perhaps someone with more experience knows of a really good reason to make
> this default low?
> I'm not sure what the new default number should be, but I'd like to avoid
> bikeshedding. I don't think it should be Integer.MAX_VALUE.