[
https://issues.apache.org/jira/browse/LUCENE-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008152#comment-14008152
]
Shawn Heisey commented on LUCENE-5705:
--------------------------------------
bq. Do you know why merges can't keep up in your use case? E.g. are you
throttling the merge IO?
I have the TieredMergePolicy (TMP) equivalent of mergeFactor=35, and I'm
importing 16 million docs from MySQL into each shard; the final shard size is
over 18GB. I've seen the same thing happen to others with the default
mergeFactor. ramBufferSizeMB is 48, and I have no throttling config. The index
is on a RAID10 volume made up of six 1TB SATA disks with a 1MB stripe size, so
the storage isn't slow; merges at the gigabyte scale simply take several
minutes.
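
Roughly, in Lucene API terms, the indexing config for these imports looks like
the sketch below (the index path, analyzer, and class name are placeholders,
and exact setter signatures vary a bit between Lucene releases):
{code:java}
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.FSDirectory;

public class ImportWriterSketch {
  public static void main(String[] args) throws Exception {
    // TieredMergePolicy equivalent of the old mergeFactor=35
    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setMaxMergeAtOnce(35);
    tmp.setSegmentsPerTier(35.0);

    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setRAMBufferSizeMB(48.0);  // flush a new segment roughly every 48 MB
    iwc.setMergePolicy(tmp);       // no merge IO throttling configured

    try (IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("/path/to/index")), iwc)) {
      // the import's addDocument()/updateDocument() loop would go here
    }
  }
}
{code}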
Recently I added autoCommit at 25000 docs with openSearcher=false, which I
think reduces the size of each initial segment a little, but I have not tried
again with the default maxMergeCount. I've had mine at 6 for years now, and
raising it has fixed import problems for others as well.
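
The maxMergeCount=6 setting maps to something like the sketch below on the
Lucene side (placeholder class name and analyzer; in Solr I set the same thing
through the mergeScheduler element in solrconfig.xml):
{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;

public class MergeSchedulerSketch {
  public static void main(String[] args) {
    // Allow up to 6 queued/running merges while still using only 1 merge
    // thread, so a backlog of large merges is far less likely to stall the
    // add/update thread.
    ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
    cms.setMaxMergesAndThreads(6, 1);  // (maxMergeCount, maxThreadCount)

    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setMergeScheduler(cms);
  }
}
{code}
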
> ConcurrentMergeScheduler/maxMergeCount default is too low
> ---------------------------------------------------------
>
> Key: LUCENE-5705
> URL: https://issues.apache.org/jira/browse/LUCENE-5705
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/other
> Affects Versions: 4.8
> Reporter: Shawn Heisey
> Assignee: Shawn Heisey
> Priority: Minor
> Fix For: 4.9
>
> Attachments: LUCENE-5705.patch, LUCENE-5705.patch
>
>
> The default value for maxMergeCount in ConcurrentMergeScheduler is 2. This
> causes problems for Solr's dataimport handler when very large imports are
> done from a JDBC source.
> What happens is that when three merge tiers are scheduled at the same time,
> the add/update thread will stop for several minutes while the largest merge
> finishes. In the meantime, the dataimporter JDBC connection to the database
> will time out, and when the add/update thread resumes, the import will fail
> because the ResultSet throws an exception. Setting maxMergeCount to 6
> eliminates this issue for imports of virtually any size; although it is
> theoretically possible to have that many simultaneous merge tiers, I've never
> seen it happen.
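>
> To make the failure mode concrete, here is a stand-alone model of the
> back-pressure involved (illustrative only, not Lucene's actual scheduler
> code):
> {code:java}
> // Models the behavior described above: once the merge backlog reaches
> // maxMergeCount, the indexing (add/update) thread blocks until a merge
> // finishes; during that pause an idle JDBC connection can time out.
> public class MergeBackpressureModel {
>   private final int maxMergeCount;
>   private int pendingMerges;
>
>   public MergeBackpressureModel(int maxMergeCount) {
>     this.maxMergeCount = maxMergeCount;
>   }
>
>   // Called from the indexing thread when a flush makes a merge necessary.
>   public synchronized void requestMerge() throws InterruptedException {
>     while (pendingMerges >= maxMergeCount) {
>       wait();  // with the default of 2, a third merge parks the caller here
>     }
>     pendingMerges++;
>   }
>
>   // Called by a merge thread when its merge completes.
>   public synchronized void mergeFinished() {
>     pendingMerges--;
>     notifyAll();
>   }
> }
> {code}
>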
> As long as maxThreads is properly set (the default value of 1 is appropriate
> for most installations), I cannot think of a really good reason for the
> default maxMergeCount to be so low. If someone does need to strictly control
> the number of threads that get created, they can reduce the setting
> themselves. Perhaps someone with more experience knows of a compelling reason
> to keep this default low?
> I'm not sure what the new default number should be, but I'd like to avoid
> bikeshedding. I don't think it should be Integer.MAX_VALUE.