[jira] [Commented] (LUCENE-5705) ConcurrentMergeScheduler/maxMergeCount default is too low

Shai Erera (JIRA) Sun, 25 May 2014 11:28:23 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008396#comment-14008396 ]


Shai Erera commented on LUCENE-5705:
------------------------------------

It depends on what you would like to achieve. If you import documents that 
amount to 100 segments and care only about the end result, i.e. a merged index 
(per the MP settings), then I am not sure it will matter much if you first 
import w/o merging, and then call maybeMerge(). But if you care about how fast 
DIH finishes importing, and are willing to let merges run in the background 
while e.g. the index is searched, then disabling merges while you import data 
will let the import itself finish sooner.

When I experimented with building indexes on Hadoop, I always disabled merges 
while the index was being built, and ran a special job afterwards to merge it. 
This avoided a lot of copying around HDFS. I'm not saying this is your case, 
but it is sometimes useful to turn off merges when you're running 
batch-oriented jobs.

> ConcurrentMergeScheduler/maxMergeCount default is too low
> ---------------------------------------------------------
>
>                 Key: LUCENE-5705
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5705
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/other
>    Affects Versions: 4.8
>            Reporter: Shawn Heisey
>            Assignee: Shawn Heisey
>            Priority: Minor
>             Fix For: 4.9
>
>         Attachments: LUCENE-5705.patch, LUCENE-5705.patch, dih-example.patch
>
>
> The default value for maxMergeCount in ConcurrentMergeScheduler is 2.  This 
> causes problems for Solr's dataimport handler when very large imports are 
> done from a JDBC source.
> What happens is that when three merge tiers are scheduled at the same time, 
> the add/update thread will stop for several minutes while the largest merge 
> finishes.  In the meantime, the dataimporter JDBC connection to the database 
> will time out, and when the add/update thread resumes, the import will fail 
> because the ResultSet throws an exception.  Setting maxMergeCount to 6 
> eliminates this issue for virtually any size import -- although it is 
> theoretically possible to have that many simultaneous merge tiers, I've never 
> seen it.
> As long as maxThreads is properly set (the default value of 1 is appropriate 
> for most installations), I cannot think of a really good reason that the 
> default for maxMergeCount should be so low.  If someone does need to strictly 
> control the number of threads that get created, they can reduce the number.  
> Perhaps someone with more experience knows of a really good reason to make 
> this default low?
> I'm not sure what the new default number should be, but I'd like to avoid 
> bikeshedding.  I don't think it should be Integer.MAX_VALUE.
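
To make the stall condition in the description above concrete, here is a toy 
model in plain Java. It is only a sketch and is not Lucene's 
ConcurrentMergeScheduler code: the class name, both method names, and the 
backlog timeline are made up for illustration, and the rule "indexing pauses 
once more merges are outstanding than maxMergeCount" is an assumption taken 
from the description (three simultaneous merges stall at the default of 2).

```java
// Toy model only: this is NOT Lucene's ConcurrentMergeScheduler implementation.
final class MergeBacklogModel {

    // Assumption taken from the issue description: the add/update thread is
    // paused once more merges are outstanding than maxMergeCount allows.
    static boolean indexingStalls(int outstandingMerges, int maxMergeCount) {
        return outstandingMerges > maxMergeCount;
    }

    // Counts how many samples of a (hypothetical) merge-backlog timeline
    // would leave the add/update thread paused.
    static int stalledSamples(int[] outstandingMergesOverTime, int maxMergeCount) {
        int stalled = 0;
        for (int outstanding : outstandingMergesOverTime) {
            if (indexingStalls(outstanding, maxMergeCount)) {
                stalled++;
            }
        }
        return stalled;
    }

    public static void main(String[] args) {
        // Made-up samples of how many merges are outstanding at once.
        int[] timeline = {1, 2, 3, 3, 2, 1};
        System.out.println("maxMergeCount=2: " + stalledSamples(timeline, 2) + " stalled samples");
        System.out.println("maxMergeCount=6: " + stalledSamples(timeline, 6) + " stalled samples");
    }
}
```

Under this model the same backlog that pauses indexing with a limit of 2 never 
pauses it with a limit of 6, which is the effect the description reports. How 
long each pause lasts depends on the size of the largest merge, which the model 
does not try to capture.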



--
This message was sent by Atlassian JIRA
(v6.2#6252)

