While reviewing recent changes to the Solr Reference Guide, I came
across a mention of LUCENE-6063, which I hadn't noticed before.

https://issues.apache.org/jira/browse/LUCENE-6063

People using Solr's dataimport handler (DIH) sometimes run into a
show-stopper problem with the ConcurrentMergeScheduler (CMS)
denial-of-service protection:  When importing millions of documents
from a database and using the default CMS config,
eventually the segment merging will become busy enough and large enough
that the incoming thread is stalled for several minutes.  Some (possibly
most) JDBC implementations will disconnect from the database when this
happens, ending the import with an error once the incoming thread is
resumed.

Historically, I have advised users with this problem to increase
maxMergeCount to 6.  This ensures that enough simultaneous merges are
allowed so that the incoming thread never stalls.  If the import were
large enough, increasing maxMergeCount beyond 6 might be required, but
I've never seen it.

Now I'm wondering if there might be a way for Solr and/or Lucene to keep
this from happening at all with out-of-the-box config.

One thing we could do is increase the default maxMergeCount to 6.  I'm
guessing that the default in Lucene was intentionally chosen with care,
so I would expect that it will only be acceptable to change it for
Solr.  Changing the internal default in Solr might only be feasible when
no merge config is present, so we would also need to change the example
configs, perhaps just the dih-database example.  At the very least, a
comment in the DIH config would be helpful.

If there is a way in JDBC to ask the driver to keep a connection alive
even though it's inactive, the DIH code could be modified to use it. 
Overall indexing performance will not be the best because of the
indexing stall, but it would work without error.

Lastly, CMS itself could be modified to severely throttle the incoming
thread rather than completely stall it.  If the throttled rate were high
enough to ensure that the JDBC connection saw activity at least once
every 15 seconds, that would be enough for any sanely configured
database.  I'm mentally tossing around an idea for a "maxStallTime"
option instead of actual throttling ... something that would allow the
incoming thread to alternate running and stalling for certain time periods.
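
As a rough illustration of the "maxStallTime" idea, here is a small
sketch in plain Java.  Nothing in it is Lucene or Solr code:  the
StallGate class and its setBacklogged and maybeStall methods are
hypothetical names made up for this example, and the real CMS stall
logic is more involved.  The only point is to show a stall that is
capped at a maximum duration, so the incoming thread alternates between
stalling and doing one unit of work while merges are backed up.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical "maxStallTime" gate; an illustration, not part of Lucene's CMS. */
final class StallGate {
    private final long maxStallNanos; // longest single stall allowed
    private boolean backlogged;       // true while too many merges are pending

    StallGate(long maxStallMillis) {
        this.maxStallNanos = TimeUnit.MILLISECONDS.toNanos(maxStallMillis);
    }

    /** Called by the merge side whenever the backlog state changes. */
    synchronized void setBacklogged(boolean value) {
        backlogged = value;
        notifyAll(); // wake a stalled indexing thread so it re-checks the state
    }

    /**
     * Called by the incoming (indexing) thread before each unit of work.
     * Blocks while merges are backlogged, but never longer than the cap.
     * Returns the time spent stalled, in nanoseconds.
     */
    synchronized long maybeStall() {
        long start = System.nanoTime();
        long elapsed = 0;
        while (backlogged && elapsed < maxStallNanos) {
            try {
                // Wait for the rest of the allowed stall, or until notified.
                TimeUnit.NANOSECONDS.timedWait(this, maxStallNanos - elapsed);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                break;
            }
            elapsed = System.nanoTime() - start;
        }
        return elapsed;
    }
}
```

With a gate like this, an indexing loop would call maybeStall() before
each document.  While merges are behind, each call blocks for at most
the configured time and then lets one document through, which is the
"alternate running and stalling" behavior described above.  Whether one
document per stall period is enough to keep a given JDBC connection
alive would still depend on the driver and its fetch buffering.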

I will also look into whether the DIH config, wiki, and reference guide
include mention of this problem, and fix them if necessary.

Thanks,
Shawn

