Radu Ghita created SOLR-5075:
--------------------------------
Summary: SolrCloud commit process is too time consuming, even if
documents are light
Key: SOLR-5075
URL: https://issues.apache.org/jira/browse/SOLR-5075
Project: Solr
Issue Type: Bug
Components: Schema and Analysis, SolrCloud
Affects Versions: 4.1
Environment: SolrCloud 4.1, internal ZooKeeper, 16 shards, custom Java
importer.
Server: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, 32 cores, 192 GB RAM, 10 TB SSD
and 50 TB SAS storage
Reporter: Radu Ghita
We have a client whose business model requires indexing rows from MySQL into
Solr every month, on the scale of a billion, within a short time frame. The
documents are very light, but there are a great many of them, and we need to
reach indexing speeds of around 80-100k documents per second. The built-in Solr
indexer tops out at 40-50k per second, slows down as the hours go by, and
crashes after roughly 12 hours.
We therefore developed a custom Java importer that connects directly to MySQL
and to SolrCloud via ZooKeeper, fetches data from MySQL, creates documents and
then imports them into Solr. This helps because we open ~50 threads, which
speeds up indexing. We have optimized the MySQL queries (MySQL was the initial
bottleneck) and now get speeds of over 100k/s, but as the index grows, Solr
takes a very long time to add documents. I assume some setting in
solrconfig.xml makes Solr stall, and even block, once about 100 million
documents have been indexed.
Here is the Java code that creates the documents and then adds them to the
Solr server:
public void createDocuments() throws SQLException, SolrServerException,
        IOException
{
    App.logger.write("Creating documents..");
    this.docs = new ArrayList<SolrInputDocument>();
    App.logger.incrementNumberOfRows(this.size);
    while(this.results.next())
    {
        this.docs.add(this.getDocumentFromResultSet(this.results));
    }
    this.statement.close();
    this.results.close();
}

public void commitDocuments() throws SolrServerException, IOException
{
    App.logger.write("Committing..");
    App.solrServer.add(this.docs); // here it stays very long and then blocks
    App.logger.incrementNumberOfRows(this.docs.size());
    this.docs.clear();
}
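One possible variation on the code above, offered only as a sketch: instead of collecting an entire result set in this.docs and sending it in a single add call, documents could be handed over in fixed-size batches. BatchingBuffer and its sink callback are hypothetical names, not part of the importer or of SolrJ; in the real importer the callback would presumably wrap App.solrServer.add(batch). The sketch assumes Java 8 or later for java.util.function.Consumer, and whether it helps with the slowdown described here is untested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Hypothetical helper: buffers items and passes them to a sink in
 * fixed-size batches, so no single call carries a whole result set.
 */
final class BatchingBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> buffer;
    private long totalFlushed = 0;

    BatchingBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
        this.buffer = new ArrayList<>(batchSize);
    }

    /** Adds one item and flushes automatically once the batch is full. */
    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends any buffered items to the sink; call once more at end of input. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // Pass a copy so the sink may keep the list after the buffer is cleared.
        sink.accept(new ArrayList<>(buffer));
        totalFlushed += buffer.size();
        buffer.clear();
    }

    long getTotalFlushed() {
        return totalFlushed;
    }
}
```

In createDocuments(), the loop body would then call add(...) on such a buffer in place of this.docs.add(...), followed by one final flush() after the loop. The batch size is a tuning parameter that would need to be measured against the actual cluster.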
I am also pasting the solrconfig.xml parameters that are relevant to this
discussion:
<maxIndexingThreads>128</maxIndexingThreads>
<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>10000</ramBufferSizeMB>
<maxBufferedDocs>1000000</maxBufferedDocs>
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">20000</int>
  <int name="segmentsPerTier">1000000</int>
  <int name="maxMergeAtOnceExplicit">10000</int>
</mergePolicy>
<mergeFactor>100</mergeFactor>
<termIndexInterval>1024</termIndexInterval>
<autoCommit>
  <maxTime>15000</maxTime>
  <maxDocs>1000000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>2000000</maxTime>
</autoSoftCommit>
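As a rough sanity check on the autoCommit settings above, the following sketch estimates which of the two triggers fires first on a single core. It assumes the 100k/s rate is spread evenly over the 16 shards with one core per shard, which is an assumption and not something measured here; AutoCommitEstimate is a hypothetical name.

```java
/** Back-of-the-envelope estimate of the hard-commit interval per core. */
final class AutoCommitEstimate {

    /**
     * Returns the seconds until the first autoCommit trigger fires,
     * taking the earlier of the maxDocs and maxTime thresholds.
     */
    static double secondsUntilCommit(double docsPerSecondPerCore,
                                     long maxDocs, long maxTimeMs) {
        double secondsByDocs = maxDocs / docsPerSecondPerCore;
        double secondsByTime = maxTimeMs / 1000.0;
        return Math.min(secondsByDocs, secondsByTime);
    }

    public static void main(String[] args) {
        double clusterRate = 100_000.0; // docs/s, the rate quoted in this report
        int shards = 16;                // from the Environment field
        // Assumption: documents are distributed evenly across shards.
        double perCoreRate = clusterRate / shards;

        double interval = secondsUntilCommit(perCoreRate, 1_000_000L, 15_000L);
        System.out.println("per-core rate (docs/s): " + perCoreRate);
        System.out.println("estimated seconds between hard commits: " + interval);
    }
}
```

Under those assumptions each core receives about 6,250 documents per second, so maxTime (15 s) is reached long before maxDocs (about 160 s at that rate), and a hard commit with openSearcher=false would run roughly every 15 seconds on each core.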
Thanks a lot for any answers, and please excuse the long text; I'm new to this
JIRA. If any other information is needed, please let me know.