[jira] [Created] (SOLR-5075) SolrCloud commit process is too time consuming, even if documents are light

Radu Ghita (JIRA) Wed, 24 Jul 2013 23:44:18 -0700

Radu Ghita created SOLR-5075:
--------------------------------

             Summary: SolrCloud commit process is too time consuming, even if 
documents are light
                 Key: SOLR-5075
                 URL: https://issues.apache.org/jira/browse/SOLR-5075
             Project: Solr
          Issue Type: Bug
          Components: Schema and Analysis, SolrCloud
    Affects Versions: 4.1
         Environment: SolrCloud 4.1, internal ZooKeeper, 16 shards, custom Java 
importer.
Server: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, 32 cores, 192 GB RAM, 10 TB SSD 
and 50 TB SAS storage
            Reporter: Radu Ghita



We have a client whose business model requires indexing a billion rows from 
MySQL into Solr each month, within a short time frame. The documents are very 
light, but there are a great many of them, and we need to sustain around 
80-100k documents/s. The built-in Solr indexer tops out at 40-50k documents/s, 
slows down as the hours go by, and crashes after about 12 hours.

We therefore developed a custom Java importer that connects directly to MySQL 
and to SolrCloud (via ZooKeeper), reads rows from MySQL, builds documents and 
adds them to Solr. This helps because it runs about 50 threads in parallel, 
which speeds up indexing. After optimizing the MySQL queries (MySQL was the 
initial bottleneck) we now get over 100k documents/s, but as the index grows, 
Solr takes longer and longer to add documents. I assume some setting in 
solrconfig.xml makes Solr stall, and eventually block, once about 100 million 
documents have been indexed.
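
To make the threading model concrete, here is a minimal sketch of a bounded worker pool, not the importer's actual code. The class name BoundedImporterPool, the pool and queue sizes, and the stand-in job are all hypothetical; a real job would read a chunk of rows from MySQL and add the resulting documents to Solr. The bounded queue with CallerRunsPolicy is one possible design choice, assumed here so that the producer slows down when the workers fall behind instead of queueing work without limit.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: a fixed number of indexing threads fed by a bounded queue.
final class BoundedImporterPool {
    private final ThreadPoolExecutor executor;

    BoundedImporterPool(int threads, int queueCapacity) {
        // When the queue is full, CallerRunsPolicy makes the submitting thread
        // run the job itself, which throttles the producer (backpressure).
        this.executor = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    // Queue one unit of work, e.g. "fetch one chunk of rows and index it".
    void submit(Runnable batchJob) {
        executor.execute(batchJob);
    }

    // Stop accepting jobs and wait for the queued ones; true if all finished in time.
    boolean shutdownAndWait(long timeout, TimeUnit unit) {
        executor.shutdown();
        try {
            return executor.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        AtomicLong rows = new AtomicLong();
        BoundedImporterPool pool = new BoundedImporterPool(4, 8);
        for (int i = 0; i < 100; i++) {
            // Stand-in for a real job: pretend each job indexed 1000 rows.
            pool.submit(() -> rows.addAndGet(1000));
        }
        boolean finished = pool.shutdownAndWait(10, TimeUnit.SECONDS);
        System.out.println("finished=" + finished + ", rows handled=" + rows.get());
    }
}
```

With about 50 threads, as in the importer described above, the first constructor argument would be 50; the queue capacity is a tuning knob and the value shown is only an illustration.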

Here is the Java code that creates the documents and then adds them to the Solr 
server:

        public void createDocuments() throws SQLException, SolrServerException, IOException
        {
                App.logger.write("Creating documents..");
                this.docs = new ArrayList<SolrInputDocument>();
                App.logger.incrementNumberOfRows(this.size);
                while (this.results.next())
                {
                        this.docs.add(this.getDocumentFromResultSet(this.results));
                }
                // Close the result set before the statement that produced it.
                this.results.close();
                this.statement.close();
        }

        public void commitDocuments() throws SolrServerException, IOException
        {
                App.logger.write("Committing..");
                // This call takes very long and then blocks.
                App.solrServer.add(this.docs);
                App.logger.incrementNumberOfRows(this.docs.size());
                this.docs.clear();
        }
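
For comparison, here is a minimal sketch of sending documents in fixed-size batches instead of one list per result set. It is an illustration only, under the unverified assumption that the size of the list passed to add() contributes to the stall. BatchingSender and its sink are hypothetical names; in the importer the sink would wrap App.solrServer.add(batch) and handle its checked exceptions, since a Consumer cannot throw them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: buffers items and hands them to a sink in fixed-size batches.
final class BatchingSender<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> buffer;

    BatchingSender(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
        this.buffer = new ArrayList<>(batchSize);
    }

    // Buffer one document; send the batch as soon as it is full.
    void add(T doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Send whatever is buffered (call once more at the end for the partial batch).
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // Pass a copy so the sink may keep the list after the buffer is cleared.
        sink.accept(new ArrayList<>(buffer));
        buffer.clear();
    }

    public static void main(String[] args) {
        List<List<Integer>> sent = new ArrayList<>();
        BatchingSender<Integer> sender = new BatchingSender<>(3, sent::add);
        for (int i = 0; i < 7; i++) {
            sender.add(i);
        }
        sender.flush();
        System.out.println("batches sent: " + sent);
    }
}
```

Each instance is meant to be used by a single thread; with many importer threads, each thread would own its own sender.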

Here are the solrconfig.xml settings that are relevant to this discussion:
<maxIndexingThreads>128</maxIndexingThreads>
<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>10000</ramBufferSizeMB>
<maxBufferedDocs>1000000</maxBufferedDocs>
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">20000</int>
  <int name="segmentsPerTier">1000000</int>
  <int name="maxMergeAtOnceExplicit">10000</int>
</mergePolicy>
<mergeFactor>100</mergeFactor>
<termIndexInterval>1024</termIndexInterval>
<autoCommit>
  <maxTime>15000</maxTime>
  <maxDocs>1000000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>2000000</maxTime>
</autoSoftCommit>

Thanks a lot for any answers, and excuse the long text; I'm new to this JIRA. 
If any other info is needed, please let me know.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

