Jim Zamata created CASSANDRA-5741:
-------------------------------------

             Summary: Provide a way to disable automatic index rebuilds during bulk loading
                 Key: CASSANDRA-5741
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5741
             Project: Cassandra
          Issue Type: Improvement
          Components: Hadoop
    Affects Versions: 1.2.6
            Reporter: Jim Zamata


When using the BulkLoadOutputFormat, the actual streaming of the SSTables into 
Cassandra is fast, but the index rebuilds can take several minutes. Cassandra 
does not send the response until all of the rebuilds for a streaming session 
have completed. Because the record writer streams the files in its close 
method, the tasks appear to hang at 100%.  If the rebuilding process takes too 
long, the tasks can actually time out.
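
The timeout risk described above can be illustrated with a small model. This 
is only a sketch under stated assumptions, not Cassandra or Hadoop code: the 
class and method names are hypothetical, the durations are made up, and the 
10-minute limit is an assumed task timeout chosen for illustration. It treats 
the whole close call (streaming plus index rebuilds) as a period in which the 
task reports no progress.

```java
import java.time.Duration;

public class CloseTimeoutSketch {

    // Hypothetical model: close() streams the SSTables and then waits for the
    // index rebuilds, reporting no progress in between. The task is at risk
    // when that silent period is longer than the task timeout.
    static boolean wouldTimeOut(Duration streaming, Duration indexRebuild, Duration taskTimeout) {
        Duration silentPeriod = streaming.plus(indexRebuild);
        return silentPeriod.compareTo(taskTimeout) > 0;
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofMinutes(10); // assumed value, for illustration only

        // Fast streaming and a rebuild of a few minutes: close() returns in time.
        System.out.println(wouldTimeOut(Duration.ofMinutes(1), Duration.ofMinutes(4), timeout));

        // Same streaming time but a long rebuild: the task would be killed.
        System.out.println(wouldTimeOut(Duration.ofMinutes(1), Duration.ofMinutes(12), timeout));
    }
}
```

In this model the streaming time barely matters. The rebuild time alone 
decides whether the task survives, which is why skipping or deferring the 
rebuild would help.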

Many SQL databases provide bulk insert utilities that disable index updates to 
allow large amounts of data to be added quickly.  This functionality would 
serve a similar purpose.

An alternative might be an option that would allow the session to return once 
the SSTables have been successfully imported, without waiting for the index 
builds to complete.  However, I have noticed heavy CPU loads during the index 
rebuilds, so bulk load performance might be better if this step could be 
deferred until after all of the data is loaded.
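
The deferred behaviour suggested above could look roughly like the following 
toy model. It is a sketch only: DeferredIndexSketch, importSstables and 
rebuildPendingIndexes are hypothetical names invented for illustration, and 
nothing here corresponds to an existing Cassandra API. Each streaming session 
is acknowledged as soon as its SSTables are imported, and the index rebuilds 
are queued and run once after the whole load has finished.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class DeferredIndexSketch {

    // Sessions whose SSTables are imported but whose indexes are not yet rebuilt.
    private final Deque<String> pendingRebuilds = new ArrayDeque<>();

    // Ordered record of what the (hypothetical) node did, for inspection.
    final List<String> log = new ArrayList<>();

    // Acknowledge the session once the SSTables are imported; the index
    // rebuild is queued instead of being run before the response is sent.
    boolean importSstables(String session) {
        log.add("imported " + session);
        pendingRebuilds.add(session);
        return true;
    }

    // Run every queued rebuild in arrival order; returns how many were run.
    int rebuildPendingIndexes() {
        int rebuilt = 0;
        while (!pendingRebuilds.isEmpty()) {
            log.add("rebuilt index for " + pendingRebuilds.poll());
            rebuilt++;
        }
        return rebuilt;
    }

    public static void main(String[] args) {
        DeferredIndexSketch node = new DeferredIndexSketch();
        node.importSstables("session-1");
        node.importSstables("session-2");
        node.rebuildPendingIndexes(); // triggered once, after the bulk load
        node.log.forEach(System.out::println);
    }
}
```

With this ordering the loading tasks never wait on a rebuild, and the 
CPU-heavy index work happens in one pass after the data is in place.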

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
