[ 
https://issues.apache.org/jira/browse/CASSANDRA-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984916#comment-14984916
 ] 

Stefania edited comment on CASSANDRA-9302 at 11/2/15 9:16 AM:
--------------------------------------------------------------

So far the most time-consuming thing to implement has been text parsing, needed 
to support prepared statements, along with the associated tests for composites 
and so forth. This should be done now. The biggest gain, however, comes from 
batching. According to the Python profiler, we spend most of the time sending 
requests to the server; we cannot afford to do this for each statement. In 
particular, to take advantage of TAR and connection pools in the driver we must 
call {{execute_async()}}, which increases the cost per request. Even batches as 
small as 10 statements have a huge impact, as they reduce that work by a factor 
of 10.

I propose to batch as follows: pass each worker process a big batch of 
approximately 1000 statements (configurable). Each worker process then checks 
whether it can group these entries by PK. If a PK group contains more than 10 
entries (configurable), we send it as its own batch; otherwise we aggregate the 
remaining statements into a single batch.
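The grouping step above can be sketched in plain Python. This is a hypothetical illustration, not the actual cqlsh code: the function name {{split\_into\_batches}}, the {{get\_pk}} callback, and the {{min\_batch\_size}} parameter are all assumptions made for the example; only the policy (groups larger than the configurable threshold become their own batch, everything else is aggregated into one) comes from the proposal.

```python
from collections import defaultdict

def split_into_batches(rows, get_pk, min_batch_size=10):
    """Group rows by partition key; any group larger than
    min_batch_size becomes its own batch, and the remaining
    small groups are aggregated into a single catch-all batch."""
    groups = defaultdict(list)
    for row in rows:
        groups[get_pk(row)].append(row)

    batches = []
    leftovers = []
    for pk, group in groups.items():
        if len(group) > min_batch_size:
            batches.append(group)    # one batch per large PK group
        else:
            leftovers.extend(group)  # aggregate the small groups
    if leftovers:
        batches.append(leftovers)
    return batches
```

Sending each large group as one batch keeps all its statements on the same partition, so the coordinator does no cross-partition fan-out, while the catch-all batch still amortizes the per-request cost for the stragglers.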

I've also added back-off and recovery, therefore CASSANDRA-9061 can be closed 
as a duplicate of this ticket.


was (Author: stefania):
So far the most time-consuming thing to implement has been text parsing, needed 
to support prepared statements, along with the associated tests for composites 
and so forth. This should be done now. The biggest gain, however, comes from 
batching. According to the Python profiler, we spend most of the time creating 
messages to send to the server; we cannot afford to do this for each statement. 
In particular, to take advantage of TAR and connection pools in the driver we 
must call {{execute_async()}}, which increases the cost per request compared to 
creating a message passed directly to the connection (which is what we currently 
do). Even batches as small as 10 statements have a huge impact, as they reduce 
that work by a factor of 10.

I propose to batch as follows: pass each worker process a big batch of 
approximately 1000 statements (configurable). Each worker process then checks 
whether it can group these entries by PK. If a PK group contains more than 10 
entries (configurable), we send it as its own batch; otherwise we aggregate the 
remaining statements into a single batch.

I've also added back-off and recovery, therefore CASSANDRA-9061 can be closed 
as a duplicate of this ticket.

> Optimize cqlsh COPY FROM, part 3
> --------------------------------
>
>                 Key: CASSANDRA-9302
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9302
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Jonathan Ellis
>            Assignee: Stefania
>            Priority: Critical
>             Fix For: 2.1.x
>
>
> We've had some discussion moving to Spark CSV import for bulk load in 3.x, 
> but people need a good bulk load tool now.  One option is to add a separate 
> Java bulk load tool (CASSANDRA-9048), but if we can match that performance 
> from cqlsh I would prefer to leave COPY FROM as the preferred option to which 
> we point people, rather than adding more tools that need to be supported 
> indefinitely.
> Previous work on COPY FROM optimization was done in CASSANDRA-7405 and 
> CASSANDRA-8225.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
