[jira] [Commented] (CASSANDRA-9302) Optimize cqlsh COPY FROM, part 3

Stefania (JIRA) Tue, 15 Dec 2015 01:37:31 -0800

    [ https://issues.apache.org/jira/browse/CASSANDRA-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057710#comment-15057710 ]


Stefania commented on CASSANDRA-9302:
-------------------------------------

bq. Now that we're not choosing session based on replica host, we might further 
simplify split_batches to just group by partition key (i.e., no need for 
get_replica). Alternatively, if you want to send to a specific host other than 
one that load balancing would choose, we would need to borrow a connection and 
send directly on that (I don't think that's worth doing).

We need to batch by replica rather than just by partition key, because the 
scope of a replica is much wider. Initially I was batching only by primary key, 
but that gave very bad results for workloads with unique primary keys, such as 
the one generated by _cassandra-stress_, which we normally use to benchmark 
these tools. If the current approach does not guarantee that we contact the 
same host, then we must either borrow a connection to ensure that's the case or 
revert to individual sessions. Since we do have a cap of max_requests, with 
individual sessions we would have to ensure that sessions are closed when we 
are finished with them rather than at the very end.
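
To illustrate the difference, here is a minimal sketch, not the actual cqlsh 
code. The host list, toy_get_replica, and both grouping helpers are 
hypothetical; a real implementation would look up replicas in the driver's 
token map rather than hash the key with CRC32.

```python
import zlib
from collections import defaultdict

# Hypothetical cluster; stands in for the hosts known to the driver.
HOSTS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]


def toy_get_replica(partition_key):
    """Assumed stand-in for a token-aware replica lookup."""
    return HOSTS[zlib.crc32(partition_key.encode("utf-8")) % len(HOSTS)]


def group_by_partition_key(rows):
    """One batch per distinct partition key."""
    groups = defaultdict(list)
    for pk, value in rows:
        groups[pk].append((pk, value))
    return dict(groups)


def group_by_replica(rows, get_replica=toy_get_replica):
    """One batch per replica host, whatever the partition key."""
    groups = defaultdict(list)
    for pk, value in rows:
        groups[get_replica(pk)].append((pk, value))
    return dict(groups)


# Unique keys, similar in spirit to a cassandra-stress workload.
rows = [("key%d" % i, i) for i in range(12)]
by_pk = group_by_partition_key(rows)
by_replica = group_by_replica(rows)

print("batches by partition key:", len(by_pk))
print("batches by replica:", len(by_replica))
```

With unique keys, grouping by partition key yields one single-row batch per 
row, while grouping by replica yields at most one batch per host.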

INGESTRATE is used to throttle sending more work but it cannot be smaller than 
a single workload unit (chunk_size * max_requests * num_processes). I'll update 
the documentation at a minimum, or see if this can be simplified.
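
As a rough worked example of that lower bound (the option values below are 
made up for illustration and are not the cqlsh defaults, and 
effective_ingest_rate is only one reading of "cannot be smaller than"):

```python
def min_workload_unit(chunk_size, max_requests, num_processes):
    """Rows in a single workload unit, per the formula above."""
    return chunk_size * max_requests * num_processes


def effective_ingest_rate(requested_rate, chunk_size, max_requests,
                          num_processes):
    """Assumed behaviour: the rate is floored at one workload unit."""
    unit = min_workload_unit(chunk_size, max_requests, num_processes)
    return max(requested_rate, unit)


# Hypothetical settings chosen only to make the arithmetic easy to follow.
unit = min_workload_unit(chunk_size=1000, max_requests=6, num_processes=4)
low = effective_ingest_rate(10000, 1000, 6, 4)
high = effective_ingest_rate(100000, 1000, 6, 4)

print(unit, low, high)
```

Under these assumptions a requested rate of 10000 rows would be raised to the 
24000-row unit, while a requested rate of 100000 would be left unchanged.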

I'll fix the other two minor points as well, so moving back to in progress.

> Optimize cqlsh COPY FROM, part 3
> --------------------------------
>
>                 Key: CASSANDRA-9302
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9302
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Jonathan Ellis
>            Assignee: Stefania
>            Priority: Critical
>             Fix For: 2.1.x
>
>
> We've had some discussion moving to Spark CSV import for bulk load in 3.x, 
> but people need a good bulk load tool now.  One option is to add a separate 
> Java bulk load tool (CASSANDRA-9048), but if we can match that performance 
> from cqlsh I would prefer to leave COPY FROM as the preferred option to which 
> we point people, rather than adding more tools that need to be supported 
> indefinitely.
> Previous work on COPY FROM optimization was done in CASSANDRA-7405 and 
> CASSANDRA-8225.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
