[
https://issues.apache.org/jira/browse/CASSANDRA-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057710#comment-15057710
]
Stefania commented on CASSANDRA-9302:
-------------------------------------
bq. Now that we're not choosing session based on replica host, we might further
simplify split_batches to just group by partition key (i.e., no need for
get_replica). Alternatively, if you want to send to a specific host other than
one that load balancing would choose, we would need to borrow a connection and
send directly on that (I don't think that's worth doing).
We need to batch by replica rather than just by partition key as the scope is
much wider. Initially I was batching only by primary key but that gave very bad
results for workloads with unique primary keys, like the one we normally use to
benchmark these tools, _cassandra-stress_. If the current approach does not
guarantee that we contact the same host, then we must either borrow a connection
to ensure that's the case or revert to individual sessions. Since we do have a
cap of max_requests, we would then have to ensure sessions are closed as soon as
we are finished with them rather than at the very end.
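To illustrate why grouping by replica matters when keys are unique, here is a
minimal sketch. It is not the actual cqlsh code: this split_batches takes a
get_replica callable as a parameter, and the toy_replica function, the host
names and max_batch_size are all assumptions made up for the example.

```python
from collections import defaultdict


def split_batches(rows, get_group, max_batch_size=20):
    """Group rows by get_group(row) and yield (group, batch) pairs.

    Illustrative only: get_group stands in for either "partition key"
    or "replica owning the partition key".
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[get_group(row)].append(row)
    for group, group_rows in grouped.items():
        # Cap each batch so that a large group is split into several batches.
        for i in range(0, len(group_rows), max_batch_size):
            yield group, group_rows[i:i + max_batch_size]


# Hypothetical workload with unique keys, similar in spirit to cassandra-stress.
rows = [(pk, "value-%d" % pk) for pk in range(12)]
hosts = ["host-a", "host-b", "host-c"]  # made-up host names


def toy_replica(row):
    # Toy stand-in for a token-aware replica lookup (assumption).
    return hosts[row[0] % len(hosts)]


by_key = list(split_batches(rows, lambda row: row[0]))
by_replica = list(split_batches(rows, toy_replica))

print(len(by_key), "batches when grouping by key")
print(len(by_replica), "batches when grouping by replica")
```

With unique keys, grouping by key degenerates into one batch per row, whereas
grouping by replica still produces multi-row batches.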
INGESTRATE throttles how quickly more work is sent, but it cannot be smaller
than a single workload unit (chunk_size * max_requests * num_processes). I'll
update the documentation at a minimum, or see if this can be simplified.
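As a worked example of that lower bound, the sketch below computes the size of
one workload unit and clamps a requested rate to it. The function names and the
numbers are assumptions chosen for illustration; they are not the defaults and
not the actual cqlsh implementation.

```python
def workload_unit(chunk_size, max_requests, num_processes):
    """Rows in a single workload unit, per the formula above."""
    return chunk_size * max_requests * num_processes


def effective_ingest_rate(requested, chunk_size, max_requests, num_processes):
    """Clamp a requested rate so it is never below one workload unit.

    Assumption: a rate below one unit is simply raised to the unit size.
    """
    return max(requested, workload_unit(chunk_size, max_requests, num_processes))


# Illustrative values only (not the tool's defaults).
unit = workload_unit(chunk_size=1000, max_requests=6, num_processes=4)
low = effective_ingest_rate(10000, 1000, 6, 4)
high = effective_ingest_rate(100000, 1000, 6, 4)
print(unit, low, high)
```

Under these assumed values one workload unit is 24000 rows, so a requested rate
of 10000 could not be honoured as given.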
I'll fix the other two minor points as well, so moving back to in progress.
> Optimize cqlsh COPY FROM, part 3
> --------------------------------
>
> Key: CASSANDRA-9302
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9302
> Project: Cassandra
> Issue Type: Improvement
> Components: Tools
> Reporter: Jonathan Ellis
> Assignee: Stefania
> Priority: Critical
> Fix For: 2.1.x
>
>
> We've had some discussion about moving to Spark CSV import for bulk load in 3.x,
> but people need a good bulk load tool now. One option is to add a separate
> Java bulk load tool (CASSANDRA-9048), but if we can match that performance
> from cqlsh I would prefer to leave COPY FROM as the preferred option to which
> we point people, rather than adding more tools that need to be supported
> indefinitely.
> Previous work on COPY FROM optimization was done in CASSANDRA-7405 and
> CASSANDRA-8225.