[
https://issues.apache.org/jira/browse/CASSANDRA-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058299#comment-15058299
]
Stefania commented on CASSANDRA-9302:
-------------------------------------
Thanks Adam. As discussed, here are two possible follow-ups:
* The ingest rate limit only works correctly if the chunk size is much smaller
than the ingest rate (chunk size << ingest rate), since we still send at least
one chunk at a time.
* The 6-second improvement I noted when I reverted to batching by primary key
rather than by replica is caused by a slow lookup in the token map (bisect
right). The driver TAR performs only one lookup per batch, whereas batching by
replica requires one lookup per record. To make batching by replica viable,
which should be faster in theory, we must optimize the token map lookup, but
this is not easy to do. Provided we have at least one local replica, this
should not be worth it, but we may want to revisit it for non-local clusters
if the need arises.
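
To make both points concrete, here is a rough Python sketch. It is not the
cqlsh or driver code: ToyTokenMap, rows_sent_per_second, batch_by_key and
batch_by_replica are hypothetical names, the "one wake-up per second" rate
model is an assumption, and a plain integer token stands in for the key. It
only illustrates why the rate overshoots when a chunk is larger than the
ingest rate, and how the number of bisect_right lookups differs between the
two batching strategies.

```python
from bisect import bisect_right
from collections import defaultdict


def rows_sent_per_second(ingest_rate, chunk_size):
    """Toy rate model (assumption): only whole chunks are sent, and at
    least one chunk goes out every second regardless of the limit."""
    chunks = max(1, ingest_rate // chunk_size)
    return chunks * chunk_size


class ToyTokenMap:
    """Hypothetical stand-in for a token map: a sorted ring of
    (token, replica) pairs searched with bisect_right. Ownership rules
    at exact token boundaries are simplified here."""

    def __init__(self, ring):
        self.ring = sorted(ring)
        self.tokens = [token for token, _ in self.ring]
        self.lookups = 0  # counts how often the ring is searched

    def replica_for(self, token):
        self.lookups += 1
        idx = bisect_right(self.tokens, token)
        if idx == len(self.tokens):
            idx = 0  # wrap around the ring
        return self.ring[idx][1]


def batch_by_key(token_map, rows):
    """Group rows by key first; the ring is searched once per batch."""
    batches = defaultdict(list)
    for token, payload in rows:
        batches[token].append(payload)
    return {token: (token_map.replica_for(token), payloads)
            for token, payloads in batches.items()}


def batch_by_replica(token_map, rows):
    """Group rows by replica; the ring is searched once per record."""
    batches = defaultdict(list)
    for token, payload in rows:
        batches[token_map.replica_for(token)].append(payload)
    return dict(batches)
```

With 7 rows spread over 4 distinct keys, batch_by_key searches the ring 4
times and batch_by_replica 7 times; the gap grows with the number of rows per
key. Likewise, rows_sent_per_second(1000, 5000) yields 5000 in this model,
five times the requested rate, because one whole chunk is always sent.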
> Optimize cqlsh COPY FROM, part 3
> --------------------------------
>
> Key: CASSANDRA-9302
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9302
> Project: Cassandra
> Issue Type: Improvement
> Components: Tools
> Reporter: Jonathan Ellis
> Assignee: Stefania
> Priority: Critical
> Fix For: 2.1.x
>
>
> We've had some discussion about moving to Spark CSV import for bulk load in 3.x,
> but people need a good bulk load tool now. One option is to add a separate
> Java bulk load tool (CASSANDRA-9048), but if we can match that performance
> from cqlsh I would prefer to leave COPY FROM as the preferred option to which
> we point people, rather than adding more tools that need to be supported
> indefinitely.
> Previous work on COPY FROM optimization was done in CASSANDRA-7405 and
> CASSANDRA-8225.