[ 
https://issues.apache.org/jira/browse/CASSANDRA-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058150#comment-15058150
 ] 

Stefania commented on CASSANDRA-9302:
-------------------------------------

It seems the poor performance of my initial tests with batching by pk was 
caused either by using the wrong load balancing policy (not DC aware) or by 
incorrect primary key construction. I repeated the tests today with pk 
batching rather than replica batching and it shaved off another 6 seconds, so 
I think there is nothing else to do here. I can import 1M records on my 
machine in just under 21 seconds with 3 nodes active (previously it took about 
27 seconds, both when contacting replicas directly and when batching by 
replica).
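To illustrate what batching by pk means here, the following is a minimal 
sketch and not the actual cqlsh code. The function name 
group_by_partition_key and the pk_indexes parameter are hypothetical; the 
sketch assumes each row is a tuple and that the partition key columns are 
identified by position.

```python
from collections import defaultdict


def group_by_partition_key(rows, pk_indexes):
    """Group rows that share the same partition key values.

    rows: iterable of tuples, one per CSV record (assumed layout).
    pk_indexes: positions of the partition key columns (hypothetical).
    Returns a dict mapping each partition key tuple to its rows, so that
    each group can be sent as a single batch.
    """
    batches = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in pk_indexes)
        batches[key].append(row)
    return dict(batches)


if __name__ == "__main__":
    sample = [("a", 1, "x"), ("b", 2, "y"), ("a", 3, "z")]
    for key, group in group_by_partition_key(sample, [0]).items():
        print(key, len(group))
```

Rows in the same group share a partition and therefore the same replicas, 
so each group can go out as one batch.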

Regarding the ingest rate, I simplified the logic so that we no longer use the 
max requests but rely entirely on the ingest rate. I also fixed a bug and added 
the average rate to the rate meter. The rate meter was only displaying the rate 
of the last period, slightly smoothed, which is not necessarily the ingest 
rate and so was a bit confusing. The average rate is much closer to the ingest 
rate. I also introduced fixed formatting to avoid the double 's'.
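The difference between the smoothed rate of the last period and the average 
rate can be sketched as follows. This is an illustration only, not the rate 
meter in cqlsh: the class name, the smoothing factor and the injectable clock 
are all assumptions made for the example.

```python
import time


class RateMeter:
    """Tracks a smoothed last-period rate and an overall average rate."""

    def __init__(self, smoothing=0.5, clock=time.monotonic):
        self.clock = clock          # injectable so the meter can be tested
        self.smoothing = smoothing  # weight given to the latest period
        self.start = self.last = clock()
        self.total = 0              # rows counted since the start
        self.period_count = 0       # rows counted since the last report
        self.current_rate = 0.0     # smoothed rate of recent periods

    def add(self, n):
        self.total += n
        self.period_count += n

    def report(self):
        """Return (smoothed last-period rate, average rate) in rows/second."""
        now = self.clock()
        elapsed = now - self.last
        if elapsed > 0:
            period_rate = self.period_count / elapsed
            self.current_rate = (self.smoothing * period_rate
                                 + (1 - self.smoothing) * self.current_rate)
            self.last = now
            self.period_count = 0
        total_elapsed = now - self.start
        average = self.total / total_elapsed if total_elapsed > 0 else 0.0
        return self.current_rate, average
```

With a bursty import the smoothed value can swing well above or below the 
configured ingest rate, while the average settles close to it, which is 
presumably why displaying both is less confusing.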

Now that the ingest rate is working properly, I've increased its default to 
100K, since I don't want a default value that could be a bottleneck. I've also 
changed the report frequency to be expressed in seconds rather than in number 
of rows: it seems more natural that way, and it also made it easier to support 
the ingest rate.

Finally, the numeric option values have also been fixed.

CI is still pending.

> Optimize cqlsh COPY FROM, part 3
> --------------------------------
>
>                 Key: CASSANDRA-9302
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9302
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Jonathan Ellis
>            Assignee: Stefania
>            Priority: Critical
>             Fix For: 2.1.x
>
>
> We've had some discussion moving to Spark CSV import for bulk load in 3.x, 
> but people need a good bulk load tool now.  One option is to add a separate 
> Java bulk load tool (CASSANDRA-9048), but if we can match that performance 
> from cqlsh I would prefer to leave COPY FROM as the preferred option to which 
> we point people, rather than adding more tools that need to be supported 
> indefinitely.
> Previous work on COPY FROM optimization was done in CASSANDRA-7405 and 
> CASSANDRA-8225.



