[ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Hobbs updated CASSANDRA-8225:
-----------------------------------
    Attachment: 8225-2.1.txt

The attached patch (and 
[branch|https://github.com/thobbs/cassandra/tree/CASSANDRA-8225]) gives us 
roughly a 10x speedup on COPY FROM:

{noformat}
cqlsh:ks1> COPY foo FROM 'stuff.csv' ;
446736 rows imported in 16.667 seconds.
{noformat}

On my laptop, I see between 40k and 55k inserts per second until Cassandra 
starts to flush and garbage collect, which tends to slow things down (cqlsh and 
C* are competing for resources).  Even with those problems, it averages about 
30k rows per second.  I imagine that if C* were run on a separate machine, ~45k 
would be a more typical average.
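
As a rough sanity check on those figures, the overall throughput of the run shown above can be worked out from the reported totals. This is only a sketch of the arithmetic, not part of the patch; the helper name rows_per_second is made up for illustration.

```python
def rows_per_second(rows, seconds):
    """Average throughput for an import of `rows` rows that took `seconds` seconds."""
    return rows / seconds


# Totals taken from the cqlsh output above (446736 rows in 16.667 seconds).
patched = rows_per_second(446736, 16.667)
print(f"patched COPY FROM: {patched:.0f} rows/s overall")
```

The overall figure for that run comes out a little under 27k rows per second, in the same range as the "about 30k" average quoted above and well below the 40k to 55k peak rate seen before flushing and GC kick in.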

> Production-capable COPY FROM
> ----------------------------
>
>                 Key: CASSANDRA-8225
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Tools
>            Reporter: Jonathan Ellis
>            Assignee: Tyler Hobbs
>             Fix For: 2.1.3
>
>         Attachments: 8225-2.1.txt
>
>
> Via [~schumacr],
> bq. I pulled down a sourceforge data generator and created a mock file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 500000 rows affected (2.18 sec)
> {noformat}
> C* 2.1.0 (pre-CASSANDRA-7405)
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 500000 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 500000 rows; Write: 4037.46 rows/s
> 500000 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] 7405 gets us almost an order of magnitude improvement.  
> Unfortunately we're still almost 2 orders slower than mysql.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.
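
The "order of magnitude" comparisons in the description can be checked directly from the three timings it quotes. The following is a minimal sketch of that arithmetic only; the dictionary labels and the helper to_seconds are illustrative names, not anything from cqlsh or the patch.

```python
import math


def to_seconds(minutes, seconds):
    """Convert a 'M minutes and S seconds' duration to seconds."""
    return minutes * 60 + seconds


# Wall-clock times for loading the same 500,000-row file, as quoted above.
timings = {
    "mysql LOAD DATA INFILE": 2.18,
    "C* 2.1.0 COPY FROM": to_seconds(16, 45.485),
    "C* 2.1.1 COPY FROM": to_seconds(2, 3.058),
}

baseline = timings["mysql LOAD DATA INFILE"]
slowdowns = {name: secs / baseline for name, secs in timings.items()}

for name, factor in slowdowns.items():
    # log10 of the ratio gives the gap in orders of magnitude.
    print(f"{name}: {factor:.1f}x MySQL's time "
          f"({math.log10(factor):.2f} orders of magnitude)")

# Improvement from 2.1.0 to 2.1.1 (the effect attributed to CASSANDRA-7405).
improvement = timings["C* 2.1.0 COPY FROM"] / timings["C* 2.1.1 COPY FROM"]
print(f"2.1.0 -> 2.1.1: {improvement:.1f}x faster")
```

On these numbers, 2.1.1 is roughly 8x faster than 2.1.0 and still takes roughly 56x as long as MySQL, which matches the "almost an order of magnitude" and "almost 2 orders" wording in the description.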



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
