[jira] [Comment Edited] (CASSANDRA-8225) Production-capable COPY FROM

Ryan Svihla (JIRA) Mon, 03 Nov 2014 08:29:18 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194692#comment-14194692
 ]


Ryan Svihla edited comment on CASSANDRA-8225 at 11/3/14 4:27 PM:
-----------------------------------------------------------------

Equally baffling, but it's a frequent request, with the pushback being the 
barrier to entry when getting started. These are businesses that don't already 
have Hadoop or Spark and are using something relational to do analytics on. Now, 
I'm happy to continue to explain to them that, at scale, this cannot possibly 
work in any way, shape, or form. However, I do get the desire to do something 
"good enough": export their giant file to their SAN, and then limp along while 
they let the rest of the org catch up with best practices for analytics on 
large data sets.

I view a better COPY FROM as a bridge to a better world, and another good way 
to get Cassandra into places that are new to the distributed world.


was (Author: rssvihla):
Equally baffling, but it's a frequent request with the pushback being barrier 
to entry when getting started. These are businesses that don't already have 
Hadoop or Spark and are using something sql server to do analytics on. Now I'm 
happy to continue to explain to them, at scale, this cannot possibly work in 
any way shape or form. However, I do get the desire to do something "good 
enough" and export their giant file to their SAN, and then limp along while 
they let the rest of the org catch up with best practices for analytics on 
large data sets.

I view a better COPY FROM as a bridge to a better world, and another good way 
to get Cassandra into places that are new to the distributed world.

> Production-capable COPY FROM
> ----------------------------
>
>                 Key: CASSANDRA-8225
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8225
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Tools
>            Reporter: Jonathan Ellis
>             Fix For: 2.1.2
>
>
> Via [~schumacr],
> bq. I pulled down a SourceForge data generator and created a mock file of 
> 500,000 rows that had an incrementing sequence number, date, and SSN. I then 
> used our COPY command and MySQL's LOAD DATA INFILE to load the file on my 
> Mac. Results were: 
> {noformat}
> mysql> load data infile '/Users/robin/dev/datagen3.txt'  into table p_test  
> fields terminated by ',';
> Query OK, 500000 rows affected (2.18 sec)
> {noformat}
> Cassandra 2.1.0 (pre-CASSANDRA-7405):
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> 500000 rows imported in 16 minutes and 45.485 seconds.
> {noformat}
> Cassandra 2.1.1:
> {noformat}
> cqlsh:dev> copy p_test from '/Users/robin/dev/datagen3.txt' with 
> delimiter=',';
> Processed 500000 rows; Write: 4037.46 rows/s
> 500000 rows imported in 2 minutes and 3.058 seconds.
> {noformat}
> [jbellis] CASSANDRA-7405 gets us almost an order of magnitude improvement.  
> Unfortunately, we're still almost 2 orders of magnitude slower than MySQL.
> I don't think we can continue to tell people, "use sstableloader instead."  
> The number of users sophisticated enough to use the sstable writers is small 
> and (relatively) decreasing as our user base expands.
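
As a rough check on the "order of magnitude" figures quoted above, the elapsed 
times from the three runs can be turned into throughput and ratios. This is only 
a sketch: the timings are copied from the issue description, and the helper and 
variable names are made up for illustration.

```python
import math

# Every run in the quoted benchmark loaded the same 500,000-row file.
ROWS = 500_000


def seconds(minutes=0, secs=0.0):
    """Convert a 'minutes and seconds' elapsed time to seconds (illustrative helper)."""
    return minutes * 60 + secs


# Elapsed times as reported in the issue description.
timings = {
    "MySQL LOAD DATA INFILE": seconds(secs=2.18),
    "Cassandra 2.1.0 COPY FROM": seconds(16, 45.485),
    "Cassandra 2.1.1 COPY FROM": seconds(2, 3.058),
}

# Average throughput over the whole run, derived from elapsed time only.
for name, elapsed in timings.items():
    print(f"{name}: {ROWS / elapsed:,.0f} rows/s")

# How much faster 2.1.1 is than 2.1.0, and how far 2.1.1 still trails MySQL.
speedup = timings["Cassandra 2.1.0 COPY FROM"] / timings["Cassandra 2.1.1 COPY FROM"]
gap = timings["Cassandra 2.1.1 COPY FROM"] / timings["MySQL LOAD DATA INFILE"]

print(f"2.1.1 vs 2.1.0: {speedup:.1f}x faster ({math.log10(speedup):.2f} orders of magnitude)")
print(f"MySQL vs 2.1.1: {gap:.1f}x faster ({math.log10(gap):.2f} orders of magnitude)")
```

On these numbers the 2.1.1 improvement works out to roughly 8x and the remaining 
gap to MySQL to roughly 56x, which is consistent with "almost an order of 
magnitude" and "almost 2 orders" respectively.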



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
