[jira] [Commented] (CASSANDRA-9304) COPY TO improvements
[ https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726901#comment-14726901 ]

David Kua commented on CASSANDRA-9304:
--

[~Stefania] Updates to my 9304 branch now include a parameter for the COPY command that allows the number of jobs to be configured. RateMeter was also changed and fixed up, as an issue was found during testing.

Testing also found issues with ByteOrderedPartitioner and OrderPreservingPartitioner: BOP's tokens don't work with the SELECT statements I'm using, and OPP has no token ring, so it can't be parallelized. Changes were therefore made so that COPY TO runs as a single process when it encounters those two partitioners.

Tests were updated and can be found here: https://github.com/dkua/cassandra-dtest/tree/bulk_export

The cqlsh COPY tests now run with a cluster of 3 nodes, and the tests have increased from 1k rows to 10k rows. One of the read/write tests now also tests different partitioners and should cover that case perfectly fine.

> COPY TO improvements
>
>                 Key: CASSANDRA-9304
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9304
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: David Kua
>            Priority: Minor
>              Labels: cqlsh
>             Fix For: 2.1.x
>
> COPY FROM has gotten a lot of love. COPY TO not so much. One obvious
> improvement could be to parallelize reading and writing (write one page of
> data while fetching the next).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
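The single-process fallback described in the comment above could be sketched roughly as follows. This is only an illustration, not the code on the 9304 branch: the function name `effective_jobs` and the `UNPARALLELIZABLE` tuple are hypothetical, and the only facts taken from the comment are that ByteOrderedPartitioner and OrderPreservingPartitioner force a single process and that the number of jobs is configurable.

```python
# Hypothetical helper: decide how many worker processes COPY TO should use.
# Partitioners for which the comment says the export runs as a single process.
UNPARALLELIZABLE = ("ByteOrderedPartitioner", "OrderPreservingPartitioner")


def effective_jobs(partitioner, requested_jobs):
    """Return the number of worker processes to start.

    `partitioner` may be a fully qualified class name such as
    'org.apache.cassandra.dht.Murmur3Partitioner' or just the short name.
    """
    short_name = partitioner.rsplit(".", 1)[-1]
    if short_name in UNPARALLELIZABLE:
        # Fall back to behaving like the original single-process export.
        return 1
    # Never start fewer than one process, whatever the user asked for.
    return max(1, requested_jobs)
```

For example, `effective_jobs("org.apache.cassandra.dht.ByteOrderedPartitioner", 8)` yields 1, while the same request against Murmur3Partitioner keeps all 8 jobs.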
[jira] [Commented] (CASSANDRA-9777) If you have a ~/.cqlshrc and a ~/.cassandra/cqlshrc, cqlsh will overwrite the latter with the former
[ https://issues.apache.org/jira/browse/CASSANDRA-9777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659093#comment-14659093 ]

David Kua commented on CASSANDRA-9777:
--

https://github.com/dkua/cassandra/tree/cass-9777

I've rebased the branch to be up to date with the current 2.2 branch and added a commit that changes the warning message to be more detailed, as per [~thobbs]'s suggestion.

If you have a ~/.cqlshrc and a ~/.cassandra/cqlshrc, cqlsh will overwrite the latter with the former

                Key: CASSANDRA-9777
                URL: https://issues.apache.org/jira/browse/CASSANDRA-9777
            Project: Cassandra
         Issue Type: Bug
           Reporter: Jon Moses
           Assignee: David Kua
             Labels: cqlsh
            Fix For: 2.2.x

If you have a .cqlshrc file, and a ~/.cassandra/cqlshrc file, when you run `cqlsh`, it will overwrite the latter with the former.

https://github.com/apache/cassandra/blob/trunk/bin/cqlsh#L202

If the 'new' path exists (~/.cassandra/cqlshrc), cqlsh should either WARN or just leave the files alone.

{noformat}
~$ cat .cqlshrc
[authentication]
~$ cat .cassandra/cqlshrc
[connection]
~$ cqlsh
~$ cat .cqlshrc
cat: .cqlshrc: No such file or directory
~$ cat .cassandra/cqlshrc
[authentication]
~$
{noformat}
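The behaviour the issue asks for (warn, or leave both files alone, when the new path already exists) might look something like the sketch below. It is not the code from the cass-9777 branch or from bin/cqlsh; the function name `migrate_cqlshrc` and the wording of the warning are made up for illustration.

```python
import os
import shutil


def migrate_cqlshrc(old_path, new_path):
    """Move a legacy cqlshrc to the new location without clobbering anything.

    Returns a warning string if both files exist (in which case nothing is
    touched), otherwise None.
    """
    if not os.path.exists(old_path):
        # Nothing to migrate.
        return None
    if os.path.exists(new_path):
        # Both files are present: keep both and tell the user which one wins.
        return ("Warning: both %s and %s exist; using %s and leaving %s untouched"
                % (old_path, new_path, new_path, old_path))
    new_dir = os.path.dirname(new_path)
    if new_dir and not os.path.isdir(new_dir):
        os.makedirs(new_dir)
    shutil.move(old_path, new_path)
    return None
```

A caller would pass something like `os.path.expanduser("~/.cqlshrc")` and `os.path.expanduser("~/.cassandra/cqlshrc")` and print the returned warning, if any.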
[jira] [Commented] (CASSANDRA-9304) COPY TO improvements
[ https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646980#comment-14646980 ]

David Kua commented on CASSANDRA-9304:
--

[~Stefania] thank you! I've updated the 9304 branch to resolve most of the points you wrote.

I still need to test capping the number of processes at 4. 12 jobs was just the number of jobs that could be chained at once before the cluster would fail for me. However I was testing within a vagrant box and will be testing on my base machine soon. I couldn't think of a better dynamic number :/

COPY TO improvements

                Key: CASSANDRA-9304
                URL: https://issues.apache.org/jira/browse/CASSANDRA-9304
            Project: Cassandra
         Issue Type: Improvement
         Components: Core
           Reporter: Jonathan Ellis
           Assignee: David Kua
           Priority: Minor
             Labels: cqlsh
            Fix For: 2.1.x

COPY FROM has gotten a lot of love. COPY TO not so much. One obvious improvement could be to parallelize reading and writing (write one page of data while fetching the next).
[jira] [Comment Edited] (CASSANDRA-9304) COPY TO improvements
[ https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633984#comment-14633984 ]

David Kua edited comment on CASSANDRA-9304 at 7/20/15 8:44 PM:
---

https://github.com/dkua/cassandra/tree/9304

In the above branch are my improvements to COPY TO. Which basically amounts to figuring out the token ranges from the token ring, starting some subprocesses, giving each subprocess a subset of the ranges, and have them perform the queries asynchronously and pass each formatted page back to the parent process to write to the CSV file.

The resulting CSV is unordered so changes to the dtests needed to be made, see here: https://github.com/dkua/cassandra-dtest/tree/bulk_export They have also been submitted to the dtest repo on Github as a PR.

A small benchmark was done on a table of 10M rows inside of a Vagrant box with 8 cores. The table was created using the following command `tools/bin/cassandra-stress write n=10M -rate threads=50`. The original single proc version took about 30 minutes to export the table. The multi proc version takes about 7 minutes. [~brianmhess]'s cassandra-unloader takes a little over 2 minutes.

was (Author: dkua):

https://github.com/dkua/cassandra/tree/9304

In the above branch are my improvements to COPY TO. Which basically amounts to figuring out the token ranges from the token ring, starting some subprocesses, giving each subprocess a subset of the ranges, and have them perform the queries asynchronously and pass each formatted page back to the parent process to write to the CSV file.

The resulting CSV is unordered so changes to the dtests needed to be made, see here: https://github.com/dkua/cassandra-dtest/tree/bulk_export They have also been submitted to the dtest repo on Github as a PR.

A small benchmark was done on a table of 10M rows inside of a Vagrant box with 8 cores. The table was created using the following command `tools/bin/cassandra-stress write n=10M -rate threads=50`. The original single proc version took about 30 minutes to export the table. The multi proc version takes about 7 minutes.

COPY TO improvements

                Key: CASSANDRA-9304
                URL: https://issues.apache.org/jira/browse/CASSANDRA-9304
            Project: Cassandra
         Issue Type: Improvement
         Components: Core
           Reporter: Jonathan Ellis
           Assignee: David Kua
           Priority: Minor
             Labels: cqlsh
            Fix For: 2.1.x

COPY FROM has gotten a lot of love. COPY TO not so much. One obvious improvement could be to parallelize reading and writing (write one page of data while fetching the next).
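To make the approach in the comment above a bit more concrete, here is a rough sketch of the first steps only: turning ring tokens into ranges, handing each worker a subset, and building the per-range query. It is an illustration, not the code on the 9304 branch. All function names are hypothetical, the token bounds assume Murmur3Partitioner, and the subprocess and CSV-writing parts are left out.

```python
# Token bounds for Murmur3Partitioner (an assumption of this sketch).
MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1


def token_ranges(ring_tokens):
    """Turn the tokens owned by the nodes into contiguous (start, end) ranges
    that together cover the whole ring."""
    ranges = []
    previous = MIN_TOKEN
    for token in sorted(set(ring_tokens)):
        if token > previous:
            ranges.append((previous, token))
        previous = token
    if previous < MAX_TOKEN:
        ranges.append((previous, MAX_TOKEN))
    return ranges


def assign_ranges(ranges, num_workers):
    """Deal the ranges out round-robin so each worker gets a subset."""
    buckets = [[] for _ in range(num_workers)]
    for index, token_range in enumerate(ranges):
        buckets[index % num_workers].append(token_range)
    return buckets


def range_query(keyspace, table, partition_key, start, end):
    """Build the SELECT a worker would run for one range.

    The lower bound is exclusive except for the very first range, so that
    adjacent ranges never return the same row twice.
    """
    lower_op = ">=" if start == MIN_TOKEN else ">"
    return ("SELECT * FROM %s.%s WHERE token(%s) %s %d AND token(%s) <= %d"
            % (keyspace, table, partition_key, lower_op, start,
               partition_key, end))
```

Each worker would then page through its queries and send formatted rows back to the parent. Because workers finish in arbitrary order, the rows arrive interleaved, which is why the resulting CSV is unordered.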
[jira] [Comment Edited] (CASSANDRA-9304) COPY TO improvements
[ https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633984#comment-14633984 ]

David Kua edited comment on CASSANDRA-9304 at 7/20/15 7:41 PM:
---

https://github.com/dkua/cassandra/tree/9304

In the above branch are my improvements to COPY TO. Which basically amounts to figuring out the token ranges from the token ring, starting some subprocesses, giving each subprocess a subset of the ranges, and have them perform the queries asynchronously and pass each formatted page back to the parent process to write to the CSV file.

The resulting CSV is unordered so changes to the dtests needed to be made, see here: https://github.com/dkua/cassandra-dtest/tree/bulk_export They have also been submitted to the dtest repo on Github as a PR.

A small benchmark was done on a table of 10M rows inside of a Vagrant box with 8 cores. The table was created using the following command `tools/bin/cassandra-stress write n=10M -rate threads=50`. The original single proc version took about 30 minutes to export the table. The multi proc version takes about 7 minutes.

was (Author: dkua):

https://github.com/dkua/cassandra/tree/9304

In the above branch are my improvements to COPY TO. Which basically amounts to figuring out the token ranges from the token ring, starting some subprocesses, giving each subprocess a subset of the ranges, and have them perform the queries asynchronously and pass each formatted page back to the parent process to write to the CSV file.

The resulting CSV is unordered so changes to the dtests needed to be made. They have been submitted to the dtest repo on Github as a PR.

A small benchmark was done on a table of 10M rows inside of a Vagrant box with 8 cores. The table was created using the following command `tools/bin/cassandra-stress write n=10M -rate threads=50`. The original single proc version took about 30 minutes to export the table. The multi proc version takes about 7 minutes.

COPY TO improvements

                Key: CASSANDRA-9304
                URL: https://issues.apache.org/jira/browse/CASSANDRA-9304
            Project: Cassandra
         Issue Type: Improvement
         Components: Core
           Reporter: Jonathan Ellis
           Assignee: David Kua
           Priority: Minor
             Labels: cqlsh
            Fix For: 2.1.x

COPY FROM has gotten a lot of love. COPY TO not so much. One obvious improvement could be to parallelize reading and writing (write one page of data while fetching the next).
[jira] [Comment Edited] (CASSANDRA-9304) COPY TO improvements
[ https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633984#comment-14633984 ]

David Kua edited comment on CASSANDRA-9304 at 7/20/15 7:39 PM:
---

https://github.com/dkua/cassandra/tree/9304

In the above branch are my improvements to COPY TO. Which basically amounts to figuring out the token ranges from the token ring, starting some subprocesses, giving each subprocess a subset of the ranges, and have them perform the queries asynchronously and pass each formatted page back to the parent process to write to the CSV file.

The resulting CSV is unordered so changes to the dtests needed to be made. They have been submitted to the dtest repo on Github as a PR.

A small benchmark was done on a table of 10M rows inside of a Vagrant box with 8 cores. The table was created using the following command `tools/bin/cassandra-stress write n=10M -rate threads=50`. The original single proc version took about 30 minutes to export the table. The multi proc version takes about 7 minutes.

was (Author: dkua):

https://github.com/dkua/cassandra/tree/9304

In the above branch are my improvements to COPY TO. Which basically amounts to figuring out the token ranges from the token ring, starting some subprocesses, giving each subprocess a subset of the ranges, and have them perform the queries asynchronously and pass each formatted page back to the parent process to write to the CSV file.

The resulting CSV is unordered so changes to the dtests needed to be made. They have been submitted to the dtest repo on Github as a PR.

COPY TO improvements

                Key: CASSANDRA-9304
                URL: https://issues.apache.org/jira/browse/CASSANDRA-9304
            Project: Cassandra
         Issue Type: Improvement
         Components: Core
           Reporter: Jonathan Ellis
           Assignee: David Kua
           Priority: Minor
             Labels: cqlsh
            Fix For: 2.1.x

COPY FROM has gotten a lot of love. COPY TO not so much. One obvious improvement could be to parallelize reading and writing (write one page of data while fetching the next).