[jira] [Commented] (CASSANDRA-9304) COPY TO improvements

2015-09-02 Thread David Kua (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726901#comment-14726901
 ] 

David Kua commented on CASSANDRA-9304:
--

[~Stefania]

Updates to my 9304 branch now include a parameter for the COPY command that 
allows the number of jobs to be configured. RateMeter was also changed and 
fixed, as an issue with it was found during testing. Testing also found issues 
with ByteOrderedPartitioner and OrderPreservingPartitioner: BOP's tokens don't 
work with the SELECT statements I'm using, and OPP has no token ring, so it 
can't be parallelized. COPY TO now runs as a single process when it encounters 
either of those two partitioners.
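
The fallback described above could be sketched roughly as follows. This is a minimal illustration, not the code in the branch: the function name `effective_jobs`, the `requested_jobs` parameter, and the assumption that the partitioner arrives as a (possibly fully qualified) class name are all hypothetical.

```python
# Hypothetical helper; names are illustrative and not taken from the 9304 branch.
SINGLE_PROCESS_PARTITIONERS = (
    "ByteOrderedPartitioner",
    "OrderPreservingPartitioner",
)


def effective_jobs(partitioner, requested_jobs):
    """Return how many worker processes an export should use.

    `partitioner` is the partitioner class name, possibly fully qualified
    (for example "org.apache.cassandra.dht.Murmur3Partitioner").
    """
    short_name = partitioner.rsplit(".", 1)[-1]
    if short_name in SINGLE_PROCESS_PARTITIONERS:
        # Token-range splitting is not usable here, so export serially.
        return 1
    # Never drop below one process, whatever the user asked for.
    return max(1, requested_jobs)
```

With this shape, the configured job count is honoured for other partitioners and silently reduced to 1 for the two ordered partitioners.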

Tests were updated and can be found here: 
https://github.com/dkua/cassandra-dtest/tree/bulk_export
The cqlsh COPY tests now run against a 3-node cluster, and the row count has 
increased from 1k to 10k. One of the read/write tests now also runs against 
different partitioners, which should cover that case.

> COPY TO improvements
> --------------------
>
>                 Key: CASSANDRA-9304
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9304
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: David Kua
>            Priority: Minor
>              Labels: cqlsh
>             Fix For: 2.1.x
>
>
> COPY FROM has gotten a lot of love.  COPY TO not so much.  One obvious 
> improvement could be to parallelize reading and writing (write one page of 
> data while fetching the next).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9777) If you have a ~/.cqlshrc and a ~/.cassandra/cqlshrc, cqlsh will overwrite the latter with the former

2015-08-05 Thread David Kua (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659093#comment-14659093
 ] 

David Kua commented on CASSANDRA-9777:
--

https://github.com/dkua/cassandra/tree/cass-9777

I've rebased the branch onto the current 2.2 branch and added a commit that 
makes the warning message more detailed, as per [~thobbs]'s suggestion.

> If you have a ~/.cqlshrc and a ~/.cassandra/cqlshrc, cqlsh will overwrite the latter with the former
> ----------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-9777
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9777
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jon Moses
>            Assignee: David Kua
>              Labels: cqlsh
>             Fix For: 2.2.x
>
>
> If you have a ~/.cqlshrc file and a ~/.cassandra/cqlshrc file, when you run 
> `cqlsh`, it will overwrite the latter with the former.  
> https://github.com/apache/cassandra/blob/trunk/bin/cqlsh#L202
> If the 'new' path exists (~/.cassandra/cqlshrc), cqlsh should either WARN or 
> just leave the files alone.
> {noformat}
> ~$ cat .cqlshrc
> [authentication]
> ~$ cat .cassandra/cqlshrc
> [connection]
> ~$ cqlsh
> ~$ cat .cqlshrc
> cat: .cqlshrc: No such file or directory
> ~$ cat .cassandra/cqlshrc
> [authentication]
> ~$
> {noformat}
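
The behaviour the report asks for (warn and leave both files alone when the new path already exists) might look like the sketch below. It is an assumption-laden illustration rather than the actual cqlsh code: `migrate_cqlshrc` and its arguments are hypothetical names, and the warning text is invented for the example.

```python
import os
import sys


def migrate_cqlshrc(old_path, new_path):
    """Move old_path to new_path only if new_path does not already exist.

    Returns True if the file was moved, False otherwise.
    (Hypothetical helper; not the function used in bin/cqlsh.)
    """
    if not os.path.exists(old_path):
        return False
    if os.path.exists(new_path):
        # Both files exist: warn and leave them untouched instead of
        # overwriting the newer location.
        sys.stderr.write(
            "Warning: both %s and %s exist; using %s and leaving %s untouched.\n"
            % (old_path, new_path, new_path, old_path)
        )
        return False
    new_dir = os.path.dirname(new_path)
    if new_dir and not os.path.isdir(new_dir):
        os.makedirs(new_dir)
    os.rename(old_path, new_path)
    return True
```

Called with the two paths from the report, this would keep `[connection]` in ~/.cassandra/cqlshrc and leave ~/.cqlshrc in place.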





[jira] [Commented] (CASSANDRA-9304) COPY TO improvements

2015-07-29 Thread David Kua (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646980#comment-14646980
 ] 

David Kua commented on CASSANDRA-9304:
--

[~Stefania] thank you!

I've updated the 9304 branch to resolve most of the points you raised. I still 
need to test capping the number of processes at 4. 12 was just the number of 
jobs that could be chained at once before the cluster would fail for me. 
However, I was testing within a Vagrant box and will be testing on my base 
machine soon. I couldn't think of a better dynamic number :/



[jira] [Comment Edited] (CASSANDRA-9304) COPY TO improvements

2015-07-20 Thread David Kua (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633984#comment-14633984
 ] 

David Kua edited comment on CASSANDRA-9304 at 7/20/15 8:44 PM:
---

https://github.com/dkua/cassandra/tree/9304

The above branch contains my improvements to COPY TO. They basically amount to 
figuring out the token ranges from the token ring, starting some subprocesses, 
giving each subprocess a subset of the ranges, and having them perform the 
queries asynchronously and pass each formatted page back to the parent process 
to write to the CSV file.
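
The range-splitting part of this approach can be sketched as below. This is a simplified illustration under stated assumptions, not the branch's code: `split_ring`, `assign_ranges` and `range_query` are hypothetical names, tokens are assumed to be integers, and the subprocess and page-passing machinery is left out.

```python
def split_ring(ring_tokens):
    """Turn ring tokens into (start, end] ranges covering the whole ring.

    None marks an open bound, so the first and last ranges together cover
    the wrap-around segment of the ring.
    """
    tokens = sorted(set(ring_tokens))
    if not tokens:
        return [(None, None)]
    bounds = [None] + tokens + [None]
    return list(zip(bounds[:-1], bounds[1:]))


def assign_ranges(ranges, num_workers):
    """Deal ranges out round-robin so each worker gets a similar share."""
    return [ranges[i::num_workers] for i in range(num_workers)]


def range_query(table, partition_key, token_range):
    """Build a SELECT restricted to one token range (start exclusive, end inclusive)."""
    start, end = token_range
    clauses = []
    if start is not None:
        clauses.append("token(%s) > %d" % (partition_key, start))
    if end is not None:
        clauses.append("token(%s) <= %d" % (partition_key, end))
    query = "SELECT * FROM %s" % table
    if clauses:
        query += " WHERE " + " AND ".join(clauses)
    return query
```

Each worker would then run `range_query` for every range in its share and hand the formatted pages back to the parent, which is the only process writing to the CSV file.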

The resulting CSV is unordered, so changes to the dtests needed to be made; see 
here: https://github.com/dkua/cassandra-dtest/tree/bulk_export
They have also been submitted to the dtest repo on GitHub as a PR.
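
Because the export order is no longer deterministic, a test has to compare files by content rather than line by line. One way to do that is sketched below; the helper names are hypothetical and this is not necessarily how the dtests were changed.

```python
import csv


def csv_rows_sorted(path):
    """Read a CSV file and return its rows in a canonical sorted order."""
    with open(path, newline="") as handle:
        return sorted(tuple(row) for row in csv.reader(handle))


def same_rows(path_a, path_b):
    """True if both files contain the same rows, ignoring row order.

    Duplicate rows still count: a row appearing twice in one file must
    appear twice in the other.
    """
    return csv_rows_sorted(path_a) == csv_rows_sorted(path_b)
```

Sorting both sides keeps the comparison strict about duplicates and missing rows while ignoring the order in which workers returned their pages.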

-

A small benchmark was run on a table of 10M rows inside a Vagrant box with 
8 cores. The table was created using the following command: 
`tools/bin/cassandra-stress write n=10M -rate threads=50`.

The original single-process version took about 30 minutes to export the table.
The multi-process version takes about 7 minutes.
[~brianmhess]'s cassandra-unloader takes a little over 2 minutes.





[jira] [Comment Edited] (CASSANDRA-9304) COPY TO improvements

2015-07-20 Thread David Kua (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633984#comment-14633984
 ] 

David Kua edited comment on CASSANDRA-9304 at 7/20/15 7:41 PM:
---

https://github.com/dkua/cassandra/tree/9304

The above branch contains my improvements to COPY TO. They basically amount to 
figuring out the token ranges from the token ring, starting some subprocesses, 
giving each subprocess a subset of the ranges, and having them perform the 
queries asynchronously and pass each formatted page back to the parent process 
to write to the CSV file.

The resulting CSV is unordered, so changes to the dtests needed to be made; see 
here: https://github.com/dkua/cassandra-dtest/tree/bulk_export
They have also been submitted to the dtest repo on GitHub as a PR.

-

A small benchmark was run on a table of 10M rows inside a Vagrant box with 
8 cores. The table was created using the following command: 
`tools/bin/cassandra-stress write n=10M -rate threads=50`.

The original single-process version took about 30 minutes to export the table.
The multi-process version takes about 7 minutes.





[jira] [Comment Edited] (CASSANDRA-9304) COPY TO improvements

2015-07-20 Thread David Kua (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633984#comment-14633984
 ] 

David Kua edited comment on CASSANDRA-9304 at 7/20/15 7:39 PM:
---

https://github.com/dkua/cassandra/tree/9304

The above branch contains my improvements to COPY TO. They basically amount to 
figuring out the token ranges from the token ring, starting some subprocesses, 
giving each subprocess a subset of the ranges, and having them perform the 
queries asynchronously and pass each formatted page back to the parent process 
to write to the CSV file.

The resulting CSV is unordered, so changes to the dtests needed to be made. They 
have been submitted to the dtest repo on GitHub as a PR.

-

A small benchmark was run on a table of 10M rows inside a Vagrant box with 
8 cores. The table was created using the following command: 
`tools/bin/cassandra-stress write n=10M -rate threads=50`.

The original single-process version took about 30 minutes to export the table.
The multi-process version takes about 7 minutes.


