[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-03-28 Thread Joshua McKenzie (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua McKenzie updated CASSANDRA-11053:

Resolution: Fixed
Status: Resolved  (was: Ready to Commit)

[Committed|https://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=commit;h=a9b5422057054b0ba612164d56d7cce5567e48df]

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
>  Labels: doc-impacting
> Fix For: 2.1.14, 2.2.6, 3.0.5, 3.5
>
> Attachments: bisect_test.py, copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> h5. Description
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.
> h5. Doc-impacting changes to COPY FROM options
> * A new option was added: PREPAREDSTATEMENTS - it indicates if prepared 
> statements should be used; it defaults to true.
> * The default value of CHUNKSIZE changed from 1000 to 5000.
> * The default value of MINBATCHSIZE changed from 2 to 10.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-03-22 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-11053:
-
Status: Ready to Commit  (was: Patch Available)

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
>  Labels: doc-impacting
> Fix For: 2.1.14, 2.2.6, 3.0.5, 3.5
>
> Attachments: bisect_test.py, copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> h5. Description
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.
> h5. Doc-impacting changes to COPY FROM options
> * A new option was added: PREPAREDSTATEMENTS - it indicates if prepared 
> statements should be used; it defaults to true.
> * The default value of CHUNKSIZE changed from 1000 to 5000.
> * The default value of MINBATCHSIZE changed from 2 to 10.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-03-21 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-11053:
-
Attachment: bisect_test.py

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
>  Labels: doc-impacting
> Fix For: 2.1.14, 2.2.6, 3.0.5, 3.5
>
> Attachments: bisect_test.py, copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> h5. Description
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.
> h5. Doc-impacting changes to COPY FROM options
> * A new option was added: PREPAREDSTATEMENTS - it indicates if prepared 
> statements should be used; it defaults to true.
> * The default value of CHUNKSIZE changed from 1000 to 5000.
> * The default value of MINBATCHSIZE changed from 2 to 10.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-03-19 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-11053:
-
Status: Patch Available  (was: In Progress)

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
>  Labels: doc-impacting
> Fix For: 2.1.14, 2.2.6, 3.0.5, 3.5
>
> Attachments: copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> h5. Description
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.
> h5. Doc-impacting changes to COPY FROM options
> * A new option was added: PREPAREDSTATEMENTS - it indicates if prepared 
> statements should be used; it defaults to true.
> * The default value of CHUNKSIZE changed from 1000 to 5000.
> * The default value of MINBATCHSIZE changed from 2 to 10.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-03-08 Thread Sylvain Lebresne (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Lebresne updated CASSANDRA-11053:
-
   Resolution: Fixed
Fix Version/s: (was: 3.0.x)
   (was: 2.2.x)
   (was: 2.1.x)
   (was: 3.x)
   3.5
   3.0.5
   2.2.6
   2.1.14
   Status: Resolved  (was: Ready to Commit)

Committed, thanks.

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
>  Labels: doc-impacting
> Fix For: 2.1.14, 2.2.6, 3.0.5, 3.5
>
> Attachments: copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> h5. Description
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.
> h5. Doc-impacting changes to COPY FROM options
> * A new option was added: PREPAREDSTATEMENTS - it indicates if prepared 
> statements should be used; it defaults to true.
> * The default value of CHUNKSIZE changed from 1000 to 5000.
> * The default value of MINBATCHSIZE changed from 2 to 10.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-03-03 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-11053:
-
Description: 
h5. Description

Running COPY from on a large dataset (20G divided in 20M records) revealed two 
issues:

* The progress report is incorrect, it is very slow until almost the end of the 
test at which point it catches up extremely quickly.

* The performance in rows per second is similar to running smaller tests with a 
smaller cluster locally (approx 35,000 rows per second). As a comparison, 
cassandra-stress manages 50,000 rows per second under the same set-up, 
therefore resulting 1.5 times faster. 

See attached file _copy_from_large_benchmark.txt_ for the benchmark details.

h5. Doc-impacting changes to COPY FROM options

* A new option was added: PREPAREDSTATEMENTS - it indicates if prepared 
statements should be used; it defaults to true.
* The default value of CHUNKSIZE changed from 1000 to 5000.
* The default value of MINBATCHSIZE changed from 2 to 10.

  was:
Running COPY from on a large dataset (20G divided in 20M records) revealed two 
issues:

* The progress report is incorrect, it is very slow until almost the end of the 
test at which point it catches up extremely quickly.

* The performance in rows per second is similar to running smaller tests with a 
smaller cluster locally (approx 35,000 rows per second). As a comparison, 
cassandra-stress manages 50,000 rows per second under the same set-up, 
therefore resulting 1.5 times faster. 

See attached file _copy_from_large_benchmark.txt_ for the benchmark details.


> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
>  Labels: doc-impacting
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> h5. Description
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.
> h5. Doc-impacting changes to COPY FROM options
> * A new option was added: PREPAREDSTATEMENTS - it indicates if prepared 
> statements should be used; it defaults to true.
> * The default value of CHUNKSIZE changed from 1000 to 5000.
> * The default value of MINBATCHSIZE changed from 2 to 10.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-03-03 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-11053:
-
Status: Ready to Commit  (was: Patch Available)

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
>  Labels: doc-impacting
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> h5. Description
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.
> h5. Doc-impacting changes to COPY FROM options
> * A new option was added: PREPAREDSTATEMENTS - it indicates if prepared 
> statements should be used; it defaults to true.
> * The default value of CHUNKSIZE changed from 1000 to 5000.
> * The default value of MINBATCHSIZE changed from 2 to 10.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-03-03 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-11053:
-
Labels: doc-impacting  (was: )

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
>  Labels: doc-impacting
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-03-01 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-11053:
-
Status: Patch Available  (was: In Progress)

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-03-01 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-11053:
-
Status: In Progress  (was: Patch Available)

There are some tests failing on Windows, resuming progress.

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-02-23 Thread Paulo Motta (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-11053:

Reviewer: Adam Holmberg  (was: Paulo Motta)

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-02-16 Thread Paulo Motta (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-11053:

Reviewer: Paulo Motta

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-02-03 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-11053:
-
Attachment: (was: copy_from_large_benchmark_with_latest_results.txt)

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-02-03 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-11053:
-
Attachment: copy_from_large_benchmark_2.txt

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-02-03 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-11053:
-
Attachment: (was: copy_from_large_benchmark_2.txt)

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-02-03 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-11053:
-
Attachment: copy_from_large_benchmark_2.txt

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-02-02 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-11053:
-
Attachment: parent_profile_2.txt
worker_profiles_2.txt
copy_from_large_benchmark_with_latest_results.txt

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_with_latest_results.txt, parent_profile.txt, 
> parent_profile_2.txt, worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-02-01 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-11053:
-
Attachment: parent_profile.txt
worker_profiles.txt

I've repeated the test in the exact some conditions as described above with 
{{cProfile}} profiling all processes. I am attaching full profile results 
(_worker_profiles.txt_ and _parent_profile.txt_). 

The total test time was approx 15 minutes (900 seconds), of which 15 seconds 
were an artificial sleep in the parent to allow workers to dump their profile 
results.

It is clear that with these large datasets we can no longer afford to read all 
data in the parent and dish out rows as it has been the approach so far. We 
spend in fact over 600 seconds in {{read_rows}}. We also spend significant time 
in the worker processes receiving data (30 seconds). Distributing file names to 
workers and letting them do all the work is pretty easy to do and would solve 
these two issues. However it comes with some consequences:

* We would end up with one process per file unless we somehow split large files 
but splitting large files would take time and users can prepare their data 
themselves. Further, COPY TO can now export to multiple files. Therefore I 
think we should keep things simple and adapt our bulk tests to export to 
multiple files.
* Either we change the meaning of the *max ingest rate* and make it per worker 
process, or we would need to use a global lock which could become a bottleneck. 
I would prefer changing the meaning of max ingest rate as users can always 
specify a rate that is equal to {{max_rate / num_processes}} if they really 
need to.
* To keep things simple, retries would be best handled by worker processes and 
therefore if one process fails then the import fails at least partially; I 
think we can live with this. 

In terms of the worker processes, there is room for improvement there too but 
it is not as straightforward. One interesting thing to do would be to use a 
cythonized driver version but this would not work out of the box due to the 
formatting hooks we inject in the driver. We spend a lot of time batching 
records, getting the replicas, binding parameters and hashing (_murmur3).

WDYK [~pauloricardomg] and [~thobbs]?

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt, parent_profile.txt, 
> worker_profiles.txt
>
>
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-02-01 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-11053:
-
Attachment: (was: copy_from_large_benchmark.txt)

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt, parent_profile.txt, 
> worker_profiles.txt
>
>
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-02-01 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-11053:
-
Attachment: copy_from_large_benchmark.txt

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt, parent_profile.txt, 
> worker_profiles.txt
>
>
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-01-20 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-11053:
-
Attachment: (was: copy_from_large_benchmark.txt)

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt
>
>
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

2016-01-20 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-11053:
-
Attachment: copy_from_large_benchmark.txt

> COPY FROM on large datasets: fix progress report and debug performance
> --
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Stefania
>Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt
>
>
> Running COPY from on a large dataset (20G divided in 20M records) revealed 
> two issues:
> * The progress report is incorrect, it is very slow until almost the end of 
> the test at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with 
> a smaller cluster locally (approx 35,000 rows per second). As a comparison, 
> cassandra-stress manages 50,000 rows per second under the same set-up, 
> therefore resulting 1.5 times faster. 
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)