[ https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15162800#comment-15162800 ]

Stefania commented on CASSANDRA-11053:
--------------------------------------

The ticket is ready for review [~aholmber].

The patch: https://github.com/stef1927/cassandra/tree/11053-2.1
The dtest branch: https://github.com/stef1927/cassandra-dtest/tree/11053
The dtest results: 
http://cassci.datastax.com/job/stef1927-11053-2.1-dtest/lastCompletedBuild/testReport/

The patch targets 2.1 mostly so that it can be compared with cassandra-loader. We 
must decide whether we want this in 2.1 or 2.2. Technically it belongs in 2.2, 
but the original COPY FROM performance ticket (CASSANDRA-9302) went in as new 
functionality in 2.1, so I am not entirely sure; cc [~jbellis].

h3. Performance results

The full results are available in this 
[spreadsheet|https://docs.google.com/spreadsheets/d/1XTE2fSDJkwHzpdaD5HI0HlsFuPCW1Kc1NeqauF6WX2s/edit#gid=0].

h4. Executive Summary

The optimizations introduced by this patch have increased the import speed 
(average rows per second) for the 1KB test (20,480,000 rows of 1KB each, as 
specified in this 
[benchmark|https://github.com/brianmhess/DSEBulkLoadTest.git]) from *35k* to 
*117k* on an 8-core VM and to *190k* on a 16-core VM. To achieve this, the 
following measures were adopted, listed in order of importance:

* the driver must be installed with its C extensions (Cython and libev) and selected by setting {{export CQLSH_NO_BUNDLED=TRUE}}
* the number of worker processes and the chunk size COPY FROM options must be tuned for maximum performance by watching performance stats ({{dstat -lvrn 10}}) and minimizing CPU idle time
* the minimum and maximum batch sizes must be increased when close to cluster saturation or in the presence of VNODES
* Linux CPU scheduling must be set to {{SCHED_BATCH}} via {{schedtool -B}}
* the cqlsh copyutil module must also be cythonized via {{python setup.py build_ext --inplace}}
* the clock source must be set to {{tsc}}

For further details refer to the setup instructions in the spreadsheet.
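
To make the list concrete, here is a minimal sketch of the client-side setup, assuming an Ubuntu 14.04 host and a Cassandra source checkout; the package names, the location of {{setup.py}} and the {{import.cql}} file name are illustrative rather than taken from the spreadsheet:

{code}
# install the Python driver with its C extensions (requires Cython and the libev headers)
sudo apt-get install -y python-dev libev4 libev-dev
pip install cython
pip install cassandra-driver

# make cqlsh use the installed driver instead of the bundled pure-Python one
export CQLSH_NO_BUNDLED=TRUE

# cythonize the cqlsh copyutil module (run from the directory containing setup.py,
# e.g. pylib/ in the Cassandra checkout)
python setup.py build_ext --inplace

# check the clock source and switch it to tsc if necessary
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
echo tsc | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource

# run the import with batch CPU scheduling (SCHED_BATCH)
schedtool -B -e bin/cqlsh -f import.cql
{code}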

h4. Setup

The spreadsheet above contains full setup details, including the commands used to 
launch and configure the AWS instances. All tests were run on Ubuntu 14.04 with 
the clock source set to {{tsc}}, using prepared statements. The client was running 
cassandra-2.1 with this ticket's patch, and the servers were running 
cassandra-2.1 HEAD with the following customizations:

{code}
batch_size_warn_threshold_in_kb: 65
compaction_throughput_mb_per_sec: 0
auto_snapshot: false
{code}


h4. R3.2XLARGE client and 8 node I2.2XLARGE cluster

This is the set-up that has been used so far. These tests give the final 
numbers and highlight the impact of cythonizing modules:

||MODULE CYTHONIZED||NUM WORKER PROCESSES||CHUNK SIZE||AVG. ROWS / SEC||
|NONE|12|5,000|70k|
|DRIVER|12|5,000|110k|
|DRIVER + COPYUTIL|12|5,000|117k|

The number of worker processes and the chunk size were tuned to their optimal 
values over several trials. It was noted that the optimal number of worker 
processes is slightly higher than the number of cores. CPU usage was observed 
with {{dstat}} in order to minimize idle time and to maximize the number of 
network packets sent.
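
For reference, this is the monitoring command from the list above; the final argument is the sampling interval in seconds:

{code}
# report load average, vmstat-style CPU figures, disk I/O requests and network
# throughput every 10 seconds; tune the number of worker processes and the chunk
# size until the CPU idle column stays close to zero
dstat -lvrn 10
{code}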

Here is the impact of introducing VNODES with 256 tokens per node:

||MODULE CYTHONIZED||NUM WORKER PROCESSES||CHUNK SIZE||ADDITIONAL PARAMS||AVG. ROWS / SEC||
|DRIVER|12|5,000| |86k \(T\)|
|DRIVER|12|10,000| |68k \(T\)|
|DRIVER|12|20,000| |66k \(T\)|
|DRIVER|12|40,000| |63k \(T\)|
|DRIVER|12|20,000|"MIN_BATCH_SIZE=20 MAX_BATCH_SIZE=30"|95k \(B\)|
|DRIVER|12|30,000|"MIN_BATCH_SIZE=30 MAX_BATCH_SIZE=40"|101k \(B\)|
|DRIVER|12|40,000|"MIN_BATCH_SIZE=30 MAX_BATCH_SIZE=40"|105k \(B\)|

*\(T\)* indicates timeouts, which degraded average performance. 
*\(B\)* indicates batch sizes larger than the defaults (MIN_BATCH_SIZE=10 
MAX_BATCH_SIZE=20), a workaround adopted to mitigate the timeouts, as the 
results show.

Another workaround for the timeouts was to increase the cluster size; this was 
adopted in the next set of tests.
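
For illustration only, a run with the larger batch sizes could look roughly like this; it assumes, as the ADDITIONAL PARAMS column suggests, that the patched cqlsh picks the minimum and maximum batch sizes up from the environment, and the keyspace, table and CSV file names are made up:

{code}
# assumption: MIN_BATCH_SIZE / MAX_BATCH_SIZE are read from the environment by
# the patched cqlsh; ks.test1kb and data.csv are illustrative names
MIN_BATCH_SIZE=30 MAX_BATCH_SIZE=40 \
  bin/cqlsh -e "COPY ks.test1kb FROM 'data.csv'"
{code}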

h4. C3.2XLARGE client and 15 node C3.2XLARGE cluster

To investigate scaling and improvements with larger clusters, we switched to AWS 
instances with less memory (COPY FROM does not need it) and less storage (by 
truncating the table after every measurement and setting {{auto_snapshot}} to 
false in cassandra.yaml, we were able to run the same dataset with less disk 
space). All tests were run with only the driver cythonized, not copyutil, and 
using prepared statements.

This is the impact of batch CPU scheduling ({{schedtool -B}}):

||NUM WORKER PROCESSES||CHUNK SIZE||MIN-MAX BATCH SIZE||{{schedtool -B}}||AVG. ROWS / SEC||
|12|5000|10-20 (default)|NO|99k|
|12|10000|10-20 (default)|NO|97k|
|12|5000|20-40|NO|111k|
|12|10000|20-40|NO|109k|
|12|5000|30-50|NO|114k|
|12|10000|30-50|NO|113k|
|12|5000|10-20 (default)|YES|107k|
|12|10000|10-20 (default)|YES|106k|
|12|5000|20-40|YES|117k|
|12|10000|20-40|YES|115k|
|12|5000|30-50|YES|117k|
|12|10000|30-50|YES|117k|

This is the impact of VNODES (NUM TOKENS > 1), all measurements taken with 
{{schedtool -B}}:

||NUM TOKENS||NUM WORKER PROCESSES||CHUNK SIZE||MIN-MAX BATCH SIZE||AVG. ROWS / SEC||
|1|12|5000|30-50|117k|
|1|12|10000|30-50|117k|
|64|12|5000|30-50|115k|
|64|12|10000|30-50|113k|
|128|12|5000|30-50|109k|
|128|12|10000|30-50|109k|
|256|12|10000|30-50|109k|
|256|12|20000|30-50|109k|

Running with smaller batch sizes has an even larger impact; refer to the 
spreadsheet for details.

h4. C3.4XLARGE client and 16 node C3.2XLARGE cluster

Here the client was scaled by upgrading from 8 to 16 cores. All tests were run 
without VNODES (NUM TOKENS = 1) and with only the driver cythonized, using 
prepared statements.

||NUM WORKER PROCESSES||CHUNK SIZE||MIN-MAX BATCH SIZE||AVG. ROWS / SEC||NOTES||
|28|5000|10-20 (default)|100k to 190k|timeouts|
|24|5000|10-20 (default)|100k to 190k|timeouts|
|20|5000|10-20 (default)|100k to 190k|timeouts|
|16|5000|10-20 (default)|100k to 190k|timeouts|
|20|5000|10-20 (default)|167k|INGESTRATE set to 170000|
|20|10000|20-40|192k, 181k|hardly any timeouts|
|16|10000|20-40|182k|hardly any timeouts|
|20|20000|20-40|181k|hardly any timeouts|
|20|40000|20-40|149k|hardly any timeouts|
|20|10000|30-50|192k, 191k|no timeouts|

Due to timeouts, the average import rate was very inconsistent. Typically it 
would import 190k rows per second until about half of the dataset was imported, 
and then it would freeze for variable periods of time due to cluster 
saturation. 

The workarounds used to mitigate the timeouts were either to fix the ingest rate 
or to increase the batch size. Fixing the ingest rate at 170k resulted in a 
sustained import rate of 167k.
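
A hedged sketch of the ingest-rate workaround, assuming INGESTRATE is accepted as a COPY FROM option by the patched cqlsh (the exact mechanism is not spelled out above), with illustrative keyspace, table and file names:

{code}
# cap the ingest rate at 170k rows per second to avoid saturating the cluster;
# assumption: INGESTRATE is a COPY FROM option in this patch
bin/cqlsh -e "COPY ks.test1kb FROM 'data.csv' WITH INGESTRATE = 170000"
{code}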

Increasing the min batch size from the default of 10 to 30 and the max batch 
size from 20 to 50 resulted in a consistent rate of 190k.

Overall, comparing these results with the previous results without VNODES 
indicates that COPY FROM scales from 117k to 190k rows per second when moving 
from 8 to 16 client cores.

h3. Some possible follow ups for future tickets

Here are some suggestions for improving performance further:

* Use the driver at a much lower level (messages and borrowed connections 
rather than statements and futures)
* Read data from files and transfer it to worker processes faster, perhaps with 
shared memory
* Support binary formats
* Implement CSV parsing and type conversion functions in Cython (important for 
complex data types such as dates and collections)
* Implement message coalescing

> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-11053
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>         Attachments: copy_from_large_benchmark.txt, 
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, 
> worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY FROM on a large dataset (20 GB divided into 20M records) revealed 
> two issues:
> * The progress report is incorrect: it advances very slowly until almost the 
> end of the test, at which point it catches up extremely quickly.
> * The performance in rows per second is similar to that of smaller tests run 
> against a smaller local cluster (approx. 35,000 rows per second). As a 
> comparison, cassandra-stress manages 50,000 rows per second under the same 
> set-up, i.e. about 1.5 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.


