[ https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130041#comment-15130041 ]
Stefania commented on CASSANDRA-11053:
--------------------------------------

I've repeated the benchmark after performing the following:

* Moved csv decoding from the parent to the worker processes
* Switched to a cythonized driver installation (for cassandra-2.1 we need to use driver version 2.7.2)

The patch is [here|https://github.com/stef1927/cassandra/tree/11053-2.1]. The set-up and raw results are in the attached _copy_from_large_benchmark_2.txt_, along with the new profiler results, _parent_profile_2.txt_ and _worker_profiles_2.txt_.

The rate has increased from *35,000* to *58,000* rows per second for the 1KB test:

{code}
cqlsh> COPY test.test1kb FROM 'DSEBulkLoadTest/in/data1KB/*.csv';
Using 7 child processes

Starting copy of test.test1kb with columns ['pkey', 'ccol', 'data'].
Processed: 20480000 rows; Rate: 63987 rows/s; Avg. rate: 58749 rows/s
20480000 rows imported from 20 files in 5 minutes and 48.605 seconds (0 skipped).
{code}

The progress reporting looks much better now: because the parent process no longer spends time decoding csv data, it has more time to receive data and update the progress. The parent process is fine; optimizing it further wouldn't matter much, since it spends most of its time receiving data (289 out of 437 seconds). The worker processes can still be improved. Here is where we currently spend most of the time and what I am looking at improving (see the sketch below):

{code}
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    3.606    3.606  432.297  432.297 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1743(run_normal)
   158538   86.237    0.001  245.629    0.002 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1800(send_normal_batch)
   161485   67.401    0.000  167.543    0.001 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1879(split_batches)
   158538   17.603    0.000   84.718    0.001 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1820(convert_rows)
   158538   46.302    0.000   73.577    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1874(execute_statement)
  2947000   37.470    0.000   61.277    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1918(get_replica)
  2947000   27.978    0.000   60.145    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1596(get_row_values)
  2947000    9.383    0.000   32.285    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1625(get_row_partition_key_values)
  8841000   21.195    0.000   31.220    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1600(convert)
  2947000   20.328    0.000   21.711    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1630(serialize)
  2947000    9.220    0.000   20.478    0.000 {filter}
     2948    0.040    0.000   15.513    0.005 /usr/lib/python2.7/multiprocessing/queues.py:113(get)
     2948   12.770    0.004   12.770    0.004 {method 'recv' of '_multiprocessing.Connection' objects}
  8841000   11.258    0.000   11.258    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1925(<lambda>)
   158558    7.525    0.000   10.034    0.000 /usr/local/lib/python2.7/dist-packages/cassandra_driver-2.7.2-py2.7-linux-x86_64.egg/cassandra/io/libevreactor.py:355(push)
{code}

I note that 58,000 rows per second is an upper limit both locally, against a single cassandra node running on the same laptop, and on an {{r3.2xlarge}} AWS instance running against a large cassandra cluster of 8 nodes on {{i2.2xlarge}} AWS instances. So the bottleneck is still in the importer, even after these initial improvements.
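For reference, this is roughly the shape of the worker-side flow that the profile above measures: decode the csv chunk in the worker (rather than in the parent), group the rows by the replica that owns their partition key so batches stay token-aware, and hand each batch to the driver. This is only a minimal sketch; the names {{decode_rows}}, {{send_batch}} and the simplified {{split_batches}}/{{run_normal}} are illustrative placeholders, not the actual copyutil.py implementation:

{code}
# Illustrative sketch only: approximates the worker loop seen in the
# profile above (run_normal -> split_batches -> get_replica). All
# names are hypothetical, not the real copyutil.py API.
import csv
from collections import defaultdict

def decode_rows(chunk):
    # CSV decoding now happens in the worker, not the parent process.
    return list(csv.reader(chunk.splitlines()))

def split_batches(rows, get_replica, batch_size=8):
    # Group rows by the replica owning their partition key, so each
    # batch is sent to a node that actually stores the data
    # (token-aware routing); the profile shows this is a hot path.
    by_replica = defaultdict(list)
    for row in rows:
        by_replica[get_replica(row[0])].append(row)
    for replica, replica_rows in by_replica.items():
        for i in range(0, len(replica_rows), batch_size):
            yield replica, replica_rows[i:i + batch_size]

def run_normal(inmsg, get_replica, send_batch):
    # Main worker loop: pull raw csv chunks off the queue, decode
    # them, split into per-replica batches and send each batch.
    while True:
        chunk = inmsg.get()   # blocks on multiprocessing.Queue.get()
        if chunk is None:     # sentinel: parent signals no more work
            break
        for replica, batch in split_batches(decode_rows(chunk), get_replica):
            send_batch(replica, batch)
{code}

Per the profile, {{send_normal_batch}} and {{split_batches}} dominate the remaining worker time, so reducing per-row work in that grouping step is where the next gains should come from.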
The same is true for cassandra-loader: it won't move much beyond 50,000 rows per second with the 1KB benchmark (using default parameters and a batch size of 8):

{code}
time ./cassandra-loader -f DSEBulkLoadTest/in/data1KB -host 172.31.16.79 -schema "test.test1kb(pkey,ccol,data)" -batchSize 8
[...]
Lines Processed:  20480019  Rate:  50073.39608801956

real    6m50.633s
user    19m0.819s
sys     3m25.961s
{code}

Is this the right command [~brianmhess]?

> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-11053
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>         Attachments: copy_from_large_benchmark.txt, copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY FROM on a large dataset (20G divided into 20M records) revealed two issues:
> * The progress report is incorrect: it is very slow until almost the end of the test, at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with a smaller cluster locally (approx. 35,000 rows per second). As a comparison, cassandra-stress manages 50,000 rows per second under the same set-up, i.e. about 1.5 times faster.
> See the attached file _copy_from_large_benchmark.txt_ for the benchmark details.