[ https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130041#comment-15130041 ]
Stefania commented on CASSANDRA-11053:
--------------------------------------

I've repeated the benchmark after performing the following:

* Moved csv decoding from the parent to the worker processes
* Switched to a cythonized driver installation (for cassandra-2.1 we need to use driver version 2.7.2)

The patch is [here|https://github.com/stef1927/cassandra/tree/11053-2.1]. The set-up and raw results are in the attached _copy_from_large_benchmark_2.txt_, along with the new profiler results, _parent_profile_2.txt_ and _worker_profiles_2.txt_.

The rate has increased from *35,000* to *58,000* rows per second for the 1KB test:

{code}
cqlsh> COPY test.test1kb FROM 'DSEBulkLoadTest/in/data1KB/*.csv';
Using 7 child processes

Starting copy of test.test1kb with columns ['pkey', 'ccol', 'data'].
Processed: 20480000 rows; Rate: 63987 rows/s; Avg. rate: 58749 rows/s
20480000 rows imported from 20 files in 5 minutes and 48.605 seconds (0 skipped).
{code}

The progress reporting looks much better now: because the parent process no longer spends time decoding csv data, it has more time to receive data and update the progress. The parent process is fine; optimizing it further wouldn't matter much, since it spends most of its time receiving data (289 out of 437 seconds). The worker processes can still be improved. Here is where we currently spend most of the time and what I am looking at improving (see the sketch below):

{code}
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    3.606    3.606  432.297  432.297 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1743(run_normal)
   158538   86.237    0.001  245.629    0.002 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1800(send_normal_batch)
   161485   67.401    0.000  167.543    0.001 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1879(split_batches)
   158538   17.603    0.000   84.718    0.001 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1820(convert_rows)
   158538   46.302    0.000   73.577    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1874(execute_statement)
  2947000   37.470    0.000   61.277    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1918(get_replica)
  2947000   27.978    0.000   60.145    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1596(get_row_values)
  2947000    9.383    0.000   32.285    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1625(get_row_partition_key_values)
  8841000   21.195    0.000   31.220    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1600(convert)
  2947000   20.328    0.000   21.711    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1630(serialize)
  2947000    9.220    0.000   20.478    0.000 {filter}
     2948    0.040    0.000   15.513    0.005 /usr/lib/python2.7/multiprocessing/queues.py:113(get)
     2948   12.770    0.004   12.770    0.004 {method 'recv' of '_multiprocessing.Connection' objects}
  8841000   11.258    0.000   11.258    0.000 /data/automaton/cassandra-src/bin/../pylib/cqlshlib/copyutil.py:1925(<lambda>)
   158558    7.525    0.000   10.034    0.000 /usr/local/lib/python2.7/dist-packages/cassandra_driver-2.7.2-py2.7-linux-x86_64.egg/cassandra/io/libevreactor.py:355(push)
{code}

I note that 58,000 rows per second is an upper limit both locally, against a single cassandra node running on the same laptop, and on an {{r3.2xlarge}} AWS instance running against a large cassandra cluster of 8 nodes on {{i2.2xlarge}} AWS instances. So the bottleneck is still in the importer, even after these initial improvements.
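For reference, this is roughly the shape of the worker-side flow that the profile above measures: decode the csv chunk in the worker (rather than in the parent), group the rows by the replica that owns their partition key so batches stay token-aware, and hand each batch to the driver. This is only a minimal sketch; the names {{decode_rows}}, {{send_batch}} and the simplified {{split_batches}}/{{run_normal}} are illustrative placeholders, not the actual copyutil.py implementation:

{code}
# Illustrative sketch only: approximates the worker loop seen in the
# profile above (run_normal -> split_batches -> get_replica). All
# names are hypothetical, not the real copyutil.py API.
import csv
from collections import defaultdict

def decode_rows(chunk):
    # CSV decoding now happens in the worker, not the parent process.
    return list(csv.reader(chunk.splitlines()))

def split_batches(rows, get_replica, batch_size=8):
    # Group rows by the replica owning their partition key, so each
    # batch is sent to a node that actually stores the data
    # (token-aware routing); the profile shows this is a hot path.
    by_replica = defaultdict(list)
    for row in rows:
        by_replica[get_replica(row[0])].append(row)
    for replica, replica_rows in by_replica.items():
        for i in range(0, len(replica_rows), batch_size):
            yield replica, replica_rows[i:i + batch_size]

def run_normal(inmsg, get_replica, send_batch):
    # Main worker loop: pull raw csv chunks off the queue, decode
    # them, split into per-replica batches and send each batch.
    while True:
        chunk = inmsg.get()   # blocks on multiprocessing.Queue.get()
        if chunk is None:     # sentinel: parent signals no more work
            break
        for replica, batch in split_batches(decode_rows(chunk), get_replica):
            send_batch(replica, batch)
{code}

Per the profile, {{send_normal_batch}} and {{split_batches}} dominate the remaining worker time, so reducing per-row work in that grouping step is where the next gains should come from.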
The same is true for cassandra-loader: it won't move much beyond 50,000 rows per second with the 1KB benchmark (using default parameters and a batch size of 8):

{code}
time ./cassandra-loader -f DSEBulkLoadTest/in/data1KB -host 172.31.16.79 -schema "test.test1kb(pkey,ccol,data)" -batchSize 8
[...]
Lines Processed:  20480019  Rate:  50073.39608801956

real    6m50.633s
user    19m0.819s
sys     3m25.961s
{code}

Is this the right command [~brianmhess]?

> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-11053
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>         Attachments: copy_from_large_benchmark.txt, copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY FROM on a large dataset (20G divided into 20M records) revealed two issues:
> * The progress report is incorrect: it is very slow until almost the end of the test, at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with a smaller cluster locally (approx. 35,000 rows per second). As a comparison, cassandra-stress manages 50,000 rows per second under the same set-up, i.e. about 1.5 times faster.
> See the attached file _copy_from_large_benchmark.txt_ for the benchmark details.