[
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144508#comment-15144508
]
Stefania commented on CASSANDRA-11053:
--------------------------------------
Here are the latest results:
||MODULE CYTHONIZED||PREPARED STATEMENTS||NUM. WORKER PROCESSES||CHUNK SIZE||AVERAGE ROWS / SEC||TOTAL TIME||APPROX ROWS / SEC IN REAL-TIME (50% -> 95%)||
|NONE|YES|7|1,000|44,115|7' 44"|43,700 -> 44,000|
|NONE|NO|7|1,000|58,345|5' 51"|57,800 -> 58,200|
|DRIVER|YES|7|1,000|77,719|4' 23"|77,300 -> 77,600|
|DRIVER|NO \(*\)|7|1,000|94,508 \(*\)|3' 36"|94,000 -> 95,000|
|DRIVER|YES|15|1,000|78,429|4' 21"|77,900 -> 78,300|
|DRIVER|YES|7|10,000|78,746|4' 20"|78,000 -> 78,500|
|DRIVER|YES|7|5,000|79,337|4' 18"|78,900 -> 79,200|
|DRIVER|YES|8|5,000|81,636|4' 10"|80,900 -> 81,500|
|DRIVER|YES|9|5,000|*82,584*|4' 8"|82,000 -> 82,500|
|DRIVER|YES|10|5,000|82,486|4' 8"|81,800 -> 82,400|
|DRIVER|YES|9|2,500|82,013|4' 9"|81,500 -> 81,900|
|DRIVER + COPYUTIL|YES|9|5,000|*88,187*|3' 52"|87,900 -> 88,100|
|DRIVER + COPYUTIL|NO \(*\)|9|5,000|87,860 \(*\)|3' 53"|99,600 -> 93,800|
I've also saved the results in a
[spreadsheet|https://docs.google.com/spreadsheets/d/1XTE2fSDJkwHzpdaD5HI0HlsFuPCW1Kc1NeqauF6WX2s].
The column on the right contains two approximate observations of the real-time
rate, taken at about the half-way point and just before finishing. Its purpose
is simply to verify that the real-time rate is now accurate and no longer lags
behind as it used to.
The test runs marked with \(*\) were affected by timeouts, indicating that the
cluster had reached capacity. This is to be expected: with non-prepared
statements we shift the parsing burden onto the Cassandra nodes, forcing them
to compile each batch statement as well. I don't consider this a particularly
good approach, as it is only applicable when the cluster is over-sized, and I
therefore focused my efforts and the search for optimal parameters on the case
with prepared statements (the default). In the very last run, we can see how
half-way through we had an average of 99,600 rows per second, which then
plummeted just before finishing due to a long pause (an exponential back-off
policy kicks in on timeouts).
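For reference, the back-off behaviour looks roughly like the following sketch (an illustration of the general policy only, not the actual copyutil code; the base delay and cap values are made up for the example):
{code}
import random

def backoff_delay(consecutive_timeouts, base=2.0, cap=120.0):
    # Illustrative exponential back-off: the pause doubles with each
    # consecutive timeout, up to a cap, with some jitter added.
    delay = min(cap, base * (2 ** consecutive_timeouts))
    return delay * random.uniform(0.5, 1.0)

# After a handful of consecutive timeouts the pause is already tens of
# seconds, which explains the drop in the observed real-time rate above.
print([round(backoff_delay(n), 1) for n in range(6)])
{code}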
The improvements over the [last set of
results|https://issues.apache.org/jira/browse/CASSANDRA-11053?focusedCommentId=15133899&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15133899]
are mostly due to targeted optimizations of the Python code, guided by the
Python [line profiler|https://github.com/rkern/line_profiler]. I've also
reduced the amount of data sent from worker processes to the parent by
aggregating results before sending them, which helped the real-time reporting
tremendously. In addition, I've added support for libev if it is installed, as
described in the driver [installation
guide|https://datastax.github.io/python-driver/installation.html]. Finally, I
fixed a problem with type formatting introduced by the cythonized driver.
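For reference, opting into libev with the driver amounts to something like the following (a minimal sketch based on the driver installation guide; the contact point is a placeholder):
{code}
from cassandra.cluster import Cluster

try:
    # Only available when the libev C extension was built at install time
    from cassandra.io.libevreactor import LibevConnection
except ImportError:
    LibevConnection = None

kwargs = {'connection_class': LibevConnection} if LibevConnection else {}
cluster = Cluster(['127.0.0.1'], **kwargs)  # otherwise the default event loop is used
{code}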
With these improvements, together with those previously adopted, worker and
parent processes are no longer as tightly coupled, so I experimented with the
number of worker processes and the chunk size. The default number of worker
processes is 7 (the number of cores minus 1). However, it seems from
observation that num-cores + 1 gives better results. I monitored virtual memory
statistics with {{dstat}} and the number of running tasks was reasonable (less
than 2 * num-cores). As for the chunk size, the default value of 1,000 is
probably too small; 5,000 seems to be a better value for this particular
dataset and environment. However, I don't propose changing the current default
values, as they are safer for smaller environments such as laptops.
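To make the two knobs concrete, here is a hypothetical sketch of the worker count and chunking logic (illustrative only, not the actual copyutil code):
{code}
import multiprocessing

def default_worker_count():
    # current default: one worker fewer than the number of cores
    return max(1, multiprocessing.cpu_count() - 1)

def observed_best_worker_count():
    # num-cores + 1 performed slightly better in this benchmark
    return multiprocessing.cpu_count() + 1

def chunks(rows, chunk_size=5000):
    # larger chunks mean fewer hand-offs from the feeder to the workers
    for i in range(0, len(rows), chunk_size):
        yield rows[i:i + chunk_size]
{code}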
I've also spent time trying to improve CSV parsing times by comparing
alternatives based on [pandas|http://pandas.pydata.org/],
[numpy|http://www.numpy.org/] and [numba|http://numba.pydata.org/], but none
were worth pursuing further, at least not for this benchmark with its very
simple type conversions (text and integers). For more complex data types, such
as dates or collections, pure Cython conversion functions might help
significantly.
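As an example of the kind of comparison involved, the following sketch contrasts the stdlib csv module with a pandas-based alternative for this simple text/integer schema (illustrative only; the sample data and helper names are made up):
{code}
import csv
import io

import pandas as pd

SAMPLE = "key1,10\nkey2,20\nkey3,30\n"

def parse_with_csv(text):
    # plain csv reader with per-value int() conversion in Python
    return [(row[0], int(row[1])) for row in csv.reader(io.StringIO(text))]

def parse_with_pandas(text):
    # pandas converts types in C, but adds per-call overhead for small chunks
    frame = pd.read_csv(io.StringIO(text), header=None, names=['key', 'value'])
    return list(frame.itertuples(index=False, name=None))

assert parse_with_csv(SAMPLE) == parse_with_pandas(SAMPLE)
{code}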
Whilst I still have a new set of profiler results to analyse, I feel that we
are reaching a point where our efforts could be better spent elsewhere due to
diminishing returns. As a comparison, cassandra-stress with approximately 1KB
partitions inserted 5M rows at a rate of 93k rows per second. As this is well
within 10% of our results, I suggest we consider focusing on alternative
optimizations with wider use cases, such as supporting binary formats for
COPY TO / FROM or optimizing text conversion of complex data types.
> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
> Issue Type: Bug
> Components: Tools
> Reporter: Stefania
> Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt,
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt,
> worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY FROM on a large dataset (20G divided into 20M records) revealed
> two issues:
> * The progress report is incorrect: it is very slow until almost the end of
> the test, at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with
> a smaller cluster locally (approx 35,000 rows per second). As a comparison,
> cassandra-stress manages 50,000 rows per second under the same set-up, and is
> therefore about 1.5 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.