[
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156755#comment-15156755
]
Stefania commented on CASSANDRA-11053:
--------------------------------------
bq. I would like to repeat the tests on an AWS instance with twice the number
of cores, to see how much we can scale.
I've tested with an {{r3.4xlarge}} (16 cores) and with vnodes enabled (256
tokens per host). These tests have highlighted two further areas of improvement:
* One of the initial optimizations, batching by ring position, is not as
effective with vnodes because of the increased number of ring positions: with
2048 positions (8 hosts x 256 tokens), we cannot build effective batches from
a chunk of 5000 rows if all partition keys are unique. I've therefore taken a
step back and re-introduced batching by replica, albeit in a way that does not
penalize the case where we do have sufficient rows for a given ring position.
I've also introduced a new load balancing policy that avoids converting a
partition key into a token twice (once during batching and once when the
driver consults the default token-aware policy); a sketch of such a policy
follows this list.
* I noticed that on the larger machine the worker processes were spending too
much time waiting to receive data. Having the parent process both read files
and wait for results became a careful balancing act over how much time to
dedicate to receiving results: too much, and the worker processes were starved
of input; too little, and the progress report lagged behind. Since Python
threads cannot run in parallel due to the GIL, I've moved the file reading
into a separate child process (see the process-layout sketch after this list).
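For illustration, here is a minimal sketch of such a replica-aware policy
built on the Python driver's {{RoundRobinPolicy}}; the {{replica}} attribute
on the statement is an assumption standing in for whatever the batching code
attaches, not the actual implementation:
{code:python}
# A minimal sketch, not the actual patch: route each statement to the
# replica already chosen at batching time, so the driver does not hash
# the partition key into a token a second time.
from cassandra.policies import RoundRobinPolicy

class ReplicaAwarePolicy(RoundRobinPolicy):
    def make_query_plan(self, working_keyspace=None, query=None):
        # `replica` is a hypothetical attribute that the batching code
        # attaches to the statement after computing the token once.
        replica = getattr(query, 'replica', None)
        if replica is not None:
            yield replica
        # Fall back to plain round-robin over the remaining live hosts.
        for host in RoundRobinPolicy.make_query_plan(self, working_keyspace, query):
            if host is not replica:
                yield host

# Usage sketch: Cluster(load_balancing_policy=ReplicaAwarePolicy())
{code}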
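And a minimal sketch of the new process layout, with a dedicated feeder
process filling a bounded queue (file path, chunk size and queue bound are
illustrative):
{code:python}
# Sketch only: a feeder child process reads the file and fills a queue,
# leaving the parent free to spend its time receiving results and
# keeping the progress report up to date.
import csv
import multiprocessing as mp

CHUNK_SIZE = 5000  # rows per work item (illustrative)

def feed(path, work_queue, num_consumers):
    """Child process: read the CSV and enqueue chunks of rows."""
    with open(path) as f:
        chunk = []
        for row in csv.reader(f):
            chunk.append(row)
            if len(chunk) >= CHUNK_SIZE:
                work_queue.put(chunk)  # blocks if consumers fall behind
                chunk = []
        if chunk:
            work_queue.put(chunk)
    for _ in range(num_consumers):
        work_queue.put(None)  # sentinel: no more work

if __name__ == '__main__':
    queue = mp.Queue(maxsize=16)  # bounded, so the feeder cannot run away
    feeder = mp.Process(target=feed, args=('data.csv', queue, 1))
    feeder.start()
    # Here the parent acts as the single consumer; in cqlsh the chunks
    # would instead be consumed by the worker processes.
    rows, chunk = 0, queue.get()
    while chunk is not None:
        rows += len(chunk)
        chunk = queue.get()
    feeder.join()
    print('read %d rows' % rows)
{code}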
I am currently testing these two optimizations. Initial results are
encouraging and consistent with > 100k rows/sec. However, I've noticed some
timeouts when vnodes are enabled unless I increase the batch size; I am
investigating this.
--
bq. Curious, with the introduction of Cython, are you going to give the option
to build, target a specific platform, or build for many and include extensions
pre-built?
I've modified _setup.py_ with the option to build in place ({{python setup.py
build_ext --inplace}}). The idea is to write a blog post with instructions on
how to boost performance by using an installed cythonized driver ({{export
CQLSH_NO_BUNDLED=True}}) and by building _copyutil_. It's not my intention to
modify the existing package for cqlsh.
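To make the mechanism concrete, this is roughly how the driver selection
works in cqlsh (simplified sketch; the zip path is illustrative):
{code:python}
# Simplified sketch: when CQLSH_NO_BUNDLED is set, the bundled
# pure-Python driver zip is skipped and `import cassandra` resolves to
# the system-installed (e.g. cythonized) driver instead.
import os
import sys

if not os.environ.get('CQLSH_NO_BUNDLED', ''):
    sys.path.insert(0, 'lib/cassandra-driver-internal-only.zip')

import cassandra
{code}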
bq. I have visited message coalescing in the driver a couple times, and I
couldn't make it matter. It was in a more controlled environment than AWS, but
I was trying to emulate network delay using local intervention.
As far as I understand from [this blog
post|http://www.datastax.com/dev/blog/performance-doubling-with-message-coalescing],
it matters mostly in virtualized environments, especially on AWS without
enhanced networking. Another thing to note is that Nagle's algorithm is not
disabled; I did not appreciate this when I wrote my earlier comment.
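For reference, disabling Nagle's algorithm is a single socket option; a
generic example (not driver code):
{code:python}
# Generic example: with TCP_NODELAY set, small writes are sent
# immediately instead of being buffered until an ACK arrives.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
sock.connect(('127.0.0.1', 9042))  # 9042: Cassandra native protocol port
{code}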
> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
> Issue Type: Bug
> Components: Tools
> Reporter: Stefania
> Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt,
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt,
> worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY FROM on a large dataset (20G divided into 20M records) revealed
> two issues:
> * The progress report is incorrect: it advances very slowly until almost the
> end of the test, at which point it catches up extremely quickly.
> * The performance in rows per second is similar to that of smaller tests run
> locally against a smaller cluster (approx. 35,000 rows per second). As a
> comparison, cassandra-stress manages 50,000 rows per second under the same
> set-up, i.e. roughly 1.5 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)