[
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168411#comment-15168411
]
Stefania commented on CASSANDRA-11053:
--------------------------------------
bq. {{del cassandra.deserializers.DesBytesType}} causes the parser to default
back to the patched cqltypes.BytesType
That's interesting, and it definitely works. The performance of COPY TO for a
benchmark with only blobs drops from 150k rows/sec to about 120k locally, but
the opposite would probably be true for a benchmark with CQL composite types.
It would be very nice to remove the formatting changes from this patch,
especially if it needs to go to 2.1. I've got a [separate
branch|https://github.com/stef1927/cassandra/tree/11053-2.1-no-formatting]
without the formatting changes. I'm unsure what to do: parsing the CQL type is
safer, but it is bolted onto existing, simpler logic that relies only on
Python types, and it makes this patch more complex than it needs to be. WDYT?
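For context, here is a hypothetical contrast of the two approaches (illustrative names only, not the patch's actual code): the existing logic picks a formatter from the Python type of the value, whereas parsing the CQL type makes it possible to treat, say, a blob inside a composite differently from a plain string.
{code}
# Hypothetical sketch of the two dispatch strategies, not the patch itself.
def format_by_python_type(value):
    # Existing, simpler logic: choose a formatter from the Python type alone.
    formatters = {
        bytes: lambda b: '0x' + b.hex(),
        float: lambda f: '%.5g' % f,
    }
    return formatters.get(type(value), str)(value)

def format_by_cql_type(value, cql_type):
    # Patched idea: the parsed CQL type disambiguates values whose Python
    # type alone is not enough, e.g. a blob element inside a composite.
    if cql_type == 'blob':
        return '0x' + value.hex()
    return format_by_python_type(value)
{code}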
{quote}
*cqlshlib.formatting.get_sub_types:*
{code}
+        else:
+            if last < len(val) - 1:
+                ret.append(val[last:].strip())
{code}
{quote}
Fixed, thank you.
{quote}
*bin/cqlsh.Shell.print_static_result*
{code}
+        if table_meta:
+            cqltypes = [table_meta.columns[c].typestring if c in table_meta.columns else None for c in colnames]
{code}
There is an API change in driver 3.0 (C* cqlsh 2.2+) that will impact this.
{quote}
I'm aware of this; I believe all that's needed is to replace {{typestring}}
with {{cql_type}}.
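For reference, a sketch of that adjustment (assuming the driver 3.0 column metadata, where {{cql_type}} replaces {{typestring}}; otherwise it is the same list comprehension as above):
{code}
# Sketch only: same logic as the snippet above, with the driver 3.0
# attribute name (cql_type) in place of the pre-3.0 typestring.
if table_meta:
    cqltypes = [table_meta.columns[c].cql_type if c in table_meta.columns else None
                for c in colnames]
{code}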
bq. This brings us to the question of targeting 2.1. cqlsh in 2.1 was diverging
from 2.2+, and is even more so after CASSANDRA-10513 (2.1 did not receive the
driver 3.0 upgrade). I'm interested to hear the input on whether this should go
to 2.1.
I've asked offline about the target version; hopefully we'll know soon.
{quote}
*"fix progress report"*
It's part of the summary, but I don't see anything in the changeset related to
progress reporting. I ran an identical load with 2.1.13 and noticed that
progress samples
are much less frequent on this branch
{quote}
The progress report was fixed by two things:
* the worker processes now feed aggregated results only when an entire chunk is
completed, rather than for every batch; this dramatically decreased the number
of results to be collected and also explains the change in frequency of the
progress report. You will have noticed that the progress now increments by a
multiple of the chunk size rather than by the batch size. The report frequency
is still 4 times per second, but if no chunks were completed during an interval
the report will not change; this is expected.
* the introduction of the feeder process; the only job of the parent process is
now to collect results. Before, it was both sending data and collecting results
and, depending on the ingest rate and the polling sleep time, it could fall
behind schedule. This layout is sketched below.
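Here is a minimal, standalone sketch of that layout (all names are hypothetical; nothing below is the actual cqlsh copy code): a feeder process pushes chunks, workers put one aggregated result per completed chunk on a result queue, and the parent only drains results and reports progress roughly four times per second.
{code}
import multiprocessing as mp
import time

CHUNK_SIZE = 1000
NUM_CHUNKS = 50

def feeder(tasks, num_workers):
    # Only job: push work; the parent never touches this queue.
    for chunk_id in range(NUM_CHUNKS):
        tasks.put(chunk_id)
    for _ in range(num_workers):
        tasks.put(None)                    # poison pills

def worker(tasks, results):
    while True:
        chunk_id = tasks.get()
        if chunk_id is None:
            break
        time.sleep(0.01)                   # pretend to import CHUNK_SIZE rows
        results.put(CHUNK_SIZE)            # one aggregated result per chunk

if __name__ == '__main__':
    tasks, results = mp.Queue(), mp.Queue()
    num_workers = 4
    procs = [mp.Process(target=feeder, args=(tasks, num_workers))]
    procs += [mp.Process(target=worker, args=(tasks, results)) for _ in range(num_workers)]
    for p in procs:
        p.start()

    imported = 0
    while imported < NUM_CHUNKS * CHUNK_SIZE:
        while not results.empty():
            imported += results.get()
        print('progress: %d rows' % imported)
        time.sleep(0.25)                   # report ~4 times per second

    for p in procs:
        p.join()
{code}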
{quote}
*side note*
should we be using repr, or forcing high precision when doing copies to avoid
loss of precision (or providing a precision option for COPY FROM)?
{quote}
The problem isn't COPY FROM; it's COPY TO exporting with the precision of
cqlsh, which by default is too low. I've created CASSANDRA-11255 to add a new
COPY TO option, since this is not related to performance and it is definitely a
new feature.
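To illustrate the precision point (the 5-digit precision below is an assumption for the example, not necessarily the cqlsh default):
{code}
# Formatting a float with a low fixed precision loses digits that repr()
# keeps, so a COPY TO followed by COPY FROM can change the stored value.
value = 3.141592653589793
exported = '%.5g' % value       # low-precision, cqlsh-style formatting
reimported = float(exported)
print(exported)                 # 3.1416
print(repr(value))              # 3.141592653589793
print(reimported == value)      # False: precision was lost on export
{code}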
--
I'm undecided on two more things:
* the default INGESTRATE: 200k may be a bit too high, and I'm thinking of
changing it back to 100k, or maybe 120k-150k.
* the number of default worker processes is no longer capped; I think it is
safer to reintroduce the cap of 16, which people can override via NUMPROCESSES
(sketched below).
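Something along these lines for the cap (hypothetical helper, not committed code):
{code}
# Default to one worker per core minus one, but never more than 16, unless
# the user overrides the count explicitly via the NUMPROCESSES copy option.
import multiprocessing

def default_num_processes(user_value=None, cap=16):
    if user_value is not None:      # NUMPROCESSES was set explicitly
        return user_value
    return max(1, min(multiprocessing.cpu_count() - 1, cap))
{code}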
> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
> Issue Type: Bug
> Components: Tools
> Reporter: Stefania
> Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt,
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt,
> worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY FROM on a large dataset (20G divided into 20M records) revealed
> two issues:
> * The progress report is incorrect: it is very slow until almost the end of
> the test, at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with
> a smaller cluster locally (approx. 35,000 rows per second). As a comparison,
> cassandra-stress manages 50,000 rows per second under the same set-up, and is
> therefore about 1.5 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.