[
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168411#comment-15168411
]
Stefania commented on CASSANDRA-11053:
--------------------------------------
bq. {{del cassandra.deserializers.DesBytesType}} causes the parser to default
back to the patched cqltypes.BytesType
That's interesting, and it definitely works. The performance of COPY TO for a
benchmark with only blobs drops from 150k rows/sec to about 120k locally, but
the opposite would probably be true for a benchmark with CQL composite types.
It would be very nice to remove the formatting changes from this patch,
especially if it needs to go to 2.1. I've got a [separate
branch|https://github.com/stef1927/cassandra/tree/11053-2.1-no-formatting]
without the formatting changes. I'm unsure what to do: parsing the CQL type is
safer, but it is bolted onto existing, simpler logic that relies only on
Python types, and it makes this patch more complex than it needs to be. WDYT?
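For context, here is a hypothetical contrast of the two approaches (illustrative names only, not the patch's actual code): the existing logic picks a formatter from the Python type of the value, whereas parsing the CQL type makes it possible to treat, say, a blob inside a composite differently from a plain string.
{code}
# Hypothetical sketch of the two dispatch strategies, not the patch itself.
def format_by_python_type(value):
    # Existing, simpler logic: choose a formatter from the Python type alone.
    formatters = {
        bytes: lambda b: '0x' + b.hex(),
        float: lambda f: '%.5g' % f,
    }
    return formatters.get(type(value), str)(value)

def format_by_cql_type(value, cql_type):
    # Patched idea: the parsed CQL type disambiguates values whose Python
    # type alone is not enough, e.g. a blob element inside a composite.
    if cql_type == 'blob':
        return '0x' + value.hex()
    return format_by_python_type(value)
{code}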
{quote}
*cqlshlib.formatting.get_sub_types:*
{code}
+        else:
+            if last < len(val) - 1:
+                ret.append(val[last:].strip())
{code}
{quote}
Fixed, thank you.
{quote}
*bin/cqlsh.Shell.print_static_result*
{code}
+        if table_meta:
+            cqltypes = [table_meta.columns[c].typestring if c in table_meta.columns else None for c in colnames]
{code}
There is an API change in driver 3.0 (C* cqlsh 2.2+) that will impact this.
{quote}
I'm aware of this; I believe all that's needed is to replace {{typestring}}
with {{cql_type}}.
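For reference, a sketch of that adjustment (assuming the driver 3.0 column metadata, where {{cql_type}} replaces {{typestring}}; otherwise it is the same list comprehension as above):
{code}
# Sketch only: same logic as the snippet above, with the driver 3.0
# attribute name (cql_type) in place of the pre-3.0 typestring.
if table_meta:
    cqltypes = [table_meta.columns[c].cql_type if c in table_meta.columns else None
                for c in colnames]
{code}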
bq. This brings us to the question of targeting 2.1. cqlsh in 2.1 was diverging
from 2.2+, and is even more so after CASSANDRA-10513 (2.1 did not receive the
driver 3.0 upgrade). I'm interested to hear the input on whether this should go
to 2.1.
I've asked offline about the target version; hopefully we'll know soon.
{quote}
*"fix progress report"*
It's part of the summary, but I don't see anything in the changeset related to
progress reporting. I ran an identical load with 2.1.13 and noticed that
progress samples
are much less frequent on this branch
{quote}
The progress report was fixed by two things:
* the worker processes now feed aggregated results only when an entire chunk is
completed, rather than for every batch; this dramatically decreased the number
of results to be collected and also explains the change in frequency of the
progress report. You will have noticed that the progress now increments by a
multiple of the chunk size rather than by the batch size. The report frequency
is still 4 times per second, but if no chunks were completed during an interval
the report will not change; this is expected.
* the introduction of the feeder process; the only job of the parent process is
now to collect results. Before, it was both sending data and collecting results
and, depending on the ingest rate and the polling sleep time, it could fall
behind schedule. This layout is sketched below.
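Here is a minimal, standalone sketch of that layout (all names are hypothetical; nothing below is the actual cqlsh copy code): a feeder process pushes chunks, workers put one aggregated result per completed chunk on a result queue, and the parent only drains results and reports progress roughly four times per second.
{code}
import multiprocessing as mp
import time

CHUNK_SIZE = 1000
NUM_CHUNKS = 50

def feeder(tasks, num_workers):
    # Only job: push work; the parent never touches this queue.
    for chunk_id in range(NUM_CHUNKS):
        tasks.put(chunk_id)
    for _ in range(num_workers):
        tasks.put(None)                    # poison pills

def worker(tasks, results):
    while True:
        chunk_id = tasks.get()
        if chunk_id is None:
            break
        time.sleep(0.01)                   # pretend to import CHUNK_SIZE rows
        results.put(CHUNK_SIZE)            # one aggregated result per chunk

if __name__ == '__main__':
    tasks, results = mp.Queue(), mp.Queue()
    num_workers = 4
    procs = [mp.Process(target=feeder, args=(tasks, num_workers))]
    procs += [mp.Process(target=worker, args=(tasks, results)) for _ in range(num_workers)]
    for p in procs:
        p.start()

    imported = 0
    while imported < NUM_CHUNKS * CHUNK_SIZE:
        while not results.empty():
            imported += results.get()
        print('progress: %d rows' % imported)
        time.sleep(0.25)                   # report ~4 times per second

    for p in procs:
        p.join()
{code}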
{quote}
*side note*
should we be using repr, or forcing high precision when doing copies to avoid
loss of precision (or providing a precision option for COPY FROM)?
{quote}
The problem isn't COPY FROM; it's COPY TO exporting with the precision of
cqlsh, which by default is too low. I've created CASSANDRA-11255 to add a new
COPY TO option, since this is not related to performance and it is definitely a
new feature.
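To illustrate the precision point (the 5-digit precision below is an assumption for the example, not necessarily the cqlsh default):
{code}
# Formatting a float with a low fixed precision loses digits that repr()
# keeps, so a COPY TO followed by COPY FROM can change the stored value.
value = 3.141592653589793
exported = '%.5g' % value       # low-precision, cqlsh-style formatting
reimported = float(exported)
print(exported)                 # 3.1416
print(repr(value))              # 3.141592653589793
print(reimported == value)      # False: precision was lost on export
{code}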
--
I'm undecided on two more things:
* the default INGESTRATE: 200k may be a bit too high, and I'm thinking of
changing it back to 100k, or maybe 120k-150k.
* the number of default worker processes is no longer capped; I think it is
safer to reintroduce the cap of 16, which people can override via NUMPROCESSES
(sketched below).
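Something along these lines for the cap (hypothetical helper, not committed code):
{code}
# Default to one worker per core minus one, but never more than 16, unless
# the user overrides the count explicitly via the NUMPROCESSES copy option.
import multiprocessing

def default_num_processes(user_value=None, cap=16):
    if user_value is not None:      # NUMPROCESSES was set explicitly
        return user_value
    return max(1, min(multiprocessing.cpu_count() - 1, cap))
{code}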
> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
> Issue Type: Bug
> Components: Tools
> Reporter: Stefania
> Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt,
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt,
> worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY FROM on a large dataset (20G divided into 20M records) revealed
> two issues:
> * The progress report is incorrect: it is very slow until almost the end of
> the test, at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with
> a smaller cluster locally (approx. 35,000 rows per second). As a comparison,
> cassandra-stress manages 50,000 rows per second under the same set-up, and is
> therefore about 1.5 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.