[ https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169899#comment-15169899 ]
Adam Holmberg edited comment on CASSANDRA-11053 at 2/26/16 9:49 PM:
--------------------------------------------------------------------

bq. The performance of COPY TO for a benchmark with only blobs drops from 150k rows/sec to about 120k

I didn't expect it to be that punishing, since there's no deserialization happening there. That must just be the cost of the dispatch back to Python. Here's another option: I could build in another deserializer for BytesType that returns a bytearray. You would then patch it in as follows:
{code}
>>> deserializers.DesBytesType = deserializers.DesBytesTypeByteArray
>>> s.execute('select c from test.t limit 1')[0]
Row(c=bytearray(b'\xde\xad\xbe\xef'))
{code}
I can get it into the upcoming release if it would be useful for this integration.

bq. I'm unsure what to do: parsing the CQL type is safer but ...

I was also on the fence due to the new complexity. I think I favor the CQL type interpretation despite the complexity for one reason: it decouples formatting from driver return values. Those don't change often, but when they have required specialization for evolving feature support (set --> SortedSet, dict --> OrderedMap), the change would ripple into cqlsh. If we base formatting on the CQL type, that is avoided (a small sketch of the idea follows at the end of this comment).

bq. The progress report was fixed by two things...

Thanks. I figured out what my problem was: I was missing most of the diff because I had overlooked this notice on GitHub: "761 additions, 409 deletions not shown because the diff is too large." I have more to look at.

bq. I'm undecided on two more things...default INGESTRATE...default worker processes

I generally err on the side of caution. Reasonable limits prevent someone from inadvertently crushing a server with a basic command, and the command options make it easy enough to dial the rate up for big load operations (see the rate-limiter sketch below).
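To make the decoupling argument concrete, here is a minimal sketch of dispatching formatters on the declared CQL type rather than on the Python type the driver happens to return. It is illustrative only: the formatter names and the handful of CQL types covered are assumptions for this sketch, not the actual cqlsh code.
{code}
# Illustrative only -- not the actual cqlsh code. The formatter names and
# the CQL types covered here are assumptions made for this sketch.

def format_blob(value):
    # Accepts bytes, bytearray, or any buffer type the driver returns.
    return '0x' + bytes(value).hex()

def format_text(value):
    return "'%s'" % value

def format_set(value):
    # Accepts set, frozenset, or a driver-specific SortedSet alike:
    # dispatch never inspects type(value), only the declared CQL type.
    return '{' + ', '.join(sorted(format_value('text', v) for v in value)) + '}'

CQL_FORMATTERS = {
    'blob': format_blob,
    'text': format_text,
    'set<text>': format_set,
}

def format_value(cql_type, value):
    # cql_type comes from the table metadata, not from the value itself.
    return CQL_FORMATTERS.get(cql_type, str)(value)

print(format_value('blob', bytearray(b'\xde\xad')))   # 0xdead
print(format_value('set<text>', {'b', 'a'}))          # {'a', 'b'}
{code}
Because type(value) is never consulted, a driver release that starts returning SortedSet instead of set changes nothing in this table.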
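On the INGESTRATE default: the cap amounts to a simple client-side rate limiter. A rough sketch of the idea, assuming a fixed one-second window (the actual COPY FROM implementation may differ):
{code}
import time

# A rough sketch of the idea behind a default INGESTRATE cap, assuming a
# fixed one-second window; the actual COPY FROM implementation may differ.
class IngestRateLimiter(object):
    def __init__(self, max_rows_per_second):
        self.max_rows_per_second = max_rows_per_second
        self.window_start = time.time()
        self.rows_this_window = 0

    def throttle(self, batch_size):
        # Called before each batch is sent; sleeps out the remainder of the
        # current one-second window once its row quota is exhausted.
        now = time.time()
        if now - self.window_start >= 1.0:
            self.window_start = now
            self.rows_this_window = 0
        if self.rows_this_window + batch_size > self.max_rows_per_second:
            time.sleep(max(0.0, 1.0 - (now - self.window_start)))
            self.window_start = time.time()
            self.rows_this_window = 0
        self.rows_this_window += batch_size
{code}
A worker would construct the limiter once with the default rate and call throttle(len(batch)) before each send; raising INGESTRATE on the command line simply raises the quota.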
> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-11053
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>         Attachments: copy_from_large_benchmark.txt, copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY FROM on a large dataset (20G divided into 20M records) revealed two issues:
> * The progress report is incorrect: it is very slow until almost the end of the test, at which point it catches up extremely quickly.
> * The performance in rows per second is similar to that of smaller tests run against a smaller local cluster (approx. 35,000 rows per second). As a comparison, cassandra-stress manages 50,000 rows per second under the same set-up, making it 1.5 times faster.
>
> See the attached file _copy_from_large_benchmark.txt_ for the benchmark details.