[ https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169899#comment-15169899 ]
Adam Holmberg edited comment on CASSANDRA-11053 at 2/26/16 9:49 PM:
--------------------------------------------------------------------

bq. The performance of COPY TO for a benchmark with only blobs drops from 150k rows/sec to about 120k

I didn't expect it to be that punishing, since there's no deserialization happening there. That must just be the cost of the dispatch back to Python. Here's another option: I could build in another deserializer for BytesType that returns a bytearray. You would then patch it in as follows:
{code}
>>> deserializers.DesBytesType = deserializers.DesBytesTypeByteArray
>>> s.execute('select c from test.t limit 1')[0]
Row(c=bytearray(b'\xde\xad\xbe\xef'))
{code}
I can get it into the upcoming release if it would be useful for this integration.

bq. I'm unsure what to do: parsing the CQL type is safer but ...

I was also on the fence due to the new complexity. I think I favor the CQL type interpretation despite the complexity for one reason: it decouples formatting from driver return values. Those don't change often, but when they have required specialization for evolving feature support (set --> SortedSet, dict --> OrderedMap), the change would ripple into cqlsh. If we base formatting on the CQL type, that is avoided (a small sketch of the idea follows at the end of this comment).

bq. The progress report was fixed by two things...

Thanks. I figured out what my problem was: I was missing most of the diff because I had overlooked this notice on GitHub: "761 additions, 409 deletions not shown because the diff is too large." I have more to look at.

bq. I'm undecided on two more things...default INGESTRATE...default worker processes

I generally err on the side of caution. Reasonable limits prevent someone from inadvertently crushing a server with a basic command, and the command options make it easy enough to dial the rate up for big load operations (see the rate-limiter sketch below).
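To make the decoupling argument concrete, here is a minimal sketch of dispatching formatters on the declared CQL type rather than on the Python type the driver happens to return. It is illustrative only: the formatter names and the handful of CQL types covered are assumptions for this sketch, not the actual cqlsh code.
{code}
# Illustrative only -- not the actual cqlsh code. The formatter names and
# the CQL types covered here are assumptions made for this sketch.

def format_blob(value):
    # Accepts bytes, bytearray, or any buffer type the driver returns.
    return '0x' + bytes(value).hex()

def format_text(value):
    return "'%s'" % value

def format_set(value):
    # Accepts set, frozenset, or a driver-specific SortedSet alike:
    # dispatch never inspects type(value), only the declared CQL type.
    return '{' + ', '.join(sorted(format_value('text', v) for v in value)) + '}'

CQL_FORMATTERS = {
    'blob': format_blob,
    'text': format_text,
    'set<text>': format_set,
}

def format_value(cql_type, value):
    # cql_type comes from the table metadata, not from the value itself.
    return CQL_FORMATTERS.get(cql_type, str)(value)

print(format_value('blob', bytearray(b'\xde\xad')))   # 0xdead
print(format_value('set<text>', {'b', 'a'}))          # {'a', 'b'}
{code}
Because type(value) is never consulted, a driver release that starts returning SortedSet instead of set changes nothing in this table.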
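On the INGESTRATE default: the cap amounts to a simple client-side rate limiter. A rough sketch of the idea, assuming a fixed one-second window (the actual COPY FROM implementation may differ):
{code}
import time

# A rough sketch of the idea behind a default INGESTRATE cap, assuming a
# fixed one-second window; the actual COPY FROM implementation may differ.
class IngestRateLimiter(object):
    def __init__(self, max_rows_per_second):
        self.max_rows_per_second = max_rows_per_second
        self.window_start = time.time()
        self.rows_this_window = 0

    def throttle(self, batch_size):
        # Called before each batch is sent; sleeps out the remainder of the
        # current one-second window once its row quota is exhausted.
        now = time.time()
        if now - self.window_start >= 1.0:
            self.window_start = now
            self.rows_this_window = 0
        if self.rows_this_window + batch_size > self.max_rows_per_second:
            time.sleep(max(0.0, 1.0 - (now - self.window_start)))
            self.window_start = time.time()
            self.rows_this_window = 0
        self.rows_this_window += batch_size
{code}
A worker would construct the limiter once with the default rate and call throttle(len(batch)) before each send; raising INGESTRATE on the command line simply raises the quota.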
> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-11053
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>         Attachments: copy_from_large_benchmark.txt, copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY FROM on a large dataset (20G divided into 20M records) revealed two issues:
> * The progress report is incorrect: it is very slow until almost the end of the test, at which point it catches up extremely quickly.
> * The performance in rows per second is similar to that of smaller tests run against a smaller local cluster (approx. 35,000 rows per second). As a comparison, cassandra-stress manages 50,000 rows per second under the same set-up, making it 1.5 times faster.
>
> See the attached file _copy_from_large_benchmark.txt_ for the benchmark details.