[
https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169899#comment-15169899
]
Adam Holmberg commented on CASSANDRA-11053:
-------------------------------------------
bq. The performance of COPY TO for a benchmark with only blobs drops from 150k
rows/sec to about 120k
I didn't expect it to be that punishing since there's no deserialization
happening there. That must just be the cost of the dispatch back to Python.
Here's another option: I could build in another deserializer for BytesType that
returns a bytearray. You would then patch it in as follows:
{code}
>>> from cassandra.cluster import Cluster
>>> from cassandra import deserializers
>>> s = Cluster().connect()
>>> deserializers.DesBytesType = deserializers.DesBytesTypeByteArray
>>> s.execute('select c from test.t limit 1')[0]
Row(c=bytearray(b'\xde\xad\xbe\xef'))
{code}
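For the cqlsh integration the patch would presumably need to be applied
defensively, since {{DesBytesTypeByteArray}} would only exist from that release
onward. A minimal sketch, assuming only the names above:
{code}
# Guarded monkey-patch: fall back silently on driver builds that don't
# ship the Cython deserializers or the new bytearray variant.
try:
    from cassandra import deserializers
except ImportError:
    deserializers = None  # pure-Python driver build
if deserializers is not None and hasattr(deserializers, 'DesBytesTypeByteArray'):
    deserializers.DesBytesType = deserializers.DesBytesTypeByteArray
{code}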
I can get it in the upcoming release if it would be useful for this integration.
bq. I'm unsure what to do: parsing the CQL type is safer but ...
I was also on the fence due to the new complexity. I think I favor the CQL type
interpretation despite the complexity, for one reason: it decouples formatting
from driver return values. Those don't change often, but when they have required
specialization for evolving feature support (set --> SortedSet,
dict --> OrderedMap), the change would ripple into cqlsh. If we base formatting
on the CQL type, that is avoided.
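To make the distinction concrete, here is a rough sketch (illustrative names
only, not the actual cqlsh code) of dispatch keyed on the parsed CQL type name
rather than on the Python type the driver happens to return:
{code}
def format_blob(val):
    # Accepts bytes, bytearray or buffer alike.
    return '0x' + ''.join('%02x' % b for b in bytearray(val))

def format_set(val):
    # Works whether the driver returns set or SortedSet.
    return '{' + ', '.join(repr(v) for v in sorted(val)) + '}'

FORMATTERS_BY_CQL_TYPE = {'blob': format_blob, 'set': format_set}

def format_value(cql_typename, val):
    # The formatter is chosen by CQL type, so a change in the driver's
    # return types never ripples into the formatting code.
    return FORMATTERS_BY_CQL_TYPE[cql_typename](val)
{code}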
bq. The progress report was fixed by two things...
Thanks. I figured out what my problem was: I was missing most of the diff
because I had overlooked this notice on GitHub: "761 additions, 409 deletions
not shown because the diff is too large." I have more to look at.
bq. I'm undecided on two more things...default INGESTRATE...default worker
processes
I generally err on the side of caution. Reasonable limits would prevent someone
from inadvertently crushing a server with a basic command. The command options
make it easy enough to dial up for big load operations.
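For example (keyspace, table and values here are made up, just to show the
knobs):
{code}
cqlsh> COPY ks.t FROM 'data.csv' WITH INGESTRATE = 200000 AND NUMPROCESSES = 16;
{code}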
> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
> Key: CASSANDRA-11053
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
> Project: Cassandra
> Issue Type: Bug
> Components: Tools
> Reporter: Stefania
> Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
> Attachments: copy_from_large_benchmark.txt,
> copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt,
> worker_profiles.txt, worker_profiles_2.txt
>
>
> Running COPY FROM on a large dataset (20 GB divided into 20M records)
> revealed two issues:
> * The progress report is incorrect: it advances very slowly until almost the
> end of the test, at which point it catches up extremely quickly.
> * The performance in rows per second is similar to that of smaller tests run
> locally against a smaller cluster (approx. 35,000 rows per second). As a
> comparison, cassandra-stress manages 50,000 rows per second under the same
> set-up, i.e. roughly 1.5 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.