[
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800851#comment-13800851
]
Aleksey Yeschenko commented on CASSANDRA-6134:
----------------------------------------------
bq. It runs every 1/2 of the write timeout and replays all batches written within
0.9 * write timeout from now. This way we ensure that batched updates will be
replayed by the moment the client times out from the coordinator.
This isn't what we want to ensure, though. The current timeout (write timeout *
2) is there to account for maximum batchlog write timeout + actual data write
timeout. Avoiding extra mutations is IMO more important than having less delay
in the failure scenario (and slow writes would happen more often than outright
failures). And you definitely don't want to hammer an already slow node with
twice the load. So -1 on this particular change.
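To spell out the arithmetic (a sketch only; the constant name and the 2000 ms value
are illustrative, not the actual config):
{code}
// illustrative numbers only; the real write request timeout comes from the config
class ReplayThresholdSketch
{
    static final long WRITE_RPC_TIMEOUT_MS = 2000;

    // current behaviour: a batch only becomes eligible for replay once it is older
    // than batchlog write timeout + data write timeout = 2 * write timeout, so a
    // merely slow (not failed) coordinator doesn't get its mutations re-sent
    static final long CURRENT_REPLAY_THRESHOLD_MS = 2 * WRITE_RPC_TIMEOUT_MS;             // 4000 ms

    // proposed behaviour: replay after roughly 0.9 * write timeout instead, which can
    // start re-sending mutations while the original slow write is still in flight
    static final long PROPOSED_REPLAY_THRESHOLD_MS = (long) (0.9 * WRITE_RPC_TIMEOUT_MS); // 1800 ms
}
{code}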
bq. It submits all mutations from a single batch in parallel (like StorageProxy
does). The old implementation replayed them one-by-one, so the client can see
half-applied batches in the CF for a long time (depending on the size of the batch).
This is fine. But yeah, we could/should parallelize batchlog replay more (can
be done w/out modifying the schema).
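Something along these lines, schema untouched (just a sketch, not the actual
BatchlogManager/StorageProxy code; Mutation and sendMutation() are placeholders for
the real write path):
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class ParallelBatchReplaySketch
{
    interface Mutation {}

    static void sendMutation(Mutation m)
    {
        // placeholder: hand the mutation to the normal write path, hint it on failure
    }

    static void replayBatch(List<Mutation> mutations)
    {
        // fire all mutations of the batch concurrently instead of one-by-one
        List<CompletableFuture<Void>> inFlight = new ArrayList<>();
        for (Mutation m : mutations)
            inFlight.add(CompletableFuture.runAsync(() -> sendMutation(m)));

        // the batch only counts as replayed once every mutation has completed,
        // so clients stop seeing half-applied batches for long stretches
        CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
    }
}
{code}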
bq. It fixes a subtle race condition bug with incorrect hint TTL calculation
Care to elaborate? I think there was a tricky open bug related to this, but I
can't find the JIRA #.
To avoid random reads, we could read the mutation blob in
replayAllFailedBatches() and pass it to replayBatch() (I thought we were
already doing that). To make replay more async, as you suggest, we could read
several batches and initiate their replay asynchronously instead of replaying
them one by one (but w/ RateLimiter in place).
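Roughly like this (a sketch with made-up helper names, not the real API; the point
is reading the blob once, keeping a few batches in flight, and throttling with
Guava's RateLimiter):
{code}
import com.google.common.util.concurrent.RateLimiter;

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;

class ThrottledReplaySketch
{
    static class BatchRow
    {
        final UUID id;
        final ByteBuffer data; // mutation blob already read in replayAllFailedBatches()
        BatchRow(UUID id, ByteBuffer data) { this.id = id; this.data = data; }
    }

    // cap replay throughput; the 256 rows/second figure is purely illustrative
    private final RateLimiter limiter = RateLimiter.create(256.0);

    void replayAllFailedBatches(List<BatchRow> rows)
    {
        List<CompletableFuture<Void>> inFlight = new ArrayList<>();
        for (BatchRow row : rows)
        {
            limiter.acquire(); // throttle so replay can't swamp the cluster
            // pass the already-read blob along, so no second random read per batch
            inFlight.add(CompletableFuture.runAsync(() -> replayBatch(row.id, row.data)));
        }
        CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
    }

    void replayBatch(UUID id, ByteBuffer serializedMutations)
    {
        // placeholder: deserialize and send the mutations (as in the sketch above)
    }
}
{code}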
To avoid iterating over the already replayed batches (tombstones), we could
purge the replayed batches directly from the memtable (although I'd need to see
a benchmark proving that it's worth doing first).
Other stuff, in no particular order:
- making the table COMPACT STORAGE limits our flexibility wrt future batchlog
schema changes, so -1 on that
- we should probably rate-limit batchlog replay w/ RateLimiter
- +1 on moving forceBatchlogReplay() to batchlogTasks as well (this was an
omission from CASSANDRA-6079, ninja-committed it in
7e057f504613e68082a76642983d353f3f0400fb)
- +1 on running cleanup() on startup
- -1 on deriving writeTime from the UUID for the TTL calculation (the clock can
actually jump, but UUIDs will always increase, and that's not what we want for
the TTL calc); see the sketch below
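On that last point, the kind of calculation I mean (names are illustrative, not the
actual BatchlogManager code): base the remaining hint TTL on the wall-clock time
elapsed since the batch row was written, not on the timestamp packed into the
batch's timeuuid.
{code}
import java.util.concurrent.TimeUnit;

class HintTtlSketch
{
    // writtenAtMillis should come from the batchlog row's own write time,
    // not from the time embedded in the batch's timeuuid
    static int remainingHintTtlSeconds(int baseTtlSeconds, long writtenAtMillis, long nowMillis)
    {
        long elapsedSeconds = TimeUnit.MILLISECONDS.toSeconds(nowMillis - writtenAtMillis);
        return (int) Math.max(0, baseTtlSeconds - elapsedSeconds);
    }
}
{code}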
In general:
I like some of the suggested changes, and would like to see the ones that are
possible w/out the schema change implemented first (and many of them can be).
I'm strongly against altering the batchlog schema unless benchmarks can clearly
prove that the version with the partitioned schema is significantly better than
what we could come up with without altering the schema. We should avoid any
potentially brittle/breaking extra migration code on the already slow-ish
startup.
Could you give it a try, [~m0nstermind]? Namely,
- replaying several mutations read in replayAllFailedBatches() simultaneously
instead of 1-by-1
- avoiding the random read by passing the read blob to replayBatch()
- measuring the effect of purging the replayed batch from the memtable (when not
read from the disk)
If this gives us most of the win of a version with the altered schema, then
I'll be satisfied with just those changes. If benchmarks say that we have a lot
of extra relative and absolute efficiency to gain from the schema change, then I
won't argue with the data.
> More efficient BatchlogManager
> ------------------------------
>
> Key: CASSANDRA-6134
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6134
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Oleg Anastasyev
> Priority: Minor
> Attachments: BatchlogManager.txt
>
>
> As we discussed earlier in CASSANDRA-6079, this is the new BatchlogManager.
> It stores batch records in
> {code}
> CREATE TABLE batchlog (
>     id_partition int,
>     id timeuuid,
>     data blob,
>     PRIMARY KEY (id_partition, id)
> ) WITH COMPACT STORAGE AND
> CLUSTERING ORDER BY (id DESC)
> {code}
> where id_partition is the minute-since-epoch of the id uuid.
> So when it scans for batches to replay, it scans within a single partition for
> a slice of ids from the last processed date until now minus the write timeout.
> So neither a full batchlog CF scan nor lots of random reads are made on a
> normal cycle.
> Other improvements:
> 1. It runs every 1/2 of the write timeout and replays all batches written within
> 0.9 * write timeout from now. This way we ensure that batched updates will
> be replayed by the moment the client times out from the coordinator.
> 2. It submits all mutations from a single batch in parallel (like StorageProxy
> does). The old implementation replayed them one-by-one, so the client can see
> half-applied batches in the CF for a long time (depending on the size of the batch).
> 3. It fixes a subtle race condition bug with incorrect hint TTL calculation
--
This message was sent by Atlassian JIRA
(v6.1#6144)