[
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802135#comment-13802135
]
Oleg Anastasyev commented on CASSANDRA-6134:
--------------------------------------------
bq. This isn't what we want to ensure though. The current timeout (write
timeout * 2) is there to account for maximum batchlog write timeout + actual
data write timeout. Avoiding extra mutations is IMO more important than having
less delay in the failure scenario (and slow writes would happen more often
than outright failures).
As we discussed earlier, the whole batchlog thing makes little sense if clients
cannot read their own writes. Consider a client whose batch was written to the
batchlog quickly but which then timed out from the coordinator with the batch
half applied. Reading from another coordinator, it would see the batch
partially applied for almost another full write timeout. So just using write
timeout*2 is not a good idea. On the other hand, "hammering" is a one-by-one
replay of unplayed mutations; I don't think this could be an issue in practice.
+1 on having a RateLimiter there, so hammering can be bounded.
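For illustration, a minimal sketch of such a throttled replay loop using
Guava's RateLimiter (the class names and the 1024 mutations/second cap are my
assumptions, not values from the patch):
{code}
import java.util.List;
import com.google.common.util.concurrent.RateLimiter;

// Hypothetical sketch of a rate-limited batchlog replay loop. Mutation is a
// stand-in type; the real patch may shape this differently.
class ThrottledReplay
{
    interface Mutation { void apply(); }

    // token-bucket limiter: at most 1024 replayed mutations per second (assumed cap)
    private final RateLimiter limiter = RateLimiter.create(1024.0);

    void replayAll(List<Mutation> unplayed)
    {
        for (Mutation m : unplayed)
        {
            limiter.acquire(); // blocks until a permit is free, bounding the replay rate
            m.apply();         // replay a single unplayed mutation
        }
    }
}
{code}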
bq. -1 on using writeTime for TTL calculation from the UUID (the time can
actually jump, but uuids will always increase, and it's not what we want for
TTL calc)
Do you mean the time jumping when an operator forcibly changes the time on a
machine, or some other scenario?
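For context, a minimal illustration (not the patch's code) of how a timeuuid's
embedded write time is recovered, and how a backwards clock step would skew a
TTL computed against the wall clock; the method names here are stand-ins:
{code}
import java.util.UUID;

// Illustrative only: a version-1 (time-based) UUID carries its timestamp as
// 100-ns intervals since 1582-10-15, which converts to unix milliseconds.
class UuidTime
{
    static final long START_EPOCH = -12219292800000L; // 1582-10-15 in unix millis

    static long unixTimestamp(UUID id)
    {
        return id.timestamp() / 10000 + START_EPOCH;
    }

    static int hintTtlSeconds(UUID batchId, int maxTtlSeconds)
    {
        // If the OS clock is stepped backwards, currentTimeMillis() can be
        // smaller than the uuid-derived write time, producing a negative
        // "age" and an inflated TTL.
        long ageMillis = System.currentTimeMillis() - unixTimestamp(batchId);
        return (int) Math.max(0, maxTtlSeconds - ageMillis / 1000);
    }
}
{code}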
bq. making the table COMPACT STORAGE limits our flexibility wrt future batchlog
schema changes, so -1 on that
Using it without COMPACT STORAGE will add 2x overhead in memory and on disk. Is
supporting schema changes really necessary? I have not noticed any changes to
the original batchlog structure since the very beginning.
bq. We should avoid any potentially brittle/breaking extra migration code on
the already slow-ish startup.
Um, I was not thinking about migrating old batchlog records on startup. This
cannot be done, because old-version nodes will continue to write old-format
batchlog entries while the operator rolling-upgrades the cluster. What I was
thinking of is having BatchlogManagerOld read from the old batchlog CF and
replay batches the old way, and having BatchlogManager read from the new
batchlog2 CF and replay batches the new way. As soon as all nodes are upgraded
they start writing to the new batchlog2 CF, so once BatchlogManagerOld has
processed all old records it reads nothing from the old batchlog CF and
basically does a NOP cycle every 60 secs. So the migration is not such a big
deal that we should avoid changing the batchlog structure.
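A rough sketch of that dual-manager scheme (all class names and periods here
are assumptions for illustration, not the patch's actual API):
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical: run both replayers side by side during a rolling upgrade.
// The old replayer drains the legacy batchlog CF (a NOP once it is empty);
// the new one serves batchlog2.
class DualReplay
{
    static void schedule(Runnable oldReplayer, Runnable newReplayer)
    {
        ScheduledExecutorService timer = Executors.newScheduledThreadPool(2);
        // legacy path: same 60s cycle as the old BatchlogManager
        timer.scheduleWithFixedDelay(oldReplayer, 60, 60, TimeUnit.SECONDS);
        // new path: every write_timeout / 2 (assumed 5s here)
        timer.scheduleWithFixedDelay(newReplayer, 5, 5, TimeUnit.SECONDS);
    }
}
{code}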
> More efficient BatchlogManager
> ------------------------------
>
> Key: CASSANDRA-6134
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6134
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Oleg Anastasyev
> Priority: Minor
> Attachments: BatchlogManager.txt
>
>
> As we discussed earlier in CASSANDRA-6079, this is the new BatchlogManager.
> It stores batch records in
> {code}
> CREATE TABLE batchlog (
>     id_partition int,
>     id timeuuid,
>     data blob,
>     PRIMARY KEY (id_partition, id)
> ) WITH COMPACT STORAGE
>   AND CLUSTERING ORDER BY (id DESC)
> {code}
> where id_partition is the minute-since-epoch of the id timeuuid.
> So when it scans for batches to replay, it scans within a single partition
> for a slice of ids from the last processed position up to now minus the write
> timeout. So no full batchlog CF scan and no flood of random reads happen on a
> normal replay cycle.
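> A rough sketch of such a replay scan (illustrative only; fetchSlice and the
> class names are stand-ins, not code from the attached patch):
> {code}
> // Walk minute partitions from the last replayed position up to
> // (now - write timeout), reading one bounded slice per partition
> // instead of scanning the whole batchlog CF.
> class ReplayScan
> {
>     interface Store { void fetchSlice(int partition, long fromMillis, long toMillis); }
>
>     static void scan(Store store, long lastReplayedMillis, long writeTimeoutMillis)
>     {
>         long upper = System.currentTimeMillis() - writeTimeoutMillis;
>         for (long minute = lastReplayedMillis / 60000; minute <= upper / 60000; minute++)
>         {
>             // one single-partition slice query per minute bucket; no random reads
>             store.fetchSlice((int) minute, lastReplayedMillis, upper);
>         }
>     }
> }
> {code}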
> Other improvements:
> 1. It runs every 1/2 of the write timeout and replays all batches written
> more than 0.9 * write timeout ago. This way we ensure that batched updates
> will be replayed by the moment the client times out from the coordinator.
> 2. It submits all mutations from a single batch in parallel (like
> StorageProxy does); see the sketch after this list. The old implementation
> replayed them one by one, so a client could see half-applied batches in the
> CF for a long time (depending on the size of the batch).
> 3. It fixes a subtle race bug with an incorrect hint TTL calculation.
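> A minimal sketch of the parallel submission in item 2 (names are stand-ins,
> not the patch's API):
> {code}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.*;
>
> // Fan out every mutation of a batch at once and wait for all of them,
> // instead of replaying one by one.
> class ParallelBatchReplay
> {
>     static void replay(ExecutorService stage, List<Runnable> mutations)
>             throws InterruptedException, ExecutionException
>     {
>         List<Future<?>> futures = new ArrayList<>();
>         for (Runnable m : mutations)
>             futures.add(stage.submit(m)); // submit all writes in parallel
>         for (Future<?> f : futures)
>             f.get(); // block until the whole batch is applied
>     }
> }
> {code}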
--
This message was sent by Atlassian JIRA
(v6.1#6144)