[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802135#comment-13802135 ]

Oleg Anastasyev commented on CASSANDRA-6134:
--------------------------------------------

bq. This isn't what we want to ensure though. The current timeout (write 
timeout * 2) is there to account for maximum batchlog write timeout + actual 
data write timeout. Avoiding extra mutations is IMO more important than having 
less delay in the failure scenario (and slow writes would happen more often 
than outright failures). 

As we discussed earlier, the whole batchlog thing makes little sense if clients 
cannot read their own writes. Consider a client that wrote to the batchlog very 
fast but timed out from the coordinator with the batch half applied. Reading 
through another coordinator, it would see the batch partially applied for almost 
yet another write timeout. So just having write timeout * 2 is not a good idea. 
On the other hand, "hammering" is just the one-by-one replay of unplayed 
mutations; I don't think this could be an issue in practice. +1 on having a 
RateLimiter there, so hammering can be limited even further.
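
A minimal sketch of such a throttled one-by-one replay, assuming Guava's 
RateLimiter (which Cassandra already bundles); Replayable and the 
permits-per-second cap are hypothetical stand-ins, not the actual patch:

{code}
import com.google.common.util.concurrent.RateLimiter;
import java.util.List;

public class ThrottledReplay
{
    // Hypothetical cap: at most 1024 replayed mutations per second.
    private final RateLimiter limiter = RateLimiter.create(1024.0);

    interface Replayable { void replay(); }

    void replayAll(List<Replayable> unplayed)
    {
        for (Replayable r : unplayed)
        {
            limiter.acquire(); // blocks until a permit is free, smoothing the "hammering"
            r.replay();
        }
    }
}
{code}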

bq. -1 on using writeTime for TTL calculation from the UUID (the time can 
actually jump, but uuids will always increase, and it's not what we want for 
TTL calc)

Do you mean the time jumping when an operator forcibly changes the clock on a 
machine, or some other scenario?
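
For reference, a minimal sketch of deriving the write time from a version-1 
(time) UUID; the hintTtl helper and gcGraceSeconds parameter are illustrative, 
not the patch's actual code:

{code}
import java.util.UUID;

public class TimeUuidWriteTime
{
    // Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100ns units.
    private static final long UUID_EPOCH_OFFSET = 0x01b21dd213814000L;

    // Milliseconds since the Unix epoch encoded in a version-1 UUID.
    static long unixMillisOf(UUID timeUuid)
    {
        return (timeUuid.timestamp() - UUID_EPOCH_OFFSET) / 10000;
    }

    // TTL based on the uuid-encoded write time: monotonic as ids grow, but
    // off by however much the wall clock jumped between write and replay.
    static int hintTtl(UUID batchId, int gcGraceSeconds)
    {
        long ageSeconds = (System.currentTimeMillis() - unixMillisOf(batchId)) / 1000;
        return (int) Math.max(0L, gcGraceSeconds - ageSeconds);
    }
}
{code}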

bq. making the table COMPACT STORAGE limits our flexibility wrt future batchlog 
schema changes, so -1 on that

Using it without COMPACT STORAGE will add 2x to memory and disk usage. Is 
supporting changes really necessary? I have not noticed any changes to the 
original structure since the very beginning of the batchlog.

bq. We should avoid any potentially brittle/breaking extra migration code on 
the already slow-ish startup.

Um, I was not thinking about migrating old batchlog records on startup. This 
cannot be done, because old-version nodes will continue to write old-format 
batchlog entries while the operator rolling-upgrades the cluster. What I was 
thinking of is having BatchlogManagerOld read from the old batchlog CF and 
replay batches the old way, and having BatchlogManager read from the new 
batchlog2 CF and replay batches the new way. As soon as all nodes are upgraded 
they start to write to the new batchlog2 CF, so BatchlogManagerOld, after it 
has processed all old records, reads nothing from the old batchlog CF and 
basically does a NOP cycle every 60 secs. So the migration is not a big enough 
deal to justify avoiding such a change to the batchlog structure.
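
A minimal sketch of that side-by-side arrangement (class and method names, and 
the periods, are hypothetical, not the patch itself):

{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DualBatchlogReplay
{
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

    void start()
    {
        // Old-format replayer: drains pre-upgrade entries; once the rolling
        // upgrade finishes and the old CF is empty, each cycle is a NOP.
        scheduler.scheduleWithFixedDelay(this::replayOldBatchlog, 60, 60, TimeUnit.SECONDS);
        // New-format replayer against the batchlog2 CF.
        scheduler.scheduleWithFixedDelay(this::replayBatchlog2, 30, 30, TimeUnit.SECONDS);
    }

    void replayOldBatchlog() { /* scan the old batchlog CF, replay the old way */ }
    void replayBatchlog2()   { /* scan batchlog2 minute partitions, replay the new way */ }
}
{code}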

> More efficient BatchlogManager
> ------------------------------
>
>                 Key: CASSANDRA-6134
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6134
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Oleg Anastasyev
>            Priority: Minor
>         Attachments: BatchlogManager.txt
>
>
> As we discussed earlier in CASSANDRA-6079, this is the new BatchlogManager.
> It stores batch records in
> {code}
> CREATE TABLE batchlog (
>   id_partition int,
>   id timeuuid,
>   data blob,
>   PRIMARY KEY (id_partition, id)
> ) WITH COMPACT STORAGE AND
>   CLUSTERING ORDER BY (id DESC)
> {code}
> where id_partition is the minute-since-epoch of the id uuid. 
> So when it scans for batches to replay, it scans within a single partition for 
> a slice of ids from the last processed date up to now minus the write timeout.
> So neither a full batchlog CF scan nor a lot of random reads happen on a 
> normal cycle. 
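>
> A minimal sketch of that replay pass (hypothetical names, not the patch 
> itself): walk the minute buckets from the last replayed minute up to now 
> minus the write timeout, issuing one slice query per partition:
> {code}
> import java.util.concurrent.TimeUnit;
>
> public class PartitionedReplayPass
> {
>     void replaySince(long lastReplayedMinute, long writeTimeoutMillis)
>     {
>         long endMinute = TimeUnit.MILLISECONDS.toMinutes(System.currentTimeMillis() - writeTimeoutMillis);
>         for (long minute = lastReplayedMinute; minute <= endMinute; minute++)
>             replayPartition(minute); // one slice query within a single id_partition
>     }
>
>     void replayPartition(long idPartition) { /* SELECT ... WHERE id_partition = ? ORDER BY id */ }
> }
> {code}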
> Other improvements:
> 1. It runs every 1/2 of the write timeout and replays all batches older than 
> 0.9 * the write timeout. This way we ensure that batched updates will be 
> replayed by the moment the client times out from the coordinator.
> 2. It submits all mutations from a single batch in parallel (like StorageProxy 
> does); see the sketch after this list. The old implementation played them one 
> by one, so a client could see half-applied batches in the CF for a long time 
> (depending on the size of the batch).
> 3. It fixes a subtle race bug with an incorrect hint TTL calculation.
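>
> As referenced in item 2, a minimal sketch of the parallel submission 
> (hypothetical names, not the patch itself): fire all writes first, then wait 
> on them together:
> {code}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.Future;
>
> public class ParallelBatchReplay
> {
>     private final ExecutorService executor = Executors.newCachedThreadPool();
>
>     void replay(List<Runnable> mutations) throws Exception
>     {
>         List<Future<?>> inFlight = new ArrayList<>();
>         for (Runnable m : mutations)
>             inFlight.add(executor.submit(m)); // submit every mutation up front
>         for (Future<?> f : inFlight)
>             f.get(); // wait afterwards, so the batch applies roughly together
>     }
> }
> {code}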


