Thanks for the quick response, Jun. Excuse my delayed reply, but I
wanted to confirm a few things regarding the IBP. See my comments below.

Here are my changes to the KIP:
https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=158864763&selectedPageVersions=30&selectedPageVersions=28

> 40. LBO: Code wise, logStartOffset is used in so many places like Log,
> ReplicaManager, LogCleaner, ReplicaFetcher, checkpoint files, etc. I am not
> sure if it's worth renaming in all those places. If the main concern is to
> avoid confusion, we can just always spell out logStartOffset.

Done. Keeping it as LogStartOffset is better. I was also concerned
with external tools that may be generating code from the JSON schema.

> 41. metadata.snapshot.min.cleanable.ratio: Since the snapshot could be
> empty initially, it's probably better to define the ratio as new_data_size
> / (new_data_size + snapshot_size). This avoids the dividing by zero issue
> and is also consistent with the cleaner ratio definition for compacted
> topics.

I am assuming that snapshot_size is the size of the largest snapshot
on disk; is that correct? If we use this formula, then we will generate
snapshots very quickly whenever the snapshot on disk is empty or very
small.

In general, what we care about is whether the replicated log has a lot
of records that delete or update records in the snapshot. I was
thinking of something along the lines of the following formula:

ratio = (size of delete records + size of update records) / (total size
of the snapshot), when the total size of the snapshot is greater than zero

ratio = 0, when the total size of the snapshot is zero

This means that in the extreme case where the replicated log contains
only "addition" records, we never generate a snapshot. I think this is
the desired behavior, since generating a snapshot in that case would
consume disk bandwidth without saving any disk space. What do you think?
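
To make this concrete, here is a rough sketch in Java of the check I
have in mind. The names used here (deletedRecordsBytes,
updatedRecordsBytes, totalSnapshotBytes, minCleanableRatio) are
placeholders for this discussion, not a proposal for the final API:

    // Rough sketch: decide whether to generate a new snapshot based on how
    // much of the existing snapshot has been invalidated by the replicated log.
    final class SnapshotRatioCheck {
        // Bytes in the replicated log, after the latest snapshot, that delete
        // records present in the snapshot.
        private final long deletedRecordsBytes;
        // Bytes in the replicated log that update records present in the snapshot.
        private final long updatedRecordsBytes;
        // Total size in bytes of the largest snapshot on disk.
        private final long totalSnapshotBytes;

        SnapshotRatioCheck(long deletedRecordsBytes,
                           long updatedRecordsBytes,
                           long totalSnapshotBytes) {
            this.deletedRecordsBytes = deletedRecordsBytes;
            this.updatedRecordsBytes = updatedRecordsBytes;
            this.totalSnapshotBytes = totalSnapshotBytes;
        }

        double cleanableRatio() {
            // When there is no snapshot yet the ratio is defined as 0, which
            // avoids dividing by zero and avoids generating snapshots just
            // because the snapshot on disk is empty or very small.
            if (totalSnapshotBytes == 0) {
                return 0.0;
            }
            return (double) (deletedRecordsBytes + updatedRecordsBytes) / totalSnapshotBytes;
        }

        boolean shouldGenerateSnapshot(double minCleanableRatio) {
            // A log that contains only "addition" records keeps the ratio at 0,
            // so no snapshot is generated and no disk bandwidth is spent.
            return cleanableRatio() >= minCleanableRatio;
        }
    }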

>
> 42. FetchSnapshotRequest: Since this is designed to fetch more than one
> partition, it seems it's useful to have a per-partition maxBytes, in
> addition to the request level maxBytes, just like a Fetch request?

Yeah, we debated this in another thread started by Jason. The argument
is that MaxBytes at the top level is all we need if we implement the
following heuristic (sketched in Java below):

1. Compute the average max bytes per partition by dividing MaxBytes by
the number of partitions in the request.
2. For every partition whose remaining bytes are less than this average,
send all of its remaining bytes and add the unused part of its budget
to a running total.
3. For the remaining partitions, send at most the average max bytes
plus an even share of that unused budget.

Note that this heuristic only needs to be performed once per request,
not at worst N times for N partitions.
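
Here is a rough sketch in Java of that allocation; PartitionSnapshot
and remainingBytes() are made-up names for this discussion:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Rough sketch of the single-pass MaxBytes allocation described above.
    final class SnapshotByteAllocator {
        // Returns the per-partition byte limit for one FetchSnapshot request.
        static Map<PartitionSnapshot, Long> allocate(List<PartitionSnapshot> partitions, long maxBytes) {
            Map<PartitionSnapshot, Long> limits = new HashMap<>();
            long averageMaxBytes = maxBytes / partitions.size();

            // Steps 1 and 2: partitions with fewer remaining bytes than the
            // average get everything they have; track the unused budget.
            long unusedBudget = 0;
            List<PartitionSnapshot> largePartitions = new ArrayList<>();
            for (PartitionSnapshot partition : partitions) {
                long remaining = partition.remainingBytes();
                if (remaining < averageMaxBytes) {
                    limits.put(partition, remaining);
                    unusedBudget += averageMaxBytes - remaining;
                } else {
                    largePartitions.add(partition);
                }
            }

            // Step 3: the remaining partitions get the average plus an even
            // share of the budget left over from the small partitions.
            if (!largePartitions.isEmpty()) {
                long extra = unusedBudget / largePartitions.size();
                for (PartitionSnapshot partition : largePartitions) {
                    limits.put(partition, averageMaxBytes + extra);
                }
            }

            return limits;
        }
    }

    interface PartitionSnapshot {
        long remainingBytes();
    }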

What do you think? Besides consistency with Fetch requests, is there
another reason to have MaxBytes per partition?

> 43. FetchSnapshotResponse:
> 43.1 I think the confusing part for OFFSET_OUT_OF_RANGE is
> that FetchSnapshotRequest includes EndOffset. So, OFFSET_OUT_OF_RANGE seems
> to suggest that the provided EndOffset is wrong, which is not the intention
> for the error code.

Yeah. Added a new error called POSITION_OUT_OF_RANGE.

> 43.2 The Position field seems to be the same as the one in
> FetchSnapshotRequest. If we have both, should the requester verify the
> consistency between two values and what should the requester do if the two
> values don't match?

Yeah, the Position in the response will be the same value as the
Position in the request. I was thinking of only verifying Position
against the size of the temporary snapshot file on disk: if Position is
not equal to the size of the file, then reject the response and send
another FetchSnapshot request.
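
For example, the check on the follower could be as simple as comparing
the size of the temporary snapshot file with the Position in the
response; the file handling below is just an illustration, not the
proposed implementation:

    import java.io.IOException;
    import java.nio.channels.FileChannel;

    // Rough sketch: validate the Position in a FetchSnapshot response
    // against the bytes already written to the temporary snapshot file.
    final class SnapshotPositionValidator {
        // Returns true if the response position matches the file size. If it
        // does not match, the caller should reject the response and send
        // another FetchSnapshot request with Position set to the file size.
        static boolean positionMatches(FileChannel temporarySnapshotFile,
                                       long responsePosition) throws IOException {
            return temporarySnapshotFile.size() == responsePosition;
        }
    }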

> 44. metric: Would a metric that captures the lag in offset between the last
> snapshot and the logEndOffset be useful?

Yes. How about the difference between the last snapshot's end offset
and the high-watermark? A snapshot can only be created up to the
high-watermark.

Added this metric. Let me know if you still think we need a metric for
the difference between the largest snapshot end offset and the log end
offset.
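
As a minimal sketch of the computation behind the proposed metric (the
names are placeholders):

    // Rough sketch: the reported lag is the distance between the end offset
    // of the latest snapshot and the current high-watermark.
    final class SnapshotLag {
        static long offsetLag(long latestSnapshotEndOffset, long highWatermark) {
            // A snapshot can only cover offsets up to the high-watermark,
            // so the lag is never negative.
            return Math.max(0, highWatermark - latestSnapshotEndOffset);
        }
    }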

> 45. It seems the KIP assumes that every voter (leader and follower) and
> observer has a local replicated log for __cluster_metadata. It would be
> useful to make that clear in the overview section.

Updated the overview section. I think that this decision affects the
"Changes to Leader Election" section, though that section should not
affect observers since they don't participate in leader elections. It
also affects the "Validation of Snapshot and Log" section, but it
should be possible to fix that section even if observers don't have the
replicated log on disk.

> 46. Does this KIP cover upgrading from older versions of Kafka? If so, do
> we need IBP to guard the usage of modified FetchRequest and the new
> FetchSnapshotRequest? If not, could we make it clear that upgrading will be
> covered somewhere else?

In short, I don't think we need to increase the IBP. When we implement
snapshots for other topics like __consumer_offsets and
__transaction_state, we will have to increase the IBP; we will need
another KIP to do that. I updated the "Compatibility, Deprecation, and
Migration Plan" section to include the following information.

This KIP is only implemented for the internal topic
__cluster_metadata. An increase of the inter-broker protocol (IBP) is
not required, since we are only implementing this for the
__cluster_metadata topic partition; that partition is new and will be
handled by the KafkaRaftClient. Internal and external clients for all
other topics can ignore the SnapshotId, as that field will not be set
for topic partitions other than __cluster_metadata.
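
As an illustration of that last point, a client could simply guard on
the topic name before looking at the new field; this is only a sketch
for the discussion, not code from the KIP:

    // Rough sketch: only the __cluster_metadata partition ever carries a
    // SnapshotId, so fetchers for every other topic can ignore the field.
    final class SnapshotIdGuard {
        private static final String METADATA_TOPIC = "__cluster_metadata";

        static boolean shouldProcessSnapshotId(String topicName, boolean snapshotIdIsSet) {
            return METADATA_TOPIC.equals(topicName) && snapshotIdIsSet;
        }
    }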


Thanks,
-- 
-Jose
