Hi Andrew and team,

Congrats on the KIP passing. The design is really solid and much needed for
the "Queues for Kafka" roadmap. I've been tied up, but finally had a chance
to look at the implementation path for share groups and wanted to flag a
few "day 2" operational risks. In my experience with high-throughput
pipelines, these are the edge cases that usually lead to 2 AM outages if
the broker-side logic isn't tightened up before GA.

1. Coordinator Failover & Duplicates
The KIP acknowledges that DLQ writes and state topic updates aren't atomic,
meaning a coordinator failover (and PID reset) can produce duplicates. For
anyone in finance or regulated industries, this breaks the 1:1 audit trail
we rely on for compliance. This is a critical gap. We need a clear plan for
deduplication during the coordinator recovery path.
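Roughly what I have in mind for the recovery path (a sketch only; the class
and method names are mine, not from the KIP): replay the share-state topic
on failover and skip any (partition, offset) pair that was already archived,
instead of blindly re-issuing DLQ writes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: dedupe DLQ writes during coordinator recovery by
// replaying share-group state and skipping offsets already archived.
public class DlqRecoveryDeduper {
    // partition id -> set of source offsets already written to the DLQ
    private final Map<Integer, Set<Long>> archived = new HashMap<>();

    // Called while replaying share-group state during coordinator recovery.
    public void markArchived(int partition, long offset) {
        archived.computeIfAbsent(partition, p -> new HashSet<>()).add(offset);
    }

    // Called before re-issuing a DLQ write after failover; true only if
    // this (partition, offset) has not already been archived.
    public boolean shouldWrite(int partition, long offset) {
        return !archived.getOrDefault(partition, Set.of()).contains(offset);
    }
}
```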

2. Handling a Stuck ARCHIVING State
If the DLQ topic goes offline or hits a leader election, we can't let
records sit in ARCHIVING indefinitely. Without a configurable
errors.deadletterqueue.write.timeout.ms, records could stay stuck during a
sustained outage, creating unbounded memory pressure. I'd suggest a
fall-through to ARCHIVED with a logged error to keep the system alive if
the DLQ is unreachable.
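To make the fall-through concrete, here's a sketch of the state transition
I'm proposing (class and method names are illustrative; only the config name
errors.deadletterqueue.write.timeout.ms is from my proposal above):

```java
// Sketch, not an implementation: a per-record deadline for the ARCHIVING
// state. Once the proposed errors.deadletterqueue.write.timeout.ms elapses,
// fall through to ARCHIVED rather than pinning broker memory forever.
public class ArchivingTimeout {
    enum State { ARCHIVING, ARCHIVED }

    private final long timeoutMs;

    public ArchivingTimeout(long timeoutMs) { this.timeoutMs = timeoutMs; }

    // enteredAtMs: when the record entered ARCHIVING; nowMs: current time.
    public State resolve(long enteredAtMs, long nowMs, boolean dlqAcked) {
        if (dlqAcked) return State.ARCHIVED;  // normal path: DLQ write acked
        if (nowMs - enteredAtMs >= timeoutMs) {
            // A real broker would log an ERROR with the record coordinates
            // here before giving up on the DLQ write.
            return State.ARCHIVED;            // fall-through path
        }
        return State.ARCHIVING;               // keep waiting for the ack
    }
}
```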

3. Bounded Retries on the Broker
The KIP mentions retrying on metadata/leadership issues but doesn't specify
a limit. I'd propose a new config, errors.deadletterqueue.write.retries, to
provide a clean exit condition. Without a cap, a total partition failure
could trigger an indefinite retry loop, wasting broker I/O and CPU.
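The exit condition is trivial to express; something like this (sketch only,
the config name is my proposal above and the class name is illustrative):

```java
// Sketch of a bounded retry counter for broker-side DLQ writes, capped by
// the proposed errors.deadletterqueue.write.retries config.
public class BoundedDlqRetry {
    private final int maxRetries; // errors.deadletterqueue.write.retries
    private int attempts = 0;

    public BoundedDlqRetry(int maxRetries) { this.maxRetries = maxRetries; }

    // Returns true if another DLQ write should be attempted after a
    // retriable (metadata/leadership) failure; false is the clean exit.
    public boolean retryAfterFailure() {
        attempts++;
        return attempts <= maxRetries;
    }
}
```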

4. Circuit Breaker for Systemic Failures
This is the most critical point for me. If a downstream service dies, the
share group will hit the delivery limit for every message, effectively
draining the main topic into the DLQ in minutes. This kills message order
and makes re-processing a nightmare. I'd propose a threshold  if >20% of
messages hit the DLQ in a rolling window, the group should PAUSE. It's
always safer to stop the group than to dump the whole topic.
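A rolling-window breaker along these lines is what I have in mind (sketch
only; window size, threshold, and names are illustrative, not from the KIP):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a per-group circuit breaker: pause the share group when the
// DLQ ratio over a rolling window of delivery outcomes crosses a threshold.
public class DlqCircuitBreaker {
    private final int windowSize;
    private final double pauseThreshold;  // e.g. 0.20 for 20%
    private final Deque<Boolean> outcomes = new ArrayDeque<>();
    private int dlqCount = 0;

    public DlqCircuitBreaker(int windowSize, double pauseThreshold) {
        this.windowSize = windowSize;
        this.pauseThreshold = pauseThreshold;
    }

    // Record one delivery outcome: true = the record went to the DLQ.
    public void record(boolean wentToDlq) {
        outcomes.addLast(wentToDlq);
        if (wentToDlq) dlqCount++;
        if (outcomes.size() > windowSize && outcomes.removeFirst()) {
            dlqCount--;  // evict the oldest outcome from the window
        }
    }

    // Pause once a full window's DLQ ratio exceeds the threshold,
    // distinguishing a systemic failure from the odd poison pill.
    public boolean shouldPause() {
        return outcomes.size() == windowSize
            && (double) dlqCount / windowSize > pauseThreshold;
    }
}
```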

5. Mandatory Disposition Headers
Since the broker already knows if a record failed due to
MAX_DELIVERY_ATTEMPTS_REACHED vs. an explicit CLIENT_REJECTED NACK, we
should make that a mandatory _dlq.errors.disposition header. Without it,
operators can't distinguish a poison pill from a systemic timeout without
digging through broker logs.
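For illustration, the header could be as simple as this (the
_dlq.errors.disposition key is my proposal above; the enum values mirror the
dispositions the broker already tracks, but the class shape is mine):

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed mandatory disposition header attached to every
// DLQ record, so operators can triage without reading broker logs.
public class DispositionHeader {
    public static final String KEY = "_dlq.errors.disposition";

    public enum Disposition { MAX_DELIVERY_ATTEMPTS_REACHED, CLIENT_REJECTED }

    // Build the header map the broker would attach to the DLQ record.
    public static Map<String, byte[]> headersFor(Disposition d) {
        Map<String, byte[]> headers = new HashMap<>();
        headers.put(KEY, d.name().getBytes(StandardCharsets.UTF_8));
        return headers;
    }
}
```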

6. DLQ Ownership Check
We should add a check at the coordinator level to ensure a DLQ topic isn't
shared by multiple groups. Cross-contamination makes the DLQ useless for
debugging if you're seeing failures from unrelated applications in the same
stream.
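The coordinator-side check could be a simple first-claim-wins registry
(sketch only; names and semantics are illustrative, not from the KIP):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a coordinator-side ownership check: the first share group to
// configure a DLQ topic claims it, and other groups are rejected.
public class DlqOwnershipRegistry {
    private final Map<String, String> ownerByTopic = new HashMap<>();

    // Returns true if groupId may use dlqTopic: first claim wins, and a
    // group may re-register its own topic. False means the topic is
    // already owned by a different group.
    public boolean tryClaim(String dlqTopic, String groupId) {
        String owner = ownerByTopic.putIfAbsent(dlqTopic, groupId);
        return owner == null || owner.equals(groupId);
    }
}
```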

I'm particularly interested in your thoughts on the circuit breaker and the
write timeouts, as those seem like the biggest stability risks at scale.
Happy to help spec either of these out if the team finds them worthwhile.

Best regards,
Vaquar Khan
LinkedIn - https://www.linkedin.com/in/vaquar-khan-b695577/
Book - https://us.amazon.com/stores/Vaquar-Khan/author/B0DMJCG9W6?ref=ap_rdr&shoppingPortalEnabled=true
GitBook - https://vaquarkhan.github.io/microservices-recipes-a-free-gitbook/
Stack Overflow - https://stackoverflow.com/users/4812170/vaquar-khan
GitHub - https://github.com/vaquarkhan