Arjun, Understood on retries vs tolerance -- though I suspect this will end up being a bit confusing to users as well. It's two levels of error handling which is what tripped me up.
One last comment on KIP (which otherwise looks good): for the tolerance setting, do we want it to be an absolute value or something like a percentage? Given the current way of setting things, I'm not sure I'd ever set it to anything but -1 or 0, with maybe 1 as an easy option for restarting a connector to get it past one bad message, then reverting back to -1 or 0. -Ewen On Mon, May 21, 2018 at 11:01 AM Arjun Satish <arjun.sat...@gmail.com> wrote: > Hey Jason, > > Thanks for your comments. Please find answers inline: > > On Mon, May 21, 2018 at 9:52 AM, Jason Gustafson <ja...@confluent.io> > wrote: > > > Hi Arjun, > > > > Thanks for the KIP. Just a few comments/questions: > > > > 1. The proposal allows users to configure the number of retries. I > usually > > find it easier as a user to work with timeouts since it's difficult to > know > > how long a retry might take. Have you considered adding a timeout option > > which would retry until the timeout expires? > > > > Good point. Updated the KIP. > > 2. The configs are named very generically (e.g. errors.retries.limit). Do > > you think it will be clear to users what operations these configs apply > to? > > > > As of now, these configs are applicable to all operations in the connector > pipeline (as mentioned in the proposed changes section). We decided not to > have per operation limit because of the additional config complexity. > > > > 3. I wasn't entirely clear what messages are stored in the dead letter > > queue. It sounds like it includes both configs and messages since we have > > errors.dlq.include.configs? Is there a specific schema you have in mind? > > > > This has been addressed in the KIP. The DLQ will now contain only raw > messages (no additional context). We are also supporting DLQs only for sink > connectors now. > > > > 4. I didn't see it mentioned explicitly in the KIP, but I assume the > > tolerance metrics are reset after every task rebalance? > > > > Great question. Yes, we will reset the tolerance metrics on every > rebalance. > > > > 5. I wonder if we can do without errors.tolerance.limit. You can get a > > similar effect using errors.tolerance.rate.limit if you allow longer > > durations. I'm not sure how useful an absolute counter is in practice. > > > > Yeah, the rate limit does subsume the features offered by the absolute > counter. Removed it. > > > > > > Thanks, > > Jason > > > > >