Thanks a lot, Ewen! I'll make sure the documentation is clear on the differences between retries an tolerance.
Do you think percentage would have the same problem as the one you brought up? Also, if we say 10% tolerance, do we have to wait for the duration to finish before failing the task, or should we fail as soon as we hit 10% error. Alternatively, do you think making tolerance an Enum would be simpler? Where it's values are NONE (any errors kill), ALL (tolerate all errors and skip records) and FIRST (tolerate the first error, but fail after that)? Best, On Mon, May 21, 2018 at 11:28 AM, Ewen Cheslack-Postava <e...@confluent.io> wrote: > Arjun, > > Understood on retries vs tolerance -- though I suspect this will end up > being a bit confusing to users as well. It's two levels of error handling > which is what tripped me up. > > One last comment on KIP (which otherwise looks good): for the tolerance > setting, do we want it to be an absolute value or something like a > percentage? Given the current way of setting things, I'm not sure I'd ever > set it to anything but -1 or 0, with maybe 1 as an easy option for > restarting a connector to get it past one bad message, then reverting back > to -1 or 0. > > -Ewen > > On Mon, May 21, 2018 at 11:01 AM Arjun Satish <arjun.sat...@gmail.com> > wrote: > > > Hey Jason, > > > > Thanks for your comments. Please find answers inline: > > > > On Mon, May 21, 2018 at 9:52 AM, Jason Gustafson <ja...@confluent.io> > > wrote: > > > > > Hi Arjun, > > > > > > Thanks for the KIP. Just a few comments/questions: > > > > > > 1. The proposal allows users to configure the number of retries. I > > usually > > > find it easier as a user to work with timeouts since it's difficult to > > know > > > how long a retry might take. Have you considered adding a timeout > option > > > which would retry until the timeout expires? > > > > > > > Good point. Updated the KIP. > > > > 2. The configs are named very generically (e.g. errors.retries.limit). Do > > > you think it will be clear to users what operations these configs apply > > to? > > > > > > > As of now, these configs are applicable to all operations in the > connector > > pipeline (as mentioned in the proposed changes section). We decided not > to > > have per operation limit because of the additional config complexity. > > > > > > > 3. I wasn't entirely clear what messages are stored in the dead letter > > > queue. It sounds like it includes both configs and messages since we > have > > > errors.dlq.include.configs? Is there a specific schema you have in > mind? > > > > > > > This has been addressed in the KIP. The DLQ will now contain only raw > > messages (no additional context). We are also supporting DLQs only for > sink > > connectors now. > > > > > > > 4. I didn't see it mentioned explicitly in the KIP, but I assume the > > > tolerance metrics are reset after every task rebalance? > > > > > > > Great question. Yes, we will reset the tolerance metrics on every > > rebalance. > > > > > > > 5. I wonder if we can do without errors.tolerance.limit. You can get a > > > similar effect using errors.tolerance.rate.limit if you allow longer > > > durations. I'm not sure how useful an absolute counter is in practice. > > > > > > > Yeah, the rate limit does subsume the features offered by the absolute > > counter. Removed it. > > > > > > > > > > Thanks, > > > Jason > > > > > > > > >