Hi Arjun, I'm following this very closely as better error handling in Connect is a high priority for MailChimp's Data Systems team.
A few thoughts (in no particular order):

For the dead letter queue configuration, could we use deadLetterQueue instead of dlq? Acronyms are notoriously hard to keep straight in everyone's head, and unless there's a compelling reason otherwise, it would be nice to spell the name out and be explicit.

Have you considered any behavior that would periodically attempt to restart failed tasks after a certain amount of time? To get around our issues internally, we've deployed a tool that monitors for failed tasks and restarts each one by hitting the REST API after the failure. Such a config would allow us to get rid of this tool.

Have you considered a config setting to allow-list additional classes as retryable? In the situation we ran into, we were getting intermittent ConnectExceptions caused by an unrelated service. With such a setting, we could have deployed a config that temporarily allow-listed that exception as retry-worthy and continued attempting to make progress while the other team worked on mitigating the problem.

Thanks for the KIP!

On Wed, May 9, 2018 at 2:59 AM, Arjun Satish <arjun.sat...@gmail.com> wrote:

> All,
>
> I'd like to start a discussion on adding ways to handle and report record
> processing errors in Connect. Please find a KIP here:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-298%3A+Error+Handling+in+Connect
>
> Any feedback will be highly appreciated.
>
> Thanks very much,
> Arjun
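
For illustration only, the restart workaround described above could look roughly like the sketch below. It is not the internal tool itself. It assumes a Connect worker reachable at a hypothetical `CONNECT_URL`, and it relies on the Connect REST endpoints for connector status (`GET /connectors/{name}/status`) and task restart (`POST /connectors/{name}/tasks/{id}/restart`). The function names are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a Connect worker listening on the default REST port.
CONNECT_URL = "http://localhost:8083"


def failed_task_ids(status):
    """Return the ids of tasks reported as FAILED in a connector status payload."""
    return [t["id"] for t in status.get("tasks", []) if t.get("state") == "FAILED"]


def restart_path(connector, task_id):
    """Build the REST path used to restart a single task of a connector."""
    name = urllib.parse.quote(connector, safe="")
    return f"/connectors/{name}/tasks/{task_id}/restart"


def restart_failed_tasks(connector, base_url=CONNECT_URL):
    """Fetch the connector status and POST a restart for every FAILED task."""
    name = urllib.parse.quote(connector, safe="")
    with urllib.request.urlopen(f"{base_url}/connectors/{name}/status") as resp:
        status = json.load(resp)

    restarted = []
    for task_id in failed_task_ids(status):
        req = urllib.request.Request(
            base_url + restart_path(connector, task_id), method="POST"
        )
        urllib.request.urlopen(req).close()
        restarted.append(task_id)
    return restarted
```

Running `restart_failed_tasks("my-sink")` on a timer (cron, for instance) approximates the periodic restart behavior asked about above; a built-in config would make this kind of external polling unnecessary.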