Hi Arjun,

I'm following this very closely, as better error handling in Connect is
a high priority for MailChimp's Data Systems team.

A few thoughts (in no particular order):

For the dead letter queue configuration, could we use deadLetterQueue
instead of dlq? Acronyms are notoriously hard to keep straight in
everyone's head, and unless there's a compelling reason, it would be
nice to spend the extra characters and be explicit.

Have you considered any behavior that would periodically attempt to
restart failed tasks after a certain amount of time? To get around our
issues internally, we've deployed a tool that monitors for failed tasks
and restarts the task by hitting the REST API after the failure. Such a
config would allow us to get rid of this tool.
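
To illustrate the kind of watchdog described above, here is a minimal
sketch in Python. It is not the actual internal tool. It assumes the
standard Connect REST endpoints (GET /connectors/{name}/status and
POST /connectors/{name}/tasks/{id}/restart); the function names, base
URL, and connector name are hypothetical.

```python
import json
import urllib.request


def failed_task_ids(status):
    """Return the ids of tasks reported as FAILED in a connector status document."""
    return [t["id"] for t in status.get("tasks", []) if t.get("state") == "FAILED"]


def restart_url(base_url, connector, task_id):
    """Build the URL used to restart a single task of a connector."""
    return f"{base_url.rstrip('/')}/connectors/{connector}/tasks/{task_id}/restart"


def restart_failed_tasks(base_url, connector, opener=urllib.request.urlopen):
    """Fetch the connector status and POST a restart for every FAILED task.

    `opener` is injectable so the function can be exercised without a live worker.
    """
    status_url = f"{base_url.rstrip('/')}/connectors/{connector}/status"
    with opener(status_url) as resp:
        status = json.load(resp)

    restarted = []
    for task_id in failed_task_ids(status):
        req = urllib.request.Request(
            restart_url(base_url, connector, task_id), method="POST"
        )
        with opener(req):
            pass  # the response body is not needed
        restarted.append(task_id)
    return restarted
```

Something like restart_failed_tasks("http://localhost:8083",
"my-connector") would then be run on a schedule. A built-in periodic
restart setting would make this external polling unnecessary.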

Have you considered a config setting to allow-list additional classes as
retryable? In the situation we ran into, we were getting
ConnectExceptions that were intermittent due to an unrelated service.
With such a setting, we could have deployed a config that temporarily
allow-listed that Exception as retry-worthy and continued attempting to
make progress while the other team worked on mitigating the problem.
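
A rough sketch of the allow-list idea, in Python for brevity rather than
as a proposal for the actual Connect implementation. The comma-separated
config value and all function names here are hypothetical; nothing below
is taken from the KIP.

```python
import time


def parse_retryable(config_value):
    """Parse a comma-separated list of exception class names (hypothetical config format)."""
    return {name.strip() for name in config_value.split(",") if name.strip()}


def is_retryable(exc, retryable_names):
    """True if the exception's class, or any of its base classes, is allow-listed by name."""
    return any(cls.__name__ in retryable_names for cls in type(exc).__mro__)


def call_with_retries(operation, retryable_names, max_attempts=3, backoff_s=0.0):
    """Run `operation`, retrying only when the raised exception is allow-listed."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Give up on the last attempt, or immediately for non-allow-listed errors.
            if attempt == max_attempts or not is_retryable(exc, retryable_names):
                raise
            time.sleep(backoff_s)
```

With a list such as "ConnectException" in place, an intermittent failure
of that type is retried up to max_attempts times, while any other
exception still fails the task immediately.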

Thanks for the KIP!

On Wed, May 9, 2018 at 2:59 AM, Arjun Satish <arjun.sat...@gmail.com> wrote:

> All,
>
> I'd like to start a discussion on adding ways to handle and report record
> processing errors in Connect. Please find a KIP here:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-298%3A+Error+Handling+in+Connect
>
> Any feedback will be highly appreciated.
>
> Thanks very much,
> Arjun
>
