Done, thanks. I'll open a vote thread now.
Eno
> On 23 Jun 2017, at 02:15, Matthias J. Sax <matth...@confluent.io> wrote:
>
> I also think that one config is better, with two default
> implementations: failing and log-and-continue.
>
> However, I think we should fail by default, similar to the timestamp
> extractor, as "silent" data loss is no good default behavior IMHO.
>
>
> -Matthias
>
> On 6/22/17 11:00 AM, Eno Thereska wrote:
>> Answers inline:
>>
>>> On 22 Jun 2017, at 03:26, Guozhang Wang <wangg...@gmail.com> wrote:
>>>
>>> Thanks for the updated KIP, some more comments:
>>>
>>> 1. The config name is "default.deserialization.exception.handler" while the
>>> interface class name is "RecordExceptionHandler", which is more general
>>> than the intended purpose. Could we rename the class accordingly?
>>
>> Sure.
>>
>>
>>>
>>> 2. Could you describe the full implementation of "DefaultExceptionHandler"?
>>> Currently it is not clear to me how it would be implemented with the
>>> configured value.
>>>
>>> In addition, I think we do not need to include an additional
>>> "DEFAULT_DESERIALIZATION_EXCEPTION_RESPONSE_CONFIG", as the configure()
>>> function is mainly used for users to pass any customized parameters that are
>>> outside the Streams library; plus, adding such an additional config sounds
>>> over-complicated for a default exception handler. Instead I'd suggest we
>>> just provide two handlers (or three if people feel strongly about the
>>> LogAndThresholdExceptionHandler): FailOnExceptionHandler and
>>> LogAndContinueOnExceptionHandler. And we can set
>>> LogAndContinueOnExceptionHandler by default.
>>>
>>
>> That's what I had originally. Jay mentioned he preferred one default class
>> with config options. With that approach, you'd have two config options, one
>> for failing and one for continuing, and the one exception handler would take
>> those options during its configure() call.
>>
>> After checking the other exception handlers in the code, I might revert to
>> what I originally had (two default handlers), as Guozhang also suggests, but
>> still have the interface extend Configurable. Guozhang, are you ok with that?
>> In that case there is no need for the response config option.
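>>
>> For illustration, a minimal sketch of what the interface and the two
>> built-in handlers could look like (class names, the method signature and the
>> Response enum here are assumptions for discussion, not the final KIP API):
>>
>> import java.util.Map;
>> import org.apache.kafka.clients.consumer.ConsumerRecord;
>> import org.apache.kafka.common.Configurable;
>>
>> // Hypothetical handler interface; it extends Configurable so that custom
>> // handlers can receive configuration values via configure().
>> public interface DeserializationExceptionHandler extends Configurable {
>>     enum Response { FAIL, CONTINUE }
>>     Response handle(ConsumerRecord<byte[], byte[]> record, Exception exception);
>> }
>>
>> // Built-in option 1: always fail fast.
>> class FailOnExceptionHandler implements DeserializationExceptionHandler {
>>     public void configure(Map<String, ?> configs) { /* nothing to configure */ }
>>     public Response handle(ConsumerRecord<byte[], byte[]> record, Exception exception) {
>>         return Response.FAIL;
>>     }
>> }
>>
>> // Built-in option 2: log the bad record's coordinates and keep going.
>> class LogAndContinueOnExceptionHandler implements DeserializationExceptionHandler {
>>     public void configure(Map<String, ?> configs) { /* nothing to configure */ }
>>     public Response handle(ConsumerRecord<byte[], byte[]> record, Exception exception) {
>>         System.err.printf("Skipping record from %s-%d at offset %d: %s%n",
>>                 record.topic(), record.partition(), record.offset(), exception);
>>         return Response.CONTINUE;
>>     }
>> }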
>>
>> Thanks
>> Eno
>>
>>
>>>
>>> Guozhang
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Jun 21, 2017 at 1:39 AM, Eno Thereska <eno.there...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Guozhang,
>>>>
>>>> I’ve updated the KIP and hopefully addressed all the comments so far. In
>>>> the process I also changed the name of the KIP to reflect its scope better:
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-161%3A+streams+deserialization+exception+handlers
>>>>
>>>> Any other feedback appreciated, otherwise I’ll start the vote soon.
>>>>
>>>> Thanks
>>>> Eno
>>>>
>>>>> On Jun 12, 2017, at 6:28 AM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>
>>>>> Eno, thanks for bringing this proposal up, and sorry for being late on
>>>>> this. Here are my two cents:
>>>>>
>>>>> 1. First some meta comments regarding "fail fast" vs. "making progress". I
>>>>> agree that in general we should "enforce users to do the right thing" in
>>>>> system design, but we also need to keep in mind that Kafka is a
>>>>> multi-tenant system, i.e. from a Streams app's point of view you probably
>>>>> do not control the whole stream processing pipeline end-to-end. E.g. your
>>>>> input data may not be controlled by yourself; it could be written by
>>>>> another app, another team in your company, or even a different
>>>>> organization, and if an error happens you may not be able to fix it "the
>>>>> right way" by yourself in time. In such an environment I think it is
>>>>> important to leave the door open to let users be more resilient. So I find
>>>>> the current proposal, which does leave the door open for either fail-fast
>>>>> or making progress, quite reasonable.
>>>>>
>>>>> 2. On the other hand, if the question is whether we should provide a
>>>>> built-in "send to bad queue" handler from the library, I think that might
>>>>> be overkill: with some tweaks (see my detailed comments below) on the API
>>>>> we can allow users to implement such handlers pretty easily. In fact, I
>>>>> feel even "LogAndThresholdExceptionHandler" is not necessary as a built-in
>>>>> handler, as it would then require users to specify the threshold via
>>>>> configs, etc. I think letting people provide such "eco-libraries" may be
>>>>> better.
>>>>>
>>>>> 3. Regarding the CRC error: today we validate the CRC on both the broker
>>>>> end upon receiving produce requests and on the consumer end upon receiving
>>>>> fetch responses; if the CRC validation fails in the former case the data
>>>>> is not appended to the broker logs. So if we do see a CRC failure on the
>>>>> consumer side it has to be that either we have a flipped bit on the broker
>>>>> disks or over the wire. The first case is fatal while the second is
>>>>> retriable. Unfortunately we cannot tell which case it is when seeing a CRC
>>>>> validation failure. But in either case, just skipping and making progress
>>>>> does not seem like a good choice here, and hence I would personally
>>>>> exclude these errors from the general serde errors so as to NOT leave the
>>>>> door open for making progress.
>>>>>
>>>>> Currently such errors are thrown as a KafkaException that wraps an
>>>>> InvalidRecordException, which may be too general, and we could consider
>>>>> just throwing the InvalidRecordException directly. But that could be an
>>>>> orthogonal discussion if we agree that CRC failures should not be
>>>>> considered in this KIP.
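>>>>>
>>>>> (For illustration only: a sketch of how one could keep CRC/corruption
>>>>> errors out of the handler path while still passing plain deserialization
>>>>> failures to it. The exception classes are the ones named above; whether
>>>>> InvalidRecordException is directly accessible to user code is an
>>>>> assumption, hence the class-name comparison.)
>>>>>
>>>>> import org.apache.kafka.common.KafkaException;
>>>>> import org.apache.kafka.common.errors.SerializationException;
>>>>>
>>>>> final class ErrorClassifier {
>>>>>     // Hypothetical predicate: only plain deserialization failures would be
>>>>>     // routed to the configured handler; corrupt-record (bad CRC) errors
>>>>>     // always fail fast because skipping them is never safe.
>>>>>     static boolean eligibleForHandler(Exception e) {
>>>>>         if (e instanceof KafkaException
>>>>>                 && e.getCause() != null
>>>>>                 && e.getCause().getClass().getSimpleName().equals("InvalidRecordException")) {
>>>>>             return false; // flipped bit on disk or on the wire: fatal or retriable, never skippable
>>>>>         }
>>>>>         return e instanceof SerializationException;
>>>>>     }
>>>>> }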
>>>>>
>>>>> ----------------
>>>>>
>>>>> Now some detailed comments:
>>>>>
>>>>> 4. Could we consider adding the processor context to the handle() function
>>>>> as well? This context would wrap the source node that is about to process
>>>>> the record. It could expose more info, such as which task / source node
>>>>> saw this error and the timestamp of the message, and it would also allow
>>>>> users to implement their handlers by exposing some metrics, or by calling
>>>>> context.forward() to implement the "send to bad queue" behavior, etc.
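>>>>>
>>>>> (A sketch of what a handler with access to the context could do; the
>>>>> handle() signature below is the suggestion above, not an agreed API, and
>>>>> the class name is made up.)
>>>>>
>>>>> import org.apache.kafka.clients.consumer.ConsumerRecord;
>>>>> import org.apache.kafka.streams.processor.ProcessorContext;
>>>>>
>>>>> public class LogWithContextExceptionHandler {
>>>>>     public enum Response { FAIL, CONTINUE }
>>>>>
>>>>>     // The context tells us where in the topology the failure happened.
>>>>>     public Response handle(ProcessorContext context,
>>>>>                            ConsumerRecord<byte[], byte[]> record,
>>>>>                            Exception exception) {
>>>>>         System.err.printf("app %s, task %s: failed to deserialize %s-%d@%d: %s%n",
>>>>>                 context.applicationId(), context.taskId(),
>>>>>                 record.topic(), record.partition(), record.offset(), exception);
>>>>>         return Response.CONTINUE;
>>>>>     }
>>>>> }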
>>>>>
>>>>> 5. Could you add the string name of
>>>>> StreamsConfig.DEFAULT_RECORD_EXCEPTION_HANDLER as well in the KIP?
>>>>> Personally I find the "default" prefix a bit misleading since we do not
>>>>> allow users to override it per-node yet. But I'm okay either way, as I can
>>>>> see we may extend it in the future and probably would not want to rename
>>>>> the config again. Also, from the experience with `default partitioner` and
>>>>> `default timestamp extractor`, we may want to make sure that the passed-in
>>>>> object can be either a string "class name" or a class object?
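>>>>>
>>>>> (For instance, both of these forms could be accepted; the config key
>>>>> string is the one discussed in this thread and the handler class is
>>>>> hypothetical.)
>>>>>
>>>>> Properties props = new Properties();
>>>>> props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
>>>>> props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
>>>>> // handler given as a class object ...
>>>>> props.put("default.deserialization.exception.handler",
>>>>>           LogAndContinueOnExceptionHandler.class);
>>>>> // ... or as a fully qualified class name string
>>>>> props.put("default.deserialization.exception.handler",
>>>>>           "com.example.LogAndContinueOnExceptionHandler");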
>>>>>
>>>>>
>>>>> Guozhang
>>>>>
>>>>>
>>>>> On Wed, Jun 7, 2017 at 2:16 PM, Jan Filipiak <jan.filip...@trivago.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Eno,
>>>>>>
>>>>>> On 07.06.2017 22:49, Eno Thereska wrote:
>>>>>>
>>>>>>> Comments inline:
>>>>>>>
>>>>>>> On 5 Jun 2017, at 18:19, Jan Filipiak <jan.filip...@trivago.com> wrote:
>>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> just my few thoughts
>>>>>>>>
>>>>>>>> On 05.06.2017 11:44, Eno Thereska wrote:
>>>>>>>>
>>>>>>>>> Hi there,
>>>>>>>>>
>>>>>>>>> Sorry for the late reply, I was out this past week. Looks like good
>>>>>>>>> progress was made with the discussions either way. Let me recap a
>>>>>>>>> couple of points I saw into one big reply:
>>>>>>>>>
>>>>>>>>> 1. Jan mentioned CRC errors. I think this is a good point. As these
>>>>>>>>> happen in Kafka, before Kafka Streams gets a chance to inspect
>>>>>>>>> anything, I'd like to hear the opinion of more Kafka folks like Ismael
>>>>>>>>> or Jason on this one. Currently the documentation is not great about
>>>>>>>>> what to do once a CRC check has failed. From looking at the code, it
>>>>>>>>> looks like the client gets a KafkaException (bubbled up from the
>>>>>>>>> fetcher) and currently in Streams we catch this as part of poll() and
>>>>>>>>> fail. It might be advantageous to treat CRC handling in a similar way
>>>>>>>>> to serialisation handling (e.g., have the option to fail/skip). Let's
>>>>>>>>> see what the other folks say. Worst case we can do a separate KIP for
>>>>>>>>> that if it proves too hard to do in one go.
>>>>>>>>>
>>>>>>>> there is no reasonable way to "skip" a CRC error. How can you know that
>>>>>>>> the length you read was anything reasonable? You might be completely
>>>>>>>> lost inside your response.
>>>>>>>>
>>>>>>> On the client side, every record received is checked for validity. As it
>>>>>>> happens, if the CRC check fails the exception is wrapped in a
>>>>>>> KafkaException that is thrown all the way up to poll(). Assuming we
>>>>>>> change that and poll() throws a CRC exception, I was thinking we could
>>>>>>> treat it similarly to a deserialization exception and pass it to the
>>>>>>> exception handler to decide what to do. The default would be to fail.
>>>>>>> This might need a Kafka KIP btw and can be done separately from this
>>>>>>> KIP, but Jan, would you find this useful?
>>>>>>>
>>>>>> I don't think so. IMO you cannot reasonably continue parsing when the
>>>>>> checksum of a message is not correct. If you are not sure you got the
>>>>>> correct length, how can you be sure to find the next record? I would
>>>>>> always fail outright in all cases. It's too hard for me to understand why
>>>>>> one would try to continue. I mentioned CRCs because that's the only bad
>>>>>> pill I ever saw so far. But I am happy that it just stopped and I could
>>>>>> check what was going on. This will also be invasive in the client code
>>>>>> then.
>>>>>>
>>>>>> If you ask me, I am always going to vote for "grind to a halt": let the
>>>>>> developers see what happened and let them fix it. It helps build good
>>>>>> Kafka experiences and better software and architectures. For me this is
>>>>>> "force the user to do the right thing": https://youtu.be/aAb7hSCtvGw?t=374
>>>>>> e.g. not letting unexpected input slip by. Letting unexpected input slip
>>>>>> by is what bought us 15+ years of all sorts of ingestion attacks. I don't
>>>>>> even dare to estimate how many missing-records search teams are going to
>>>>>> be formed, maybe some HackerOne for stream apps :D
>>>>>>
>>>>>> Best Jan
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>> At a minimum, handling this type of exception will need to involve the
>>>>>>>>> exactly-once (EoS) logic. We'd still allow the option of failing or
>>>>>>>>> skipping, but EoS would need to clean up by rolling back all the side
>>>>>>>>> effects from the processing so far. Matthias, how does this sound?
>>>>>>>>>
>>>>>>>> EoS will not help; the record might be 5 or 6 repartitions down into
>>>>>>>> the topology. I haven't followed, but I pray you made EoS optional! We
>>>>>>>> don't need this, we don't want this, and we will turn it off if it
>>>>>>>> comes. So I wouldn't recommend relying on it. The option to turn it off
>>>>>>>> is better than forcing it and still being unable to roll back bad pills
>>>>>>>> (as explained before).
>>>>>>>>
>>>>>>> Yeah as Matthias mentioned EoS is optional.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Eno
>>>>>>>
>>>>>>>
>>>>>>>>> 6. Will add an end-to-end example as Michael suggested.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Eno
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 4 Jun 2017, at 02:35, Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>>>>>
>>>>>>>>>> What I don't understand is this:
>>>>>>>>>>
>>>>>>>>>> From there on its the easiest way forward: fix, redeploy, start =>
>>>>>>>>>>> done
>>>>>>>>>>>
>>>>>>>>>> If you have many producers that work fine and a new "bad" producer
>>>>>>>>>> starts up and writes bad data into your input topic, your Streams app
>>>>>>>>>> dies but all your producers, including the bad one, keep writing.
>>>>>>>>>>
>>>>>>>>>> Thus, how would you fix this, as you cannot "remove" the corrupted
>>>>>>>>>> data from the topic? It might take some time to identify the root
>>>>>>>>>> cause and stop the bad producer. Up to this point you get good and bad
>>>>>>>>>> data into your Streams input topic. If the Streams app is not able to
>>>>>>>>>> skip over those bad records, how would you get all the good data from
>>>>>>>>>> the topic? Not saying it's not possible, but it's extra work copying
>>>>>>>>>> the data with a new non-Streams consumer-producer app into a new topic
>>>>>>>>>> and then feeding your Streams app from this new topic -- and you also
>>>>>>>>>> need to update all your upstream producers to write to the new topic.
>>>>>>>>>>
>>>>>>>>>> Thus, if you want to fail fast, you can still do this. And after you
>>>>>>>>>> have detected and fixed the bad producer you might just reconfigure
>>>>>>>>>> your app to skip bad records until it reaches the good part of the
>>>>>>>>>> data. Afterwards, you could redeploy with fail-fast again.
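>>>>>>>>>>
>>>>>>>>>> (A sketch of that reconfigure-and-redeploy step, assuming the config
>>>>>>>>>> key and the two built-in handlers discussed earlier in this thread;
>>>>>>>>>> names are not final.)
>>>>>>>>>>
>>>>>>>>>> // normal operation: fail fast on any deserialization error
>>>>>>>>>> props.put("default.deserialization.exception.handler",
>>>>>>>>>>           FailOnExceptionHandler.class);
>>>>>>>>>>
>>>>>>>>>> // temporary recovery deployment: skip the known-bad records,
>>>>>>>>>> // then switch back to fail-fast and redeploy once caught up
>>>>>>>>>> props.put("default.deserialization.exception.handler",
>>>>>>>>>>           LogAndContinueOnExceptionHandler.class);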
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thus, for this pattern, I actually don't see any reason to stop the
>>>>>>>>>> Streams app at all. If you have a callback, and use the callback to
>>>>>>>>>> raise an alert (and maybe get the bad data into a bad-record queue),
>>>>>>>>>> it will not take longer to identify and stop the "bad" producer. But
>>>>>>>>>> for this case, you have zero downtime for your Streams app.
>>>>>>>>>>
>>>>>>>>>> This seems to be much simpler. Or do I miss anything?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Having said this, I agree that the "threshold based callback" might
>>>>>>>>>> be questionable. But as you argue for strict "fail-fast", I want to
>>>>>>>>>> argue that this must not always be the best pattern to apply, and that
>>>>>>>>>> the overall KIP idea is super useful from my point of view.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -Matthias
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 6/3/17 11:57 AM, Jan Filipiak wrote:
>>>>>>>>>>
>>>>>>>>>>> Could not agree more!
>>>>>>>>>>>
>>>>>>>>>>> But then I think the easiest is still: print exception and die.
>>>>>>>>>>> From there on it's the easiest way forward: fix, redeploy, start =>
>>>>>>>>>>> done
>>>>>>>>>>>
>>>>>>>>>>> All the other ways to recover a pipeline that was only partially
>>>>>>>>>>> processing all the time and suddenly went over an "I can't take it
>>>>>>>>>>> anymore" threshold are not straightforward IMO.
>>>>>>>>>>>
>>>>>>>>>>> How do you find the offset where it became too bad, when it is not
>>>>>>>>>>> the latest committed one?
>>>>>>>>>>> How do you reset there, with something reasonable left in your
>>>>>>>>>>> RocksDB stores?
>>>>>>>>>>>
>>>>>>>>>>> If one would do the following: the continuing handler would measure
>>>>>>>>>>> against a threshold and would terminate after a certain threshold has
>>>>>>>>>>> been passed (per task). Then one can use offset commit / flush
>>>>>>>>>>> intervals to make reasonable assumptions about how much is slipping
>>>>>>>>>>> by, plus you get an easy recovery when it gets too bad, plus you
>>>>>>>>>>> could also account for "in processing" records.
>>>>>>>>>>>
>>>>>>>>>>> Setting this threshold to zero would cover all cases with one
>>>>>>>>>>> implementation. It is still beneficial to have it pluggable.
>>>>>>>>>>>
>>>>>>>>>>> Again, CRC errors are the only bad pills we saw in production so far.
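>>>>>>>>>>>
>>>>>>>>>>> (A sketch of that threshold idea, assuming the pluggable handler
>>>>>>>>>>> shape discussed in this thread; the config key is made up.)
>>>>>>>>>>>
>>>>>>>>>>> import java.util.Map;
>>>>>>>>>>>
>>>>>>>>>>> // Hypothetical handler: continue until 'threshold' failures have
>>>>>>>>>>> // been seen by this handler instance (i.e. per task), then fail.
>>>>>>>>>>> // A threshold of zero behaves like fail-fast.
>>>>>>>>>>> public class LogAndThresholdExceptionHandler {
>>>>>>>>>>>     public enum Response { FAIL, CONTINUE }
>>>>>>>>>>>
>>>>>>>>>>>     private long threshold;
>>>>>>>>>>>     private long failures = 0;
>>>>>>>>>>>
>>>>>>>>>>>     public void configure(Map<String, ?> configs) {
>>>>>>>>>>>         Object v = configs.get("deserialization.failure.threshold"); // made-up key
>>>>>>>>>>>         threshold = (v == null) ? 0L : Long.parseLong(v.toString());
>>>>>>>>>>>     }
>>>>>>>>>>>
>>>>>>>>>>>     public Response handle(Exception exception) {
>>>>>>>>>>>         failures++;
>>>>>>>>>>>         return failures > threshold ? Response.FAIL : Response.CONTINUE;
>>>>>>>>>>>     }
>>>>>>>>>>> }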
>>>>>>>>>>>
>>>>>>>>>>> Best Jan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 02.06.2017 17:37, Jay Kreps wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Jan, I agree with you philosophically. I think one practical
>>>>>>>>>>>> challenge has to do with data formats. Many people use untyped
>>>>>>>>>>>> events, so there is simply no guarantee on the form of the input.
>>>>>>>>>>>> E.g. many companies use JSON without any kind of schema, so it
>>>>>>>>>>>> becomes very hard to assert anything about the input, which makes
>>>>>>>>>>>> these programs very fragile to the "one accidental message
>>>>>>>>>>>> publication" that creates an unsolvable problem.
>>>>>>>>>>>>
>>>>>>>>>>>> For that reason I do wonder if limiting to just serialization
>>>>>>>>>>>> actually gets you a useful solution. For JSON it will help with the
>>>>>>>>>>>> problem of non-parseable JSON, but it sounds like it won't help in
>>>>>>>>>>>> the case where the JSON is well-formed but does not have any of the
>>>>>>>>>>>> fields you expect and depend on for your processing. I expect the
>>>>>>>>>>>> reason for limiting the scope is that it is pretty hard to reason
>>>>>>>>>>>> about correctness for anything that stops in the middle of
>>>>>>>>>>>> processing an operator DAG?
>>>>>>>>>>>>
>>>>>>>>>>>> -Jay
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jun 2, 2017 at 4:50 AM, Jan Filipiak <jan.filip...@trivago.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> IMHO you're doing it wrong then. Plus, building too much support
>>>>>>>>>>>>> into the Kafka ecosystem is very counterproductive in fostering a
>>>>>>>>>>>>> happy user base.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 02.06.2017 13:15, Damian Guy wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jan, you have a choice to Fail fast if you want. This is about
>>>>>>>>>>>>>> giving
>>>>>>>>>>>>>> people options and there are times when you don't want to fail
>>>>>>>>>>>>>> fast.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, 2 Jun 2017 at 11:00 Jan Filipiak <jan.filip...@trivago.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1.
>>>>>>>>>>>>>>> That greatly complicates monitoring. Fail fast gives you that:
>>>>>>>>>>>>>>> when you monitor only the lag of all your apps you are completely
>>>>>>>>>>>>>>> covered. With that sort of new application, monitoring is very
>>>>>>>>>>>>>>> much more complicated, as you now need to monitor the fail
>>>>>>>>>>>>>>> percentage of some special apps as well. In my opinion that is a
>>>>>>>>>>>>>>> huge downside already.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>>> Using a schema registry like the Avro stuff, it might not even be
>>>>>>>>>>>>>>> the record that is broken; it might just be your app being unable
>>>>>>>>>>>>>>> to fetch a schema it now needs to know. Maybe you got partitioned
>>>>>>>>>>>>>>> away from that registry.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 3.
>>>>>>>>>>>>>>> When you get alerted because of a too-high fail percentage, what
>>>>>>>>>>>>>>> are the steps you are going to take? Shut it down to buy time. Fix
>>>>>>>>>>>>>>> the problem. Spend way too much time finding a good reprocess
>>>>>>>>>>>>>>> offset. Your time windows are in bad shape anyway, and you have
>>>>>>>>>>>>>>> pretty much lost. This routine is nonsense.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dead letter queues would be the worst possible addition to the
>>>>>>>>>>>>>>> Kafka toolkit that I can think of. It just doesn't fit an
>>>>>>>>>>>>>>> architecture in which clients falling behind is a valid option.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Further, I already mentioned that the only bad pill I've seen so
>>>>>>>>>>>>>>> far is CRC errors. Any plans for those?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best Jan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 02.06.2017 11:34, Damian Guy wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree with what Matthias has said w.r.t. failing fast. There
>>>>>>>>>>>>>>>> are plenty of times when you don't want to fail fast and must
>>>>>>>>>>>>>>>> attempt to make progress.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The dead-letter queue is exactly for these circumstances. Of
>>>>>>>>>>>>>>>> course if every record is failing, then you probably do want to
>>>>>>>>>>>>>>>> give up.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, 2 Jun 2017 at 07:56 Matthias J. Sax <matth...@confluent.io>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> First a meta comment. KIP discussion should take place on the
>>>>>>>>>>>>>>>>> dev list -- if the user list is cc'ed please make sure to reply
>>>>>>>>>>>>>>>>> to both lists. Thanks.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for making the scope of the KIP clear. Makes a lot of
>>>>>>>>>>>>>>>>> sense to focus on deserialization exceptions for now.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With regard to corrupted state stores, would it make sense to
>>>>>>>>>>>>>>>>> fail a task and wipe out the store to repair it via recreation
>>>>>>>>>>>>>>>>> from the changelog? That's of course a quite advanced pattern,
>>>>>>>>>>>>>>>>> but I want to bring it up to design the first step in a way such
>>>>>>>>>>>>>>>>> that we can get there (if we think it's a reasonable idea).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I also want to comment about fail fast vs. making progress. I
>>>>>>>>>>>>>>>>> think that fail-fast must not always be the best option. The
>>>>>>>>>>>>>>>>> scenario I have in mind is like this: you have a bunch of
>>>>>>>>>>>>>>>>> producers that feed the Streams input topic. Most producers work
>>>>>>>>>>>>>>>>> fine, but maybe one producer misbehaves and the data it writes
>>>>>>>>>>>>>>>>> is corrupted. You might not even be able to recover this lost
>>>>>>>>>>>>>>>>> data at any point -- thus, there is no reason to stop processing;
>>>>>>>>>>>>>>>>> you just skip over those records. Of course, you need to fix the
>>>>>>>>>>>>>>>>> root cause, and thus you need to alert (either via logs or the
>>>>>>>>>>>>>>>>> exception handler directly) and you need to start to investigate
>>>>>>>>>>>>>>>>> to find the bad producer, shut it down and fix it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here the dead letter queue comes into place. From my
>>>>>>>>>>>>>>>>> understanding, the purpose of this feature is solely to enable
>>>>>>>>>>>>>>>>> post-mortem debugging. I don't think those records would be fed
>>>>>>>>>>>>>>>>> back at any point in time (so I don't see any ordering issue --
>>>>>>>>>>>>>>>>> a skipped record, in this regard, is just "fully processed").
>>>>>>>>>>>>>>>>> Thus, the dead letter queue should actually encode the original
>>>>>>>>>>>>>>>>> record's metadata (topic, partition, offset, etc.) to enable such
>>>>>>>>>>>>>>>>> debugging. I guess this might also be possible if you just log
>>>>>>>>>>>>>>>>> the bad records, but it would be harder to access (you first must
>>>>>>>>>>>>>>>>> find the Streams instance that did write the log and extract the
>>>>>>>>>>>>>>>>> information from there). Reading it from a topic is much simpler.
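>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (One possible encoding of that metadata when producing to a
>>>>>>>>>>>>>>>>> dead-letter topic; the topic name and layout are made up, the
>>>>>>>>>>>>>>>>> key carries the original coordinates and the value the raw
>>>>>>>>>>>>>>>>> bytes.)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> import java.nio.charset.StandardCharsets;
>>>>>>>>>>>>>>>>> import org.apache.kafka.clients.consumer.ConsumerRecord;
>>>>>>>>>>>>>>>>> import org.apache.kafka.clients.producer.ProducerRecord;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> final class DeadLetters {
>>>>>>>>>>>>>>>>>     // Keep the original topic/partition/offset so the bad record
>>>>>>>>>>>>>>>>>     // can be traced back later for debugging.
>>>>>>>>>>>>>>>>>     static ProducerRecord<byte[], byte[]> toDeadLetter(ConsumerRecord<byte[], byte[]> failed) {
>>>>>>>>>>>>>>>>>         String coordinates = failed.topic() + "/" + failed.partition() + "/" + failed.offset();
>>>>>>>>>>>>>>>>>         return new ProducerRecord<>("my-app-dead-letters", // made-up topic name
>>>>>>>>>>>>>>>>>                 coordinates.getBytes(StandardCharsets.UTF_8),
>>>>>>>>>>>>>>>>>                 failed.value());
>>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>> }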
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I also want to mention the following. Assume you have such a
>>>>>>>>>>>>>>>>> topic with some bad records and some good records. If we always
>>>>>>>>>>>>>>>>> fail fast, it's going to be super hard to process the good data.
>>>>>>>>>>>>>>>>> You would need to write an extra app that copies the data into a
>>>>>>>>>>>>>>>>> new topic, filtering out the bad records (or apply the map()
>>>>>>>>>>>>>>>>> workaround within Streams). So I don't think that failing fast
>>>>>>>>>>>>>>>>> is necessarily the best option in production.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Or do you think there are scenarios for which you can recover
>>>>>>>>>>>>>>>>> the corrupted records successfully? And even if this is
>>>>>>>>>>>>>>>>> possible, it might be a case for reprocessing instead of failing
>>>>>>>>>>>>>>>>> the whole application? Also, if you think you can "repair" a
>>>>>>>>>>>>>>>>> corrupted record, should the handler allow returning a "fixed"
>>>>>>>>>>>>>>>>> record? This would solve the ordering problem.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 5/30/17 1:47 AM, Michael Noll wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for your work on this KIP, Eno -- much appreciated!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - I think it would help to improve the KIP by adding an
>>>>>>>>>>>>>>>>>> end-to-end code example that demonstrates, with the DSL and
>>>>>>>>>>>>>>>>>> with the Processor API, how the user would write a simple
>>>>>>>>>>>>>>>>>> application that would then be augmented with the proposed KIP
>>>>>>>>>>>>>>>>>> changes to handle exceptions. It should also become much
>>>>>>>>>>>>>>>>>> clearer then that e.g. the KIP would lead to different code
>>>>>>>>>>>>>>>>>> paths for the happy case and any failure scenarios.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Do we have sufficient information available to make informed
>>>>>>>>>>>>>>>>>> decisions on what to do next? For example, do we know in which
>>>>>>>>>>>>>>>>>> part of the topology the record failed? `ConsumerRecord` gives
>>>>>>>>>>>>>>>>>> us access to topic, partition, offset, timestamp, etc., but
>>>>>>>>>>>>>>>>>> what about topology-related information (e.g. what is the
>>>>>>>>>>>>>>>>>> associated state store, if any)?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Only partly on-topic for the scope of this KIP, but this is
>>>>>>>>>>>>>>>>>> about the bigger picture: This KIP would give users the option
>>>>>>>>>>>>>>>>>> to send corrupted records to a dead letter queue (quarantine
>>>>>>>>>>>>>>>>>> topic). But what pattern would we advocate to process such a
>>>>>>>>>>>>>>>>>> dead letter queue then, e.g. how to allow for retries with
>>>>>>>>>>>>>>>>>> backoff ("If the first record in the dead letter queue fails
>>>>>>>>>>>>>>>>>> again, then try the second record for the time being and go
>>>>>>>>>>>>>>>>>> back to the first record at a later time")? Jay and Jan already
>>>>>>>>>>>>>>>>>> alluded to ordering problems that will be caused by dead letter
>>>>>>>>>>>>>>>>>> queues. As I said, retries might be out of scope but perhaps
>>>>>>>>>>>>>>>>>> the implications should be considered if possible?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Also, I wrote the text below before reaching the point in the
>>>>>>>>>>>>>>>>>> conversation that this KIP's scope will be limited to
>>>>>>>>>>>>>>>>>> exceptions in the category of poison pills / deserialization
>>>>>>>>>>>>>>>>>> errors. But since Jay brought up user code errors again, I
>>>>>>>>>>>>>>>>>> decided to include it again.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ----------------------------snip----------------------------
>>>>>>>>>>>>>>>>>> A meta comment: I am not sure about this split between the
>>>>>>>>>>>>>>>>>> code for the happy path (e.g. map/filter/... in the DSL) and
>>>>>>>>>>>>>>>>>> the failure path (using exception handlers). In Scala, for
>>>>>>>>>>>>>>>>>> example, we can do:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> scala> val computation = scala.util.Try(1 / 0)
>>>>>>>>>>>>>>>>>> computation: scala.util.Try[Int] =
>>>>>>>>>>>>>>>>>> Failure(java.lang.ArithmeticException: / by zero)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> scala> computation.getOrElse(42)
>>>>>>>>>>>>>>>>>> res2: Int = 42
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Another example with Scala's pattern matching, which is
>>>>>>>>>>>>>>>>>> similar to `KStream#branch()`:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> computation match {
>>>>>>>>>>>>>>>>>>   case scala.util.Success(x) => x * 5
>>>>>>>>>>>>>>>>>>   case scala.util.Failure(_) => 42
>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> (The above isn't the most idiomatic way to handle this in
>>>>>>>>>>>>>>>>>> Scala, but that's not the point I'm trying to make here.)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hence the question I'm raising here is: Do we want to have an
>>>>>>>>>>>>>>>>>> API where you code "the happy path", and then have a different
>>>>>>>>>>>>>>>>>> code path for failures (using exceptions and handlers); or
>>>>>>>>>>>>>>>>>> should we treat both Success and Failure in the same way?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I think the failure/exception handling approach (as proposed
>>>>>>>>>>>>>>>>>> in this KIP) is well-suited for errors in the category of
>>>>>>>>>>>>>>>>>> deserialization problems aka poison pills, partly because the
>>>>>>>>>>>>>>>>>> (default) serdes are defined through configuration (explicit
>>>>>>>>>>>>>>>>>> serdes, however, are defined through API calls).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> However, I'm not yet convinced that the failure/exception
>>>>>>>>>>>>>>>>>> handling approach is the best idea for user code exceptions,
>>>>>>>>>>>>>>>>>> e.g. if you fail to guard against NPE in your lambdas or divide
>>>>>>>>>>>>>>>>>> a number by zero.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> scala> val stream = Seq(1, 2, 3, 4, 5)
>>>>>>>>>>>>>>>>>> stream: Seq[Int] = List(1, 2, 3, 4, 5)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> // Here: Fallback to a sane default when encountering failed records
>>>>>>>>>>>>>>>>>> scala> stream.map(x => Try(1/(3 - x))).flatMap(t => Seq(t.getOrElse(42)))
>>>>>>>>>>>>>>>>>> res19: Seq[Int] = List(0, 1, 42, -1, 0)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> // Here: Skip over failed records
>>>>>>>>>>>>>>>>>> scala> stream.map(x => Try(1/(3 - x))).collect{ case Success(s) => s }
>>>>>>>>>>>>>>>>>> res20: Seq[Int] = List(0, 1, -1, 0)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The above is more natural to me than using error handlers to
>>>>>>>>>>>>>>>>>> define how to deal with failed records (here, the value `3`
>>>>>>>>>>>>>>>>>> causes an arithmetic exception). Again, it might help the KIP
>>>>>>>>>>>>>>>>>> if we added an end-to-end example for such user code errors.
>>>>>>>>>>>>>>>>>> ----------------------------snip----------------------------
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, May 30, 2017 at 9:24 AM, Jan Filipiak <jan.filip...@trivago.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Jay,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Eno mentioned that he will narrow down the scope to only
>>>>>>>>>>>>>>>>>>> ConsumerRecord deserialisation.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I am working with database changelogs only. I would really not
>>>>>>>>>>>>>>>>>>> like to see a dead letter queue or something similar. How am I
>>>>>>>>>>>>>>>>>>> expected to get these back in order? Just grind to a halt and
>>>>>>>>>>>>>>>>>>> call me on the weekend; I'll fix it then in a few minutes,
>>>>>>>>>>>>>>>>>>> rather than spending 2 weeks ordering dead letters (where
>>>>>>>>>>>>>>>>>>> reprocessing might even be the faster fix).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best Jan
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 29.05.2017 20:23, Jay Kreps wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>    - I think we should hold off on retries unless we have
>>>>>>>>>>>>>>>>>>>> worked out the full usage pattern; people can always
>>>>>>>>>>>>>>>>>>>> implement their own. I think the idea is that you send the
>>>>>>>>>>>>>>>>>>>> message to some kind of dead letter queue and then replay
>>>>>>>>>>>>>>>>>>>> these later. This obviously destroys all semantic guarantees
>>>>>>>>>>>>>>>>>>>> we are working hard to provide right now, which may be okay.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> -- Guozhang
>>>>
>>>>
>>>
>>>
>>> --
>>> -- Guozhang