Moved this KIP into status "inactive". Feel free to resume it at any time.
-Matthias

On 1/4/19 3:42 PM, Richard Yu wrote:
> Hi all,
>
> Just want to hear some opinions on this KIP from the PMCs. It would be nice
> if we got input from them. Don't want to drag this KIP out for too long! :)
>
> Hope we get some input :)
>
> Thanks,
> Richard
>
> On Thu, Jan 3, 2019 at 8:26 PM Richard Yu <[email protected]> wrote:
>
>> Hi Boyang,
>>
>> Interesting article. Although something crossed my mind: when skipping bad
>> records, we couldn't go back to them to process again to guarantee ordering
>> (i.e. both exactly-once and at-least-once would not be supported, only
>> at-most-once). Also, in Kafka, when it comes to individually acking every
>> single record, the resulting latency is horrible (from what I heard). We
>> actually discussed something like this in
>> https://issues.apache.org/jira/browse/KAFKA-7432. It might give you some
>> insight since it is a related issue.
>>
>> I hope this helps,
>> Richard
>>
>> On Thu, Jan 3, 2019 at 7:29 PM Boyang Chen <[email protected]> wrote:
>>
>>> Hey Richard,
>>>
>>> Thanks for the explanation. Recently I read an interesting blog post
>>> <https://streaml.io/blog/pulsar-streaming-queuing> from Apache Pulsar
>>> (written a long time ago), where they define the concept of an individual
>>> ack, which means we could skip records and leave certain records on the
>>> queue for later processing. This should be something similar to KIP-408,
>>> which also shares some of the motivations for us to invest in this.
>>>
>>> Boyang
>>>
>>> ________________________________
>>> From: Richard Yu <[email protected]>
>>> Sent: Friday, January 4, 2019 5:42 AM
>>> To: [email protected]
>>> Subject: Re: [DISCUSS] KIP-408: Add Asynchronous Processing to Kafka
>>> Streams
>>>
>>> Hi all,
>>>
>>> Just bumping this KIP. Would be great if we got some discussion.
>>>
>>> On Sun, Dec 30, 2018 at 5:13 PM Richard Yu <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I made some recent changes to the KIP.
>>>> It should be more relevant to the issue now (it involves the Processor
>>>> API in detail). It would be great if you could comment.
>>>>
>>>> Thanks,
>>>> Richard
>>>>
>>>> On Wed, Dec 26, 2018 at 10:01 PM Richard Yu <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Just changing the title of the KIP. Discovered it wasn't right.
>>>>> That's about it. :)
>>>>>
>>>>> On Mon, Dec 24, 2018 at 7:57 PM Richard Yu <[email protected]> wrote:
>>>>>
>>>>>> Sorry, just making a correction.
>>>>>>
>>>>>> Even if we are processing records out of order, we will still have to
>>>>>> checkpoint offset ranges. So it doesn't really change anything even if
>>>>>> we are doing in-order processing.
>>>>>>
>>>>>> Thinking this over, I'm leaning slightly towards maintaining the
>>>>>> ordering guarantee. Although when implementing this change, there might
>>>>>> be some kinks that we have not thought about which could throw a monkey
>>>>>> wrench into the works.
>>>>>>
>>>>>> But definitely worth trying out,
>>>>>> Richard
>>>>>>
>>>>>> On Mon, Dec 24, 2018 at 6:51 PM Richard Yu <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Boyang,
>>>>>>>
>>>>>>> I could see where you are going with this. Well, I suppose I should
>>>>>>> have added this to the alternatives, but I might as well mention it now.
>>>>>>>
>>>>>>> It had crossed my mind that we consider returning results in order even
>>>>>>> if there are multiple threads processing the same task. But for this to
>>>>>>> happen, we must block for the offsets in between which have not been
>>>>>>> processed yet. For example, offsets 1-50 are being processed by thread1,
>>>>>>> while offsets 51-100 are being processed by thread2. We will have to
>>>>>>> wait for thread1 to finish processing its offsets first before we return
>>>>>>> the records processed by thread2.
>>>>>>> So in other words, once thread1 is done, thread2's work up to that
>>>>>>> point will be returned in one go, but not before that.
>>>>>>>
>>>>>>> I suppose this could work, but the client will have to wait some time
>>>>>>> before the advantages of multithreaded processing can be seen (i.e. the
>>>>>>> first thread has to finish processing its segment of the records first
>>>>>>> before any others are returned, to guarantee ordering). Another point I
>>>>>>> would like to make is that the threads are *asynchronous*. So for us
>>>>>>> to know when a thread is done processing a certain segment, we will
>>>>>>> probably have a similar policy to how getMetadataAsync() works (i.e.
>>>>>>> have a parent thread be notified of when the child threads are done).
>>>>>>> [image: image.png]
>>>>>>> Just pulling this from the KIP. But instead, we would apply this to
>>>>>>> metadata segments instead of just a callback.
>>>>>>>
>>>>>>> I don't know whether or not the tradeoffs are acceptable to the client.
>>>>>>> Ordering could be guaranteed, but it would be hard to do. For example,
>>>>>>> if there was a crash, we might lose track of which offset numbers and
>>>>>>> ranges we are processing for each child thread, so somehow we need to
>>>>>>> find a way to checkpoint those as well (like committing them to a Kafka
>>>>>>> topic).
>>>>>>>
>>>>>>> Let me know your thoughts on this approach. It would work, but the
>>>>>>> implementation details could be a mess.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Richard
>>>>>>>
>>>>>>> On Mon, Dec 24, 2018 at 4:59 PM Boyang Chen <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hey Richard,
>>>>>>>>
>>>>>>>> Thanks for the explanation! After some thinking, I do understand more
>>>>>>>> about this KIP. The motivation was to increase throughput and put
>>>>>>>> heavy-lifting RPC calls or IO operations in the background.
>>>>>>>> While I feel the ordering is hard to guarantee for an async task, it
>>>>>>>> is better for it to be configurable for the end users.
>>>>>>>>
>>>>>>>> An example use case I could think of is: for every 500 records
>>>>>>>> processed, we need an RPC to external storage that takes non-trivial
>>>>>>>> time, and before it finishes, the 499 records before it shouldn't be
>>>>>>>> visible to the end user. In such a case, we need fine-grained control
>>>>>>>> over visibility to the downstream consumer, so that our async task
>>>>>>>> plants a barrier while still letting the 499 records be processed
>>>>>>>> without blocking and sent downstream. So eventually, when the heavy
>>>>>>>> RPC is done, we commit this record to remove the barrier and make all
>>>>>>>> 500 records available downstream. So here we still need to guarantee
>>>>>>>> ordering within the 500 records, while at the same time the consumer
>>>>>>>> semantics don't change.
>>>>>>>>
>>>>>>>> Am I making the point clear here? I just want to have more discussion
>>>>>>>> on the ordering guarantee, since I feel it wouldn't be a good idea to
>>>>>>>> break the consumer ordering guarantee by default.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Boyang
>>>>>>>>
>>>>>>>> ________________________________
>>>>>>>> From: Richard Yu <[email protected]>
>>>>>>>> Sent: Saturday, December 22, 2018 9:08 AM
>>>>>>>> To: [email protected]
>>>>>>>> Subject: Re: KIP-408: Add Asynchronous Processing to Kafka Streams
>>>>>>>>
>>>>>>>> Hi Boyang,
>>>>>>>>
>>>>>>>> Thanks for pointing out the possibility of skipping bad records (it
>>>>>>>> never crossed my mind). I suppose we could make it an option for the
>>>>>>>> user to skip a bad record. Deciding whether or not to do that was never
>>>>>>>> the intention of this KIP, though. I could log a JIRA on such an issue,
>>>>>>>> but I think this is out of the KIP's scope.
>>>>>>>> As for the ordering guarantees: if you are using the standard Kafka
>>>>>>>> design of one thread per task, then everything will pretty much remain
>>>>>>>> the same. However, if we are talking about using multiple threads per
>>>>>>>> task (which is something that this KIP proposes), then we should
>>>>>>>> probably expect the behavior to be somewhat similar to Samza's Async
>>>>>>>> Task, as stated in the JIRA for this KIP (second-to-last comment).
>>>>>>>> Ordering would no longer be possible (so yeah, basically no guarantee
>>>>>>>> at all).
>>>>>>>>
>>>>>>>> And how the user handles out-of-order messages is not something I'm
>>>>>>>> well versed in. I guess they can try to put the messages back in order
>>>>>>>> some time later on. But I honestly don't know what they will do.
>>>>>>>> It would be good if you could give me some insight into this.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Richard
>>>>>>>>
>>>>>>>> On Fri, Dec 21, 2018 at 4:24 PM Boyang Chen <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks Richard for proposing this feature! We have also encountered a
>>>>>>>>> similar feature request, where we want to define a generic async
>>>>>>>>> processing API <https://issues.apache.org/jira/browse/KAFKA-7566>.
>>>>>>>>>
>>>>>>>>> However, I guess the motivation here is that we should skip big
>>>>>>>>> records during normal processing, or let a separate task handle those
>>>>>>>>> records that take P99 processing time. My feeling is that if some edge
>>>>>>>>> case happens, could we just skip the bad record and continue
>>>>>>>>> processing the next record?
>>>>>>>>>
>>>>>>>>> Also, I want to understand what kind of ordering guarantee we are
>>>>>>>>> going to provide with this new API, or whether there is no ordering
>>>>>>>>> guarantee at all?
>>>>>>>>> Could we discuss any potential issues if the consumer needs to
>>>>>>>>> process out-of-order messages?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Boyang
>>>>>>>>>
>>>>>>>>> ________________________________
>>>>>>>>> From: Richard Yu <[email protected]>
>>>>>>>>> Sent: Saturday, December 22, 2018 2:00 AM
>>>>>>>>> To: [email protected]
>>>>>>>>> Subject: KIP-408: Add Asynchronous Processing to Kafka Streams
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> Lately, there has been considerable interest in adding asynchronous
>>>>>>>>> processing to Kafka Streams.
>>>>>>>>> Here is the KIP for such an addition:
>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-408%3A+Add+Asynchronous+Processing+To+Kafka+Streams
>>>>>>>>>
>>>>>>>>> I wish to discuss the best ways to approach this problem.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Richard Yu
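The in-order release scheme Richard describes in the thread (thread1's
offset range must finish before thread2's already-completed work is
emitted) can be illustrated roughly as follows. This is a hypothetical
sketch, not code from the KIP; the class and method names are invented,
and CompletableFuture stands in for the per-range worker threads:

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch: two async workers process disjoint offset ranges,
// but results are only released in offset order, as Richard describes.
public class OrderedReleaseDemo {
    static List<String> processInOrder() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // thread1 handles offsets 1-50 and is slow.
        CompletableFuture<String> range1 = CompletableFuture.supplyAsync(() -> {
            sleepQuietly(200);
            return "offsets 1-50";
        }, pool);
        // thread2 handles offsets 51-100 and finishes first.
        CompletableFuture<String> range2 = CompletableFuture.supplyAsync(() -> {
            sleepQuietly(50);
            return "offsets 51-100";
        }, pool);

        // Release results strictly in offset order: block on range1 first, so
        // range2's already-finished work is only emitted once range1 is done.
        List<String> emitted = new ArrayList<>();
        for (CompletableFuture<String> range : List.of(range1, range2)) {
            emitted.add(range.join());
        }
        pool.shutdown();
        return emitted;
    }

    static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) {
        System.out.println(processInOrder());
    }
}
```

The cost Richard points out is visible here: range2 completes well before
range1, but its result sits unreleased until range1's join() returns, so
the latency benefit of the second thread only shows up after the first
range drains.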
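Boyang's 500-record barrier example can be sketched in the same spirit.
Again, this is an illustrative sketch with invented names, not KIP code:
the 499 cheap records are processed without blocking one another, while
the committed (downstream-visible) offset only advances once the heavy
RPC attached to record 500 completes:

```java
import java.util.concurrent.*;

// Hypothetical sketch of the barrier idea: cheap records run freely,
// but the commit is gated on the slow RPC that acts as the barrier.
public class BarrierDemo {
    static long commitBatch() {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            CountDownLatch lightWork = new CountDownLatch(499);
            for (int offset = 1; offset <= 499; offset++) {
                pool.execute(lightWork::countDown); // cheap per-record processing
            }
            // Record 500 triggers the expensive RPC; this future is the barrier.
            CompletableFuture<Void> barrier = CompletableFuture.runAsync(() -> {
                try { Thread.sleep(100); }          // stand-in for the slow RPC
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }, pool);

            try { lightWork.await(); }              // 499 records finish independently
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            barrier.join();                         // commit waits for the barrier
            return 500;                             // all 500 records now visible
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("committed offset: " + commitBatch());
    }
}
```

Note how this preserves ordering within the batch (nothing past the
barrier is committed early) without changing consumer semantics, which is
the property Boyang asks to keep configurable.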