[ 
https://issues.apache.org/jira/browse/KAFKA-15912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792933#comment-17792933
 ] 

Chris Egerton commented on KAFKA-15912:
---------------------------------------

+1 for the concerns about lack of thread safety in existing SMTs, and for not 
breaking stateful SMTs that rely on in-order record delivery.

I suppose we could still give each SMT/converter plugin (or a subset of them) a 
dedicated thread to work on. For example, in a source connector pipeline with 
two SMTs called "ValueToKey" and "ExtractField", and three converters for 
record keys, values, and headers, we could have something like this:

Thread 1: ValueToKey, ExtractField (in that order)
Thread 2: Header converter, key converter (in any order)
Thread 3: Value converter

Records would be delivered initially to the first thread, then passed to the 
second thread, then passed to the third, then back to the task thread (or, if 
we really want to get fancy, possibly dispatched directly to the producer).

This would allow up to three records to be processed at a time, though it would 
still be susceptible to hotspots (e.g., if there are no headers involved, the 
header converter step is basically a no-op, and traversing the entire record 
value for value conversion is likely to be the most CPU-intensive step). It's 
also unclear if this kind of limited parallelism would lead to much performance 
improvement on workers running multiple tasks; my suspicion is that the CPU 
would be pretty well-saturated on many of these already.
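
The staged hand-off described above could be sketched roughly as follows. This is only an illustration, not Connect code: the class name, the process() method, and the string-tagging stand-ins for the SMTs and converters are all hypothetical, and records are modelled as plain strings. Each stage gets its own single-threaded executor, so a plugin is never called from two threads at once, and records should reach each stage in the order they were submitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these names are part of the Kafka Connect API.
class StagedPipelineSketch {

    // Stand-ins for the real plugins; each one only tags the record so that
    // the order in which the stages were applied is visible in the output.
    static final UnaryOperator<String> TRANSFORMS = r -> r + "|ValueToKey|ExtractField";
    static final UnaryOperator<String> HEADER_AND_KEY = r -> r + "|headerConv|keyConv";
    static final UnaryOperator<String> VALUE = r -> r + "|valueConv";

    static List<String> process(List<String> batch) {
        // One single-threaded executor per stage: a stage never runs
        // concurrently with itself, but different stages can overlap.
        ExecutorService thread1 = Executors.newSingleThreadExecutor();
        ExecutorService thread2 = Executors.newSingleThreadExecutor();
        ExecutorService thread3 = Executors.newSingleThreadExecutor();
        try {
            List<CompletableFuture<String>> inFlight = new ArrayList<>();
            for (String rec : batch) {
                // Thread 1 -> thread 2 -> thread 3, one hop per stage.
                inFlight.add(CompletableFuture
                        .supplyAsync(() -> TRANSFORMS.apply(rec), thread1)
                        .thenApplyAsync(HEADER_AND_KEY, thread2)
                        .thenApplyAsync(VALUE, thread3));
            }
            // Back on the calling ("task") thread: collect results in batch order.
            List<String> out = new ArrayList<>();
            for (CompletableFuture<String> f : inFlight) {
                out.add(f.join());
            }
            return out;
        } finally {
            thread1.shutdown();
            thread2.shutdown();
            thread3.shutdown();
        }
    }

    public static void main(String[] args) {
        process(List.of("r1", "r2", "r3")).forEach(System.out::println);
    }
}
```

The hotspot concern shows up directly in a layout like this: steady-state throughput is bounded by the slowest stage. With made-up per-record costs of 1 ms, 1 ms, and 8 ms for the three threads, the pipeline would still need about 8 ms per record, compared with 10 ms on a single thread.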

> Parallelize conversion and transformation steps in Connect
> ----------------------------------------------------------
>
>                 Key: KAFKA-15912
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15912
>             Project: Kafka
>          Issue Type: Improvement
>          Components: connect
>            Reporter: Mickael Maison
>            Priority: Major
>
> In busy Connect pipelines, the conversion and transformation steps can 
> sometimes have a very significant impact on performance. This is especially 
> true with large records with complex schemas, for example with CDC connectors 
> like Debezium.
> Today, in order to always preserve ordering, converters and transformations 
> are called on one record at a time in a single thread in the Connect worker. 
> Connect usually handles records in batches: up to max.poll.records in sink 
> pipelines, and while it depends on the connector for source pipelines, most 
> connectors I've seen still tend to return multiple records each loop. It 
> could therefore be highly beneficial to run the converters and 
> transformation chain in parallel on a pool of processing threads.
> It should be possible to do some of these steps in parallel and still keep 
> exact ordering. I'm even considering whether an option to lose ordering but 
> allow even faster processing would make sense.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
