exceptionfactory commented on PR #10053:
URL: https://github.com/apache/nifi/pull/10053#issuecomment-3079434387

   > > The ConsumeKafka Processor produces different output FlowFiles when records 
have different associated Schemas. However, instead of comparing whether a 
Record Schema is compatible with the Schema from the first Record, it groups 
output based on Record Schema equality. It still sends invalid records to a 
parse.failure Relationship, but it provides a logical grouping based on common 
Record Schema references. This fits the scenario of embedded schema references, 
and can also work for scenarios where schema inference produces different 
Record Schema results for different Records. This approach would also avoid 
some of the performance concerns related to evaluating a KinesisRecord multiple 
times, and avoids the need for changes to Record Writers.
   > 
   > The downside of such an approach is that offset tracking becomes impossible 
without per-row sequence/sub-sequence numbers. Was this considered in 
`ConsumeKafka`?
   
   Each Kafka `ConsumerRecord` has its own unique offset, so that works well in 
that scenario. Using the subsequence number for Kinesis sounds like it should 
provide the same level of tracking. I had initially raised a concern about 
adding the subsequence number in the context of addressing error handling, but 
considering the larger picture, including it to accomplish precise 
`KinesisRecord` tracking makes sense. With the subsequence number accounted 
for, do you see any other issues with Record Schema-based grouping?
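   To make the grouping idea concrete, here is a minimal sketch of Record Schema-based grouping with per-record tracking. The `Schema` and `ConsumedRecord` types below are hypothetical stand-ins (not NiFi's actual `RecordSchema` or Kinesis client classes); the point is only that grouping by schema equality preserves order within each group, and that each record carries its own sequence/sub-sequence numbers, so offset tracking survives the regrouping:

   ```java
   import java.util.ArrayList;
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;

   public class SchemaGroupingSketch {
       // Hypothetical stand-ins for a Record Schema and a consumed record.
       record Schema(String text) {}
       record ConsumedRecord(Schema schema, long sequenceNumber,
                             long subSequenceNumber, String payload) {}

       // Group records by schema equality, preserving arrival order within
       // each group. Records with equal schemas land in the same output
       // group; each record still carries its sequence/sub-sequence numbers.
       static Map<Schema, List<ConsumedRecord>> groupBySchema(List<ConsumedRecord> records) {
           Map<Schema, List<ConsumedRecord>> groups = new LinkedHashMap<>();
           for (ConsumedRecord r : records) {
               groups.computeIfAbsent(r.schema(), s -> new ArrayList<>()).add(r);
           }
           return groups;
       }

       public static void main(String[] args) {
           Schema a = new Schema("{\"fields\":[\"id\"]}");
           Schema b = new Schema("{\"fields\":[\"id\",\"name\"]}");
           List<ConsumedRecord> in = List.of(
               new ConsumedRecord(a, 100, 0, "r1"),
               new ConsumedRecord(b, 100, 1, "r2"),
               new ConsumedRecord(a, 101, 0, "r3"));
           Map<Schema, List<ConsumedRecord>> groups = groupBySchema(in);
           // Two output groups: the two records with equal schemas are together.
           System.out.println(groups.get(a).size()); // 2
           System.out.println(groups.get(b).size()); // 1
       }
   }
   ```

   Under this sketch, committing progress per group means walking each group's records and recording the highest (sequenceNumber, subSequenceNumber) pair seen, which is why including the subsequence number matters for precise tracking.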


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
