exceptionfactory commented on PR #10053:
URL: https://github.com/apache/nifi/pull/10053#issuecomment-3079434387
> > The ConsumeKafka Processor produces different output FlowFiles when records have different associated Schemas. However, instead of comparing whether a Record Schema is compatible with the Schema from the first Record, it groups output based on Record Schema equality. It still sends invalid records to a parse.failure Relationship, but it provides a logical grouping based on common Record Schema references. This fits the scenario of embedded schema references, and can also work for scenarios where schema inference produces different Record Schema results for different Records. This approach would also avoid some of the performance concerns related to evaluating a KinesisRecord multiple times, and avoids the need for changes to Record Writers.
>
> The downside of such an approach is that offset tracking becomes impossible without per-row sequence/sub-sequence numbers. Was this considered in `ConsumeKafka`?

Each Kafka `ConsumerRecord` has its own unique offset, so that works well in that scenario. Using the subsequence number for Kinesis sounds like it should provide the same level of tracking. I had initially raised a concern about adding the subsequence number in the context of addressing error handling, but considering the larger picture, including it to accomplish precise `KinesisRecord` tracking makes sense. With the subsequence number accounted for, do you see any other issues with Record Schema-based grouping?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
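For illustration, the schema-equality grouping described above can be sketched roughly as follows. This is a minimal standalone sketch, not NiFi's actual implementation: the `KinesisRecord` type and its `schemaId` field here are hypothetical stand-ins for a record and its resolved Record Schema reference, and each map entry corresponds to one output FlowFile per distinct schema.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a record paired with its resolved schema reference.
record KinesisRecord(String schemaId, String payload) {}

public class SchemaGrouping {
    // Group records by schema equality: each distinct schema identifier gets its
    // own output group, analogous to writing one FlowFile per distinct Record Schema.
    // LinkedHashMap preserves the order in which schemas were first encountered.
    static Map<String, List<KinesisRecord>> groupBySchema(List<KinesisRecord> records) {
        Map<String, List<KinesisRecord>> groups = new LinkedHashMap<>();
        for (KinesisRecord r : records) {
            groups.computeIfAbsent(r.schemaId(), k -> new ArrayList<>()).add(r);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<KinesisRecord> records = List.of(
            new KinesisRecord("schema-A", "{\"id\":1}"),
            new KinesisRecord("schema-B", "{\"id\":2,\"name\":\"x\"}"),
            new KinesisRecord("schema-A", "{\"id\":3}")
        );
        Map<String, List<KinesisRecord>> groups = groupBySchema(records);
        System.out.println(groups.size());                      // prints 2
        System.out.println(groups.get("schema-A").size());      // prints 2
    }
}
```

Note that grouping by equality rather than compatibility means two records whose schemas differ only in field ordering or an optional field would land in separate groups; that is the trade-off accepted in exchange for evaluating each record only once.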
