dariuszseweryn commented on PR #10053: URL: https://github.com/apache/nifi/pull/10053#issuecomment-3108084521
There are two aspects that make me think:

- flows that assume sequential FlowFile contents
- humans auditing produced FlowFile contents for completeness

# Flows assuming sequential FlowFile contents

All produced FlowFiles carry the `aws.kinesis.sequence.number` of their last record. Given that a flow uses FIFO or `aws.kinesis.sequence.number` as a prioritizer, subsequent FlowFiles for the same shard contained non-overlapping ranges of subsequent records under normal circumstances. Using grouping by default could break this assumption. To be backwards-compatible in this regard, the default strategy should be to close one FlowFile and start another when the schema changes, with grouping available as an option.

# Humans auditing produced FlowFile contents for completeness

I am considering the auditability of processing completeness, since this processor does not work well with the Stateless Engine, i.e. it does not support Exactly Once semantics. Not all users will use the wrapping mechanism, so having a way to determine/audit from the processed FlowFiles whether all records were processed successfully could be useful.

Up till now, FlowFiles contained sequential records, minus those that could not be parsed and were routed to the Parsing Failure relationship. An incremental check of sequential FlowFiles, counting the records between sequence numbers, was enough to verify completeness. With grouping it is harder to reason about whether all records were processed just by looking at subsequent FlowFiles' attributes: one would need to match the schema to find sequential FlowFiles, and there is no guarantee how far back one would need to look for the previous FlowFile in sequence for a given schema.

Easy ways to verify with grouping:

1. in wrapper mode: this is less of a problem, as we have sequence/subsequence information on the processed records themselves
2. in non-wrapping mode: apart from having the sequence/subsequence number of the last record on each FlowFile, every FlowFile created from a single batch should carry some identification of the batch it came from — the sequence/subsequence number of the first record in the batch and the count of FlowFiles produced from the batch. This would allow for easy grouping of FlowFiles produced by a batch and counting the messages processed.
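For illustration, the incremental check on sequential (non-grouped) output mentioned above could look roughly like the sketch below. It only verifies the ordering/non-overlap assumption per shard; the FlowFile dicts are stand-ins for real FlowFile attributes, and the `aws.kinesis.shard.id` attribute name is an assumption on my part.

```python
# Illustrative audit of sequential (non-grouped) output: for each shard,
# FlowFiles should arrive with strictly increasing last-record sequence
# numbers, i.e. non-overlapping ranges. The dicts below are stand-ins for
# real FlowFile attributes; "aws.kinesis.shard.id" is an assumed name.
from collections import defaultdict

def check_sequential(flowfiles):
    """Return True if, per shard, aws.kinesis.sequence.number strictly increases."""
    last_seen = defaultdict(lambda: -1)
    for ff in flowfiles:
        shard = ff["aws.kinesis.shard.id"]
        seq = int(ff["aws.kinesis.sequence.number"])
        if seq <= last_seen[shard]:
            # Overlap or reordering: the sequential-contents assumption is broken.
            return False
        last_seen[shard] = seq
    return True

in_order = [
    {"aws.kinesis.shard.id": "shardId-000000000000", "aws.kinesis.sequence.number": "100"},
    {"aws.kinesis.shard.id": "shardId-000000000000", "aws.kinesis.sequence.number": "250"},
]
print(check_sequential(in_order))  # True for this in-order sample
```

With grouping enabled, such a shard-level check no longer suffices on its own, which is what motivates the batch identification proposed above.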

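A rough sketch of the audit that the proposed batch attributes would enable, for the non-wrapping case: group FlowFiles by the batch's first sequence number and check that each batch produced as many FlowFiles as it claims. The attribute names `aws.kinesis.batch.start.sequence.number` and `aws.kinesis.batch.flowfile.count` are hypothetical, invented here purely to illustrate the idea.

```python
# Hypothetical batch-completeness audit based on the proposed attributes.
# Attribute names are invented for illustration; the dicts stand in for
# real FlowFile attributes.
from collections import defaultdict

def audit_batches(flowfiles):
    """Group FlowFiles by batch start sequence number and report, per batch,
    whether the number of FlowFiles seen matches the advertised count."""
    groups = defaultdict(list)
    for ff in flowfiles:
        groups[ff["aws.kinesis.batch.start.sequence.number"]].append(ff)
    return {
        start: len(ffs) == int(ffs[0]["aws.kinesis.batch.flowfile.count"])
        for start, ffs in groups.items()
    }

sample = [
    {"aws.kinesis.batch.start.sequence.number": "100", "aws.kinesis.batch.flowfile.count": "2"},
    {"aws.kinesis.batch.start.sequence.number": "100", "aws.kinesis.batch.flowfile.count": "2"},
    {"aws.kinesis.batch.start.sequence.number": "300", "aws.kinesis.batch.flowfile.count": "2"},
]
print(audit_batches(sample))  # {'100': True, '300': False}
```

An incomplete batch (here, batch "300") is immediately visible without having to match schemas across an unbounded window of preceding FlowFiles.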