Hi NiFi team,

The ConsumeKafka
<https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-kafka-2-6-nar/stable/org.apache.nifi.processors.kafka.pubsub.ConsumeKafka_2_6/index.html>
processor currently offers at-least-once semantics through asynchronous
offset commits to Kafka, so data can be duplicated, for example if a
cluster restart occurs before the offsets are acknowledged. One way to
guarantee exactly-once semantics instead (assuming no data loss in NiFi
from hardware failures) could be to persist some processor state within
NiFi that tracks the current offsets. However, this state would have to be
updated atomically with the operations performed on the flowfile
repository, so the StateManager
<https://nifi.apache.org/nifi-docs/developer-guide.html#state_manager> is
unsuitable.

I was wondering whether it would be feasible to reliably retrieve such
state from the attributes of the latest flowfiles, presumably via the
provenance repository, which I assume is atomic with, or recovered from,
the flowfile repository? I am not familiar with Lucene; I may be looking
for something similar to
WriteAheadProvenanceRepository::getLatestCachedEvents
<https://github.com/apache/nifi/blob/main/nifi-framework-bundle/nifi-framework-extensions/nifi-provenance-repository-bundle/nifi-persistent-provenance-repository/src/main/java/org/apache/nifi/provenance/WriteAheadProvenanceRepository.java#L260>,
but one that relies on up-to-date persisted information rather than an
in-memory map that is lost upon cluster restart.
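
To make the recovery step concrete, here is a minimal sketch of what I have
in mind. It is plain Java with no NiFi or Kafka dependencies. It assumes the
attributes of the already-persisted flowfiles can somehow be obtained as
maps (that lookup is exactly the open question above), and that each
flowfile carries a single Kafka message with the kafka.topic,
kafka.partition and kafka.offset attributes that ConsumeKafka writes. The
OffsetRecovery class and resumePositions method are hypothetical names used
for illustration only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OffsetRecovery {

    /**
     * Hypothetical helper: given the attributes of flowfiles that NiFi has
     * already persisted, returns the next offset to read for each
     * "topic-partition" key, i.e. the highest persisted offset plus one.
     */
    public static Map<String, Long> resumePositions(List<Map<String, String>> flowFileAttributes) {
        Map<String, Long> next = new HashMap<>();
        for (Map<String, String> attrs : flowFileAttributes) {
            String topic = attrs.get("kafka.topic");
            String partition = attrs.get("kafka.partition");
            String offset = attrs.get("kafka.offset");
            if (topic == null || partition == null || offset == null) {
                continue; // flowfile did not originate from ConsumeKafka
            }
            String key = topic + "-" + partition;
            long candidate = Long.parseLong(offset) + 1;
            // Keep the largest resume position seen for this partition.
            next.merge(key, candidate, Long::max);
        }
        return next;
    }

    public static void main(String[] args) {
        // Made-up sample attributes, standing in for persisted flowfiles.
        List<Map<String, String>> persisted = List.of(
            Map.of("kafka.topic", "orders", "kafka.partition", "0", "kafka.offset", "41"),
            Map.of("kafka.topic", "orders", "kafka.partition", "0", "kafka.offset", "42"),
            Map.of("kafka.topic", "orders", "kafka.partition", "1", "kafka.offset", "7"));
        System.out.println(resumePositions(persisted));
    }
}
```

On restart, the processor would presumably seek each assigned partition to
the recovered position (for instance with KafkaConsumer#seek) instead of
relying on the offsets committed to Kafka. This only helps if the
attributes read back are guaranteed to reflect everything committed to the
flowfile repository.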

Thanks,
