Hi,

We have a Samza job that reads messages from Kafka and inserts them into Hive via the Hive Streaming API. With Hive Streaming we use a "TransactionBatch"; closing the transaction batch closes the file on HDFS. We close the transaction batch after reaching either
a. a maximum number of messages per transaction batch, or
b. a time threshold
(for example, every 20K messages or every 10 seconds). A rough sketch of this batching logic is below.
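For reference, this is roughly what our writer looks like (simplified for illustration - the class name, field names, thresholds and the flushIfNeeded() helper are just placeholders, not our exact code):

import java.util.List;
import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

public class HiveStreamingWriter {
  private static final int MAX_MESSAGES_PER_BATCH = 20_000;  // threshold a.
  private static final long MAX_BATCH_AGE_MS = 10_000L;      // threshold b.

  private final StreamingConnection connection;
  private final DelimitedInputWriter writer;
  private TransactionBatch txnBatch;
  private int messagesInBatch = 0;
  private long batchStartMs = System.currentTimeMillis();

  public HiveStreamingWriter(String metastoreUri, String db, String table,
                             List<String> partitionVals, String[] fieldNames) throws Exception {
    HiveEndPoint endPoint = new HiveEndPoint(metastoreUri, db, table, partitionVals);
    this.writer = new DelimitedInputWriter(fieldNames, ",", endPoint);
    this.connection = endPoint.newConnection(true);
  }

  // Called from StreamTask.process() for every Kafka message.
  public void write(byte[] record) throws Exception {
    if (txnBatch == null) {
      txnBatch = connection.fetchTransactionBatch(1, writer);
      txnBatch.beginNextTransaction();
      messagesInBatch = 0;
      batchStartMs = System.currentTimeMillis();
    }
    txnBatch.write(record);
    messagesInBatch++;
    flushIfNeeded();
  }

  // Commits and closes the current transaction batch once either threshold is hit.
  public void flushIfNeeded() throws Exception {
    boolean tooMany = messagesInBatch >= MAX_MESSAGES_PER_BATCH;
    boolean tooOld = System.currentTimeMillis() - batchStartMs >= MAX_BATCH_AGE_MS;
    if (txnBatch != null && (tooMany || tooOld)) {
      txnBatch.commit();  // data becomes visible in Hive
      txnBatch.close();   // closes the delta file on HDFS
      txnBatch = null;
    }
  }
}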
This works well, but if the job terminates in the middle of a transaction batch we end up with inconsistent data in Hive, either:
1. Duplication: data that was already inserted into Hive is processed again (the checkpoint was taken earlier than the latest message written to Hive).
2. Missing data: messages that were not yet committed to Hive are never reprocessed (the checkpoint was written after those messages were read, but before they were committed to Hive).

What would be the recommended way to synchronize Hive/HDFS insertion with Samza checkpointing? I am thinking of overriding the *KafkaCheckpointManager* & *KafkaCheckpointManagerFactory* and synchronizing check-pointing with committing the data to Hive. Is this a good idea?

Thanks in advance,
Michael Sklyar
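P.S. To make the idea a bit more concrete, this is roughly the kind of wrapper I have in mind - only a sketch, not working code. HiveWriterRegistry is a hypothetical hook that commits and closes any open transaction batches, the delegate would be whatever KafkaCheckpointManagerFactory returns, and depending on the Samza version there may be one or two more CheckpointManager methods that would simply delegate as well:

import org.apache.samza.checkpoint.Checkpoint;
import org.apache.samza.checkpoint.CheckpointManager;
import org.apache.samza.container.TaskName;

// Hypothetical hook into the task's Hive writer(s).
interface HiveWriterRegistry {
  void flush(TaskName taskName);  // commit + close the open TransactionBatch for this task
}

// Wraps the Kafka-backed CheckpointManager so that a checkpoint is only written
// after the open Hive transaction batch has been committed.
public class HiveSyncedCheckpointManager implements CheckpointManager {
  private final CheckpointManager delegate;      // e.g. the manager from KafkaCheckpointManagerFactory
  private final HiveWriterRegistry hiveWriters;

  public HiveSyncedCheckpointManager(CheckpointManager delegate, HiveWriterRegistry hiveWriters) {
    this.delegate = delegate;
    this.hiveWriters = hiveWriters;
  }

  @Override
  public void writeCheckpoint(TaskName taskName, Checkpoint checkpoint) {
    // Commit + close the open transaction batch first, so everything covered by
    // these offsets is durable in Hive before the checkpoint is persisted.
    hiveWriters.flush(taskName);
    delegate.writeCheckpoint(taskName, checkpoint);
  }

  @Override
  public Checkpoint readLastCheckpoint(TaskName taskName) {
    return delegate.readLastCheckpoint(taskName);
  }

  @Override public void start() { delegate.start(); }
  @Override public void register(TaskName taskName) { delegate.register(taskName); }
  @Override public void stop() { delegate.stop(); }
}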