Hi Michael,

Your use case sounds much like "customized checkpointing" to me. We have similar cases at LinkedIn, and the following is the solution we run in production:

1) Disable Samza's automatic checkpointing by setting task.commit.ms to -1.
2) Explicitly call TaskCoordinator.commit() in sync with closing the transaction batch.
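The flush-decision side of step 2 can be sketched as a small self-contained class. This is illustrative, not Samza or Hive API code: `BatchCommitPolicy` and its method names are my own. In the real task, when the policy says to flush, the job would first commit and close the Hive TransactionBatch, and only then call TaskCoordinator.commit() (e.g. with RequestScope.CURRENT_TASK), so the Samza checkpoint never runs ahead of what hive has durably received.

```java
// Sketch of the "flush every N messages or every T ms" decision logic,
// decoupled from Samza and Hive. In the StreamTask, process() would call
// onMessage() per envelope and window() would call shouldFlush() on a timer;
// on true: txnBatch.commit(); txnBatch.close(); coordinator.commit(...);
// then reset() the policy.
public final class BatchCommitPolicy {
    private final long maxMessages;    // e.g. 20_000
    private final long maxIntervalMs;  // e.g. 10_000
    private long messagesSinceCommit = 0;
    private long lastCommitAtMs;

    public BatchCommitPolicy(long maxMessages, long maxIntervalMs, long nowMs) {
        this.maxMessages = maxMessages;
        this.maxIntervalMs = maxIntervalMs;
        this.lastCommitAtMs = nowMs;
    }

    /** Record one processed message; returns true when the batch should be flushed. */
    public boolean onMessage(long nowMs) {
        messagesSinceCommit++;
        return shouldFlush(nowMs);
    }

    /** True when either the size or the time threshold is reached. */
    public boolean shouldFlush(long nowMs) {
        return messagesSinceCommit >= maxMessages
            || (messagesSinceCommit > 0 && nowMs - lastCommitAtMs >= maxIntervalMs);
    }

    /** Call after the Hive batch is committed AND the Samza checkpoint is taken. */
    public void reset(long nowMs) {
        messagesSinceCommit = 0;
        lastCommitAtMs = nowMs;
    }
}
```

Because the Hive commit happens strictly before the checkpoint commit, a crash between the two only replays already-written messages, which matches the at-least-once behavior described below.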
The above procedure works well and gives the user the ability to control the checkpoint commit together with your transaction batch. In case of a system crash between the closing of the transaction batch and the checkpoint commit (I am assuming this sequence of actions), we would follow at-least-once semantics and replay the messages from the last commit.

Please let us know whether that satisfies your use case. Thanks!

-Yi

On Sun, Jan 17, 2016 at 11:09 AM, Michael Sklyar <mikesk...@gmail.com> wrote:

> Hi,
>
> We have a Samza job reading messages from Kafka and inserting them into
> hive via the Hive Streaming API. With Hive Streaming we are using a
> "TransactionBatch"; closing the transaction batch closes the file on HDFS.
> We close the transaction batch after reaching either a) a maximum number
> of messages per transaction batch or b) a time threshold (for example,
> every 20K messages or every 10 seconds).
>
> It works well, but if the job terminates in the middle of a transaction
> batch we will have data inconsistency in hive, either:
>
> 1. Duplication: Data that was already inserted into hive will be processed
> again (since the checkpoint was taken earlier than the latest message
> written to hive).
>
> 2. Missing data: Messages that were not committed to hive yet will not be
> reprocessed (since the checkpoint was written after they were read but
> before they were committed to hive).
>
> What would be the recommended method of synchronizing hive/hdfs insertion
> with Samza checkpointing? I am thinking of overriding the
> *KafkaCheckpointManager* & *KafkaCheckpointManagerFactory* and
> synchronizing checkpointing with committing the data to hive. Is it a
> good idea?
>
> Thanks in advance,
> Michael Sklyar