Hi Yi! I just figured it out yesterday and was about to send an update :) Yes, it covers our use case perfectly.
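
For the archive, the change on our side boiled down to the two steps Yi lists
below. A rough sketch of the shape of it (the class name, threshold and the
Hive write itself are illustrative placeholders, not our actual code):

# job config: disable Samza's periodic auto-checkpointing
task.commit.ms=-1

// Checkpoint only when a Hive transaction batch is closed.
import org.apache.hive.hcatalog.streaming.TransactionBatch;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class HiveStreamingTask implements StreamTask {
  private static final int MAX_MESSAGES_PER_BATCH = 20000; // illustrative threshold
  private TransactionBatch txnBatch; // obtained via StreamingConnection.fetchTransactionBatch(...), not shown
  private int messagesInBatch = 0;

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                      TaskCoordinator coordinator) throws Exception {
    // assumes a String serde for the incoming message
    txnBatch.write(((String) envelope.getMessage()).getBytes("UTF-8"));
    messagesInBatch++;

    if (messagesInBatch >= MAX_MESSAGES_PER_BATCH) {
      // First close out the Hive transaction batch, then checkpoint the Kafka offsets.
      // A crash between the two replays messages from the last checkpoint (at-least-once),
      // so duplicates are possible but nothing goes missing.
      txnBatch.commit();
      txnBatch.close();
      coordinator.commit(TaskCoordinator.RequestScope.CURRENT_TASK);
      txnBatch = null; // a new batch is fetched and beginNextTransaction() called before the next write
      messagesInBatch = 0;
    }
  }
}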
Thanks,
Michael

On Tue, Jan 19, 2016 at 5:15 AM, Yi Pan <nickpa...@gmail.com> wrote:

> Hi, Michael,
>
> Your use case sounds much like "customized checkpointing" to me. We have
> similar cases at LinkedIn, and the following is the solution in production:
> 1) disable Samza auto-checkpointing by setting the commit_ms to -1
> 2) explicitly call TaskCoordinator.commit() in sync with closing the
> transaction batch
>
> The above procedure works well and gives the user the ability to control
> the commit of the checkpoint together with your transaction batch. In case
> of a system crash between the closing of the transaction batch and the
> checkpoint commit (I am assuming this sequence of actions), we would
> follow at-least-once semantics and replay the messages from the last
> commit.
>
> Please let us know whether that satisfies your use case.
>
> Thanks!
>
> -Yi
>
> On Sun, Jan 17, 2016 at 11:09 AM, Michael Sklyar <mikesk...@gmail.com>
> wrote:
>
> > Hi,
> >
> > We have a Samza job reading messages from Kafka and inserting into Hive
> > via the Hive Streaming API. With Hive Streaming we are using a
> > "TransactionBatch"; closing the transaction batch closes the file on
> > HDFS. We close the transaction batch after reaching either a) a maximum
> > number of messages per transaction batch or b) a time threshold (for
> > example, every 20K messages or every 10 seconds).
> >
> > It works well, but in case the job terminates in the middle of a
> > transaction batch we will have data inconsistency in Hive, either:
> >
> > 1. Duplication: data that was already inserted into Hive will be
> > processed again (since the checkpoint was taken earlier than the latest
> > message written to Hive).
> >
> > 2. Missing data: messages that were not yet committed to Hive will not
> > be reprocessed (since the checkpoint was written after they were read
> > from Kafka).
> >
> > What would be the recommended method of synchronizing Hive/HDFS
> > insertion with Samza checkpointing? I am thinking of overriding the
> > *KafkaCheckpointManager* & *KafkaCheckpointManagerFactory* and
> > synchronizing checkpointing with committing the data to Hive. Is it a
> > good idea?
> >
> > Thanks in advance,
> > Michael Sklyar
> >
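
For anyone finding this thread in the archives: the Hive Streaming lifecycle
referred to in my original mail above is roughly the standard HCatalog
streaming sequence below (a sketch only; the metastore URI, table, columns,
delimiter and record are placeholders, not our actual setup):

import java.util.Arrays;
import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

public class HiveStreamingLifecycle {
  public static void main(String[] args) throws Exception {
    HiveEndPoint endPoint = new HiveEndPoint(
        "thrift://metastore-host:9083", "mydb", "events", Arrays.asList("2016-01-19"));
    StreamingConnection connection = endPoint.newConnection(true);
    DelimitedInputWriter writer =
        new DelimitedInputWriter(new String[]{"id", "payload"}, ",", endPoint);

    // A TransactionBatch groups several transactions into a single delta file on HDFS;
    // closing the batch is what closes that file.
    TransactionBatch txnBatch = connection.fetchTransactionBatch(10, writer);
    while (txnBatch.remainingTransactions() > 0) {
      txnBatch.beginNextTransaction();
      txnBatch.write("1,hello".getBytes("UTF-8"));
      txnBatch.commit();
    }
    txnBatch.close(); // file closed on HDFS; this is the point to checkpoint from Samza
    connection.close();
  }
}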