Hi,

We have a Samza job that reads messages from Kafka and inserts them into Hive via the Hive Streaming API. With Hive Streaming we use a "TransactionBatch"; closing the transaction batch closes the file on HDFS. We close the transaction batch after reaching either
a. a maximum number of messages per transaction batch, or
b. a time threshold
(for example, every 20K messages or every 10 seconds). A rough sketch of this batching logic is below.
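For reference, this is roughly what our writer looks like (simplified for illustration - the class name, field names, thresholds and the flushIfNeeded() helper are just placeholders, not our exact code):

import java.util.List;
import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

public class HiveStreamingWriter {
  private static final int MAX_MESSAGES_PER_BATCH = 20_000;  // threshold a.
  private static final long MAX_BATCH_AGE_MS = 10_000L;      // threshold b.

  private final StreamingConnection connection;
  private final DelimitedInputWriter writer;
  private TransactionBatch txnBatch;
  private int messagesInBatch = 0;
  private long batchStartMs = System.currentTimeMillis();

  public HiveStreamingWriter(String metastoreUri, String db, String table,
                             List<String> partitionVals, String[] fieldNames) throws Exception {
    HiveEndPoint endPoint = new HiveEndPoint(metastoreUri, db, table, partitionVals);
    this.writer = new DelimitedInputWriter(fieldNames, ",", endPoint);
    this.connection = endPoint.newConnection(true);
  }

  // Called from StreamTask.process() for every Kafka message.
  public void write(byte[] record) throws Exception {
    if (txnBatch == null) {
      txnBatch = connection.fetchTransactionBatch(1, writer);
      txnBatch.beginNextTransaction();
      messagesInBatch = 0;
      batchStartMs = System.currentTimeMillis();
    }
    txnBatch.write(record);
    messagesInBatch++;
    flushIfNeeded();
  }

  // Commits and closes the current transaction batch once either threshold is hit.
  public void flushIfNeeded() throws Exception {
    boolean tooMany = messagesInBatch >= MAX_MESSAGES_PER_BATCH;
    boolean tooOld = System.currentTimeMillis() - batchStartMs >= MAX_BATCH_AGE_MS;
    if (txnBatch != null && (tooMany || tooOld)) {
      txnBatch.commit();  // data becomes visible in Hive
      txnBatch.close();   // closes the delta file on HDFS
      txnBatch = null;
    }
  }
}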
This works well, but if the job terminates in the middle of a transaction batch we end up with inconsistent data in Hive, either:
1. Duplication: data that was already inserted into Hive is processed again (the checkpoint was taken earlier than the latest message written to Hive).
2. Missing data: messages that were not yet committed to Hive are never reprocessed (the checkpoint was written after those messages were read, but before they were committed to Hive).

What would be the recommended way to synchronize Hive/HDFS insertion with Samza checkpointing? I am thinking of overriding the *KafkaCheckpointManager* & *KafkaCheckpointManagerFactory* and synchronizing check-pointing with committing the data to Hive. Is this a good idea?

Thanks in advance,
Michael Sklyar
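P.S. To make the idea a bit more concrete, this is roughly the kind of wrapper I have in mind - only a sketch, not working code. HiveWriterRegistry is a hypothetical hook that commits and closes any open transaction batches, the delegate would be whatever KafkaCheckpointManagerFactory returns, and depending on the Samza version there may be one or two more CheckpointManager methods that would simply delegate as well:

import org.apache.samza.checkpoint.Checkpoint;
import org.apache.samza.checkpoint.CheckpointManager;
import org.apache.samza.container.TaskName;

// Hypothetical hook into the task's Hive writer(s).
interface HiveWriterRegistry {
  void flush(TaskName taskName);  // commit + close the open TransactionBatch for this task
}

// Wraps the Kafka-backed CheckpointManager so that a checkpoint is only written
// after the open Hive transaction batch has been committed.
public class HiveSyncedCheckpointManager implements CheckpointManager {
  private final CheckpointManager delegate;      // e.g. the manager from KafkaCheckpointManagerFactory
  private final HiveWriterRegistry hiveWriters;

  public HiveSyncedCheckpointManager(CheckpointManager delegate, HiveWriterRegistry hiveWriters) {
    this.delegate = delegate;
    this.hiveWriters = hiveWriters;
  }

  @Override
  public void writeCheckpoint(TaskName taskName, Checkpoint checkpoint) {
    // Commit + close the open transaction batch first, so everything covered by
    // these offsets is durable in Hive before the checkpoint is persisted.
    hiveWriters.flush(taskName);
    delegate.writeCheckpoint(taskName, checkpoint);
  }

  @Override
  public Checkpoint readLastCheckpoint(TaskName taskName) {
    return delegate.readLastCheckpoint(taskName);
  }

  @Override public void start() { delegate.start(); }
  @Override public void register(TaskName taskName) { delegate.register(taskName); }
  @Override public void stop() { delegate.stop(); }
}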