Hi Vincent,

In batch mode with overwrite savemode we can achieve exactly-once, since we simply overwrite any existing files. Beyond that there is no guarantee, because a DataFrame/Dataset/RDD doesn't maintain any checkpoint/WAL to know where it left off before a crash.
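As a minimal sketch of that batch case (the paths are illustrative assumptions, and I use Spark's default parquet writer rather than anything CarbonData-specific), re-running the whole job is made idempotent by SaveMode.Overwrite:

  import org.apache.spark.sql.{SaveMode, SparkSession}

  object OverwriteBatchWrite {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("exactly-once-batch")
        .getOrCreate()

      // Read this batch's input; the path is illustrative.
      val df = spark.read.parquet("/data/input/batch-001")

      // SaveMode.Overwrite replaces whatever files an earlier failed
      // run left behind, so re-running the whole write is idempotent.
      df.write
        .mode(SaveMode.Overwrite)
        .parquet("/data/output/t1")

      spark.stop()
    }
  }

Re-running the job after a crash then simply replaces the earlier, possibly partial, output.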
In streaming mode, we are considering going further to guarantee exactly-once semantics with the help of checkpointing the offsets/WAL, and introducing a 'transactional' state that uniquely identifies the current batch of data so it is written out only once (skipped if it already exists); a sketch of this pattern follows the quoted message below.

Jihong

-----Original Message-----
From: vincent [mailto:[email protected]]
Sent: Tuesday, September 27, 2016 7:11 AM
To: [email protected]
Subject: RE: carbondata and idempotence

Hi, thanks for your answer.

My question is about both streaming and batch. Even in batch, if a worker crashes or if speculation is activated, the failed worker's task will be relaunched on another worker. For example, if the worker crashed after having ingested 20 000 of the task's 100 000 lines, the new worker will write the entire 100 000 lines again, resulting in 20 000 duplicated entries in the storage layer. This issue is generally managed with a primary key or with transactions, so that the new task overrides the first 20 000 lines, or the transaction covering the first 20 000 lines is rolled back.
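For the streaming case, here is a rough sketch of the checkpoint-plus-transactional-batch pattern described above. It uses Spark Structured Streaming's foreachBatch (added in Spark 2.4, so later than this thread); the Kafka source settings, paths, and the alreadyCommitted/recordCommit helpers are hypothetical stand-ins for a real durable commit log keyed by batch id:

  import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

  object IdempotentStreamingSink {
    // Hypothetical transactional state: in practice, a durable commit log
    // (e.g. a table keyed by batchId) that survives driver restarts.
    def alreadyCommitted(batchId: Long): Boolean = false // look up commit log
    def recordCommit(batchId: Long): Unit = ()           // append to commit log

    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("exactly-once-stream")
        .getOrCreate()

      val stream = spark.readStream
        .format("kafka")                                 // source is illustrative
        .option("kafka.bootstrap.servers", "host:9092")
        .option("subscribe", "events")
        .load()

      stream.writeStream
        .option("checkpointLocation", "/chk/events")     // offset/WAL checkpoint
        .foreachBatch { (batch: DataFrame, batchId: Long) =>
          // batchId identifies this micro-batch uniquely across restarts;
          // skip the write if an earlier attempt already committed it.
          if (!alreadyCommitted(batchId)) {
            batch.write
              .mode(SaveMode.Append)
              .parquet(s"/data/output/batch-$batchId")
            recordCommit(batchId)
          }
        }
        .start()
        .awaitTermination()
    }
  }

After a restart, the checkpoint replays the last uncommitted batch under the same batchId, and the commit check turns the duplicate write into a no-op.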
