Hi Vincent,

In batch mode, with the Overwrite save mode, we can achieve exactly-once
semantics because a re-run simply overwrites any existing files. Beyond
that there is no guarantee, since a DataFrame/Dataset/RDD does not keep
any checkpoint/WAL to know where it left off before a crash.
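
As a minimal sketch (plain Spark Scala, with hypothetical paths), the
Overwrite save mode makes a full re-run idempotent because the previous
output is replaced rather than appended to:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("batch-overwrite").getOrCreate()
val df = spark.read.parquet("/input/events")    // hypothetical input path

// Re-running the whole job after a crash cannot leave duplicates behind:
// Overwrite replaces any files already present at the target location.
df.write
  .mode(SaveMode.Overwrite)
  .parquet("/output/events")                    // hypothetical output path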

In streaming mode, we will go further to guarantee exactly-once semantics
with the help of checkpointed offsets/WAL, and by introducing a
'transactional' state that uniquely identifies the current batch of data
so that it is written out only once (skipped if it already exists).
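
For illustration only (this is not CarbonData's implementation): in newer
Spark releases, Structured Streaming's foreachBatch exposes a unique
batchId per micro-batch, which can play the role of the 'transactional'
identifier above; the broker, topic, and paths below are hypothetical:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("stream-exactly-once").getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
  .option("subscribe", "events")                    // hypothetical topic
  .load()

stream.writeStream
  .option("checkpointLocation", "/chk/events")      // checkpointed offsets/WAL
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // batchId uniquely identifies this micro-batch; key the output on it
    // and skip the write if it already exists, so replays are idempotent.
    val target = new Path(s"/output/batch=$batchId") // hypothetical layout
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    if (!fs.exists(target)) {
      batchDf.write.parquet(target.toString)
    }
  }
  .start()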

Jihong

-----Original Message-----
From: vincent [mailto:[email protected]] 
Sent: Tuesday, September 27, 2016 7:11 AM
To: [email protected]
Subject: RE: carbondata and idempotence

Hi
thanks for your answer. My question is about both streaming and batch. Even
in batch, if a worker crashes or if speculation is activated, the failed
worker's task will be relaunched on another worker. For example, if the
worker crashed after ingesting 20 000 of the task's 100 000 lines, the new
worker will write the entire 100 000 lines again, resulting in 20 000
duplicated entries in the storage layer.
This issue is generally managed with a primary key or with transactions, so
that either the new task overrides the first 20 000 lines, or the
transaction covering those 20 000 lines is rolled back.


