Folks,

I read Jun's paper, but I could not find enough detail on where the failure scenarios are. I am trying to use Kafka (or something else) for persistent queues. A big requirement for me is not to lose any message once it has been persisted. I am ready to contribute code to add some sort of replication, but I want to know where failures can happen.

1- What happens if ZooKeeper goes down and comes back up? What messages, if any, do we lose, and what compensation, if any, do we need to do on the consumer side?
2- What happens when the broker goes down?
   a- When the hard drive has a failure.
   b- When data is correctly written to disk but the process goes down and is restarted.
   c- What will happen to consumers intermittently?
   d- If we replicate the data, what reliability guarantees can we have?
   e- If CRC errors happen, can we pick up the record from another copy saved somewhere?

There is deep interest in this project in my group, and if it fits our needs we would like to run with it, both as users and as contributors.

Another question: why was the queue not built on Cassandra? Would that have met sub-second latency SLAs?

Ashutosh Singh