Ashutosh,

Good to hear about your interest in using and contributing to Kafka!
Please find some answers to your questions inline.

>> 1- What happens if zookeeper goes down and comes back up --- what
>> messages, if any, do we lose, and what compensation do we need to do on
>> the consumer side, if any?

The zookeeper clients in the broker and consumer will get disconnected from
zookeeper. If the zookeeper cluster comes back up within the session
timeout, all sessions will be restored. If it comes back after that, the
sessions will expire and new sessions will get established. In either case,
existing data on the brokers will not be lost; the consumers will just
receive it late.

>> 2- What happens when the broker goes down?
>> a- when the hard drive has a failure.

Without KAFKA-50, data will be lost. With KAFKA-50, the probability of data
loss is significantly reduced, unless there are multiple correlated
failures.

>> b- data is correctly written to disk but the process goes down and is
>> restarted.

No data will be lost in this case.

>> c- What will happen to consumers in the meantime?

If the brokers/zookeeper are restarted, the consumers will simply get the
data once the cluster is back up. No data will be lost.

>> d- If we replicate the data, what reliability guarantees can we have?

KAFKA-50 will add both sync and async replication support in Kafka. For
more details, see
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Replication

>> e- If CRC errors happen, can we pick up the record from another copy
>> saved somewhere?

Without KAFKA-50, data might get lost. With KAFKA-50, it will be served
from the remaining replicas.

>> Why was the queue not built on Cassandra?

I think if you read the design document
(http://incubator.apache.org/kafka/design.html), most design choices will
be easier to understand. Let us know if you have more questions after that.

Thanks,
Neha

On Fri, Apr 27, 2012 at 1:22 PM, Ashutosh Singh <ashutoshvsi...@gmail.com> wrote:

> Folks,
> I read Jun's paper, but I could not get enough details on where all the
> failure scenarios are at.
> I am trying to use kafka (or something else) for persistent queues. The
> big requirement for me would also be not to lose the messages persisted.
> I am ready to contribute the code to add some sort of replication, but
> want to know where all the failures can happen.
> 1- What happens if zookeeper goes down and comes back up --- What
> messages, if any, do we lose... what compensation do we need to do on the
> consumer side, if any?
> 2- What happens when the broker goes down.
>   a- when the hard drive has a failure.
>   b- data is correctly written to disk but the process goes down and is
> restarted.
>   c- What will happen to consumers intermittently...
>   d- If we replicate the data what reliability guarantees can we have.
>   e- If CRC errors happen, can we pick up the record from another copy
> saved somewhere
>
> There is a deep interest in my group in this project, and if it fits our
> needs we would like to run with it, both as users and as contributors.
>
> Another question... why was the queue not built on Cassandra? Would that
> have met sub-second latency SLAs?
>
> Ashutosh Singh
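(Editor's note: the session-timeout rule in the answer to question 1 can be
sketched as a toy Python function. This is illustrative only -- the timeout
value and the function name are invented, not ZooKeeper or Kafka APIs; the
real timeout is a client configuration setting.)

```python
# Toy model of the ZooKeeper session rule described above: a client that
# reconnects within the session timeout keeps its session (watches and
# ephemeral state survive); otherwise the session expires and a brand-new
# session must be established. The timeout value below is hypothetical.
SESSION_TIMEOUT_MS = 6000

def session_outcome(disconnect_duration_ms: int) -> str:
    if disconnect_duration_ms <= SESSION_TIMEOUT_MS:
        return "restored"   # outage shorter than the timeout: session survives
    return "expired"        # outage too long: a new session is established

print(session_outcome(2000))   # -> restored
print(session_outcome(60000))  # -> expired
```

Either way, as the reply notes, data already on the brokers is not lost;
consumers just receive it later.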
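(Editor's note: the CRC discussion in question 2e can be sketched with
Python's `zlib.crc32`. The record layout and the replica-fallback function
here are invented for illustration -- this is not Kafka's actual on-disk
message format or recovery code.)

```python
import zlib

def make_record(payload: bytes) -> tuple[int, bytes]:
    # Store a checksum alongside the payload, in the spirit of a
    # per-message CRC field (layout here is illustrative only).
    return zlib.crc32(payload), payload

def read_record(record: tuple[int, bytes],
                replicas: list[tuple[int, bytes]]) -> bytes:
    # Serve the payload if its CRC checks out; otherwise fall back to the
    # first intact replica copy -- mirroring the "served from the
    # remaining replicas" behavior that replication enables.
    crc, payload = record
    if zlib.crc32(payload) == crc:
        return payload
    for r_crc, r_payload in replicas:
        if zlib.crc32(r_payload) == r_crc:
            return r_payload
    # Without any intact replica, a CRC error means the data is lost.
    raise IOError("all copies failed CRC check")
```

Without replication there is no `replicas` list to fall back on, which is
why the reply says data might get lost without KAFKA-50.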
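(Editor's note: the sync-versus-async distinction in the answer to question
2d can be modeled as follows. This is a toy sketch of the general
trade-off; the class and function names are invented and do not reflect the
actual KAFKA-50 design.)

```python
class Replica:
    """A follower holding a copy of the log (toy model)."""
    def __init__(self):
        self.log = []

    def append(self, msg: str):
        self.log.append(msg)

def produce(msg: str, leader: Replica, followers: list[Replica],
            sync: bool) -> str:
    # The leader always writes locally first.
    leader.append(msg)
    if sync:
        # Synchronous replication: wait until every follower has the
        # message before acknowledging, so an acked message survives the
        # loss of the leader.
        for f in followers:
            f.append(msg)
        return "acked-durable"
    # Asynchronous replication: acknowledge immediately and let followers
    # catch up later -- lower latency, but a leader crash can lose the
    # unreplicated tail of the log.
    return "acked-maybe-lost"
```

The trade-off sketched here is the usual one: sync replication buys
stronger durability guarantees at the cost of per-message latency.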