Thanks for sharing this, Vlad; it's a great read! I am particularly interested in the last bullet point about one-to-one mapping in MM, since you also mentioned that you use Kafka MM as the async replication layer for your geo-replicated k-v store. One approach we are pursuing here to support active-active is an aggregate cluster that mirrors from multiple local clusters in different data centers. That approach, however, rules out one-to-one mapping, since it pipes multiple sources into a single destination. How did you tackle this problem at Turn, if you are also running active-active?
Guozhang

On Tue, Mar 24, 2015 at 12:18 PM, vlad...@gmail.com <vlad...@gmail.com> wrote:
> Dear all,
>
> I had a short discussion with Jay yesterday at the ACM meetup and he
> suggested writing an email regarding a few possible MirrorMaker
> improvements.
>
> At Turn, we have been using MirrorMaker for a few months now to
> asynchronously replicate our key/value store data between our datacenters.
> In a way, our system is similar to LinkedIn's Databus, but it uses Kafka
> clusters and MirrorMaker as its building blocks. Our overall message rate
> peaks at about 650K/sec and, when pushing data over high bandwidth-delay
> product links, we have found some minor bottlenecks.
>
> The MirrorMaker process uses a standard consumer to pull data from a remote
> datacenter. This implies that it opens a single TCP connection to each of
> the remote brokers and muxes requests for different topics and partitions
> over this connection. While this is a good thing in terms of keeping the
> congestion window open, over long-RTT lines with a rather high loss rate
> the congestion window will cap, in our case at just a few Mbps. Since the
> overall line bandwidth is much higher, this means that we have to start
> multiple MirrorMaker processes (somewhere in the hundreds) in order to
> fully use the line capacity. Being able to pool multiple TCP connections
> from a single consumer to a broker would solve this complication.
>
> The standard consumer also uses the remote ZooKeeper in order to manage the
> consumer group. While consumer group management is moving closer to the
> brokers, it might make sense to move the group management to the local
> datacenter, since that would avoid using the long-distance connection for
> this purpose.
>
> Another possible improvement assumes a further constraint, namely that the
> number of partitions for a topic is the same in both datacenters. In my
> opinion, this is a sane constraint, since it preserves the Kafka ordering
> guarantee per partition, instead of a simpler guarantee per key. This kind
> of guarantee can be useful, for example, in a system that compares
> partition contents to reach eventual consistency using Merkle trees. If the
> number of partitions is equal, then offsets have the same meaning for the
> same partition in both clusters, since the data in both partitions is
> identical before the offset. This allows a simple consumer to just query
> the local broker and the remote broker for their current offsets and, in
> case the remote broker is ahead, copy the extra data to the local cluster.
> Since the consumer offsets are no longer bound to the specific partitioning
> of a single remote cluster, the consumer could pull from any number of
> remote clusters, BitTorrent-style, whenever their offsets are ahead of the
> local offset. The group management problem would then reduce to assigning
> local partitions to different MirrorMaker processes, so group management
> could also be done locally in this situation.
>
> Regards,
> Vlad
>
> PS: Sorry if this is a double posting! The original posting did not appear
> in the archives for a while.

--
-- Guozhang
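The congestion-window cap Vlad describes can be roughly quantified with the Mathis et al. steady-state TCP throughput approximation, throughput ~ MSS / (RTT * sqrt(loss)). A back-of-the-envelope sketch (the MSS, RTT, loss rate, and line capacity below are illustrative assumptions, not measurements from this thread):

```python
import math

def tcp_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Mathis approximation for steady-state TCP throughput:
    throughput ~= MSS / (RTT * sqrt(p)), returned in bits/sec."""
    return (mss_bytes * 8) / (rtt_s * math.sqrt(loss_rate))

# Illustrative long-haul link parameters (assumed, not from the thread):
# 1460-byte MSS, 150 ms RTT, 0.1% packet loss.
per_conn = tcp_throughput_bps(mss_bytes=1460, rtt_s=0.150, loss_rate=0.001)
print(f"per-connection cap: {per_conn / 1e6:.1f} Mbps")

# At a cap of a few Mbps per connection, filling an assumed 1 Gbps line
# takes hundreds of parallel connections, which is consistent with the
# hundreds of MirrorMaker processes mentioned above.
line_bps = 1e9
print(f"connections needed: {math.ceil(line_bps / per_conn)}")
```

This is why pooling multiple TCP connections per consumer-broker pair (rather than one muxed connection) would let far fewer MirrorMaker processes saturate the link.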
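Vlad's equal-partition-count scheme can be sketched with in-memory stand-ins for a local and a remote partition log. Because both logs are identical up to the local log-end offset, catching up reduces to comparing offsets and copying the remote suffix; everything here (the helper name, the list-based logs) is an illustrative simplification, not the Kafka API:

```python
def sync_partition(local_log: list, remote_log: list) -> list:
    """One-to-one partition sync under the equal-partition-count
    constraint: the logs share an identical prefix, so only the
    log-end offsets need to be compared."""
    local_offset = len(local_log)    # local log-end offset
    remote_offset = len(remote_log)  # remote log-end offset
    if remote_offset > local_offset:
        # Remote is ahead: copy only the records the local cluster lacks.
        local_log.extend(remote_log[local_offset:remote_offset])
    return local_log

# The same offset comparison works against any remote cluster that is
# ahead, which is what enables the BitTorrent-style multi-source pull.
local = ["a", "b"]
remote = ["a", "b", "c", "d"]
print(sync_partition(local, remote))  # ['a', 'b', 'c', 'd']
```

Note the consumer state is just the local log-end offset per partition, not per-remote-cluster consumer offsets, which is why group management can stay entirely in the local datacenter.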