We have an application that ingests data and funnels it through one server, which then sprays it out to a distributed cluster. The data in the cluster is sharded, and each shard is replicated twice within the cluster. Any given node holds a few dozen shards, and we are thinking of increasing the number of data shards so that there will be hundreds of them across the cluster at any given time. We need to get the data to the cluster with very low latency. Nodes can go down and shards can get redistributed, so we can't easily map a message to the nodes that need it (the consuming nodes may change between when the message is written and when it needs to be consumed).
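For reference, the write path we're picturing looks roughly like the sketch below (the topic names, the shard-id extraction, and the producer settings are placeholders, not a real implementation): the ingest server tags each post with its shard id and hands it to Kafka, so nothing on the write side needs to know which node currently owns that shard.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PostRouter {
    private final KafkaProducer<String, byte[]> producer;

    public PostRouter(String brokers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", brokers);
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("linger.ms", "0");   // don't hold posts back to batch them
        props.put("acks", "1");        // leader-only ack, to keep latency down
        this.producer = new KafkaProducer<>(props);
    }

    // shardId is whatever our sharding function assigns to this post.
    public void route(int shardId, byte[] post) {
        // One shared topic, keyed by shard id:
        producer.send(new ProducerRecord<>("posts", Integer.toString(shardId), post));
        // The topic-per-shard variant would instead be:
        // producer.send(new ProducerRecord<>("shard-" + shardId, post));
    }
}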
We are evaluating Kafka for use in routing our incoming posts to the clustered servers. It looks like Kafka supports broker-side filtering (true?), but the API docs are a little sparse. Which would make more sense to implement (rough sketches of both are below)?

1. Write to several hundred queues and have each clustered server read from the few dozen it needs (either blocking on one thread per queue, or timing out and round-robining between queues).

2. Write to a much smaller number of queues and have each clustered server filter for the content it needs.

And of course the story wouldn't be complete if there weren't other consumers that need to read the same data, but without the strict latency requirements. Any suggestions?
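To make the two options concrete, here is roughly how we picture the consumer side of each (the class, topic names, group handling, and shard-key format are all invented for illustration):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ShardConsumer {
    private final KafkaConsumer<String, byte[]> consumer;
    private final Set<Integer> myShards;   // shards currently assigned to this node

    public ShardConsumer(String brokers, Set<Integer> myShards, String nodeId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", brokers);
        // Per-node group id, so each node sees every message on the topics it subscribes to.
        props.put("group.id", "cluster-node-" + nodeId);
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        this.consumer = new KafkaConsumer<>(props);
        this.myShards = myShards;
    }

    // Option 1: subscribe only to the per-shard topics this node owns; a single poll
    // loop covers all of them, so there is no thread-per-queue or round-robin timeout.
    public void subscribePerShardTopics() {
        List<String> topics = myShards.stream()
                .map(id -> "shard-" + id)
                .collect(Collectors.toList());
        consumer.subscribe(topics);
    }

    // Option 2: everyone reads the shared topic and filters client-side on the shard key.
    public void subscribeSharedTopic() {
        consumer.subscribe(List.of("posts"));
    }

    public void pollOnce() {
        ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(50));
        for (ConsumerRecord<String, byte[]> r : records) {
            // Option 1: the topic name identifies the shard; option 2: the key does.
            int shardId = r.key() != null
                    ? Integer.parseInt(r.key())
                    : Integer.parseInt(r.topic().substring("shard-".length()));
            if (myShards.contains(shardId)) {      // a no-op filter under option 1
                handle(shardId, r.value());
            }
        }
    }

    private void handle(int shardId, byte[] post) {
        // hand the post off to this node's processing for that shard
    }
}

Either way, the set of shards a node owns would have to be refreshed whenever shards get redistributed across the cluster.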