>> One of the main issues they see with Kafka is that it requires connections
>> from the Consolidation Server to Kafka brokers and to Zookeeper daemons
>> located in each "site", versus connections from log producers in all sites
>> to the Consolidation servers.
When you say "site", do you mean data center? If so, Kafka would be ideal,
since Kafka provides the ability to set up a cluster that replicates data
from several other clusters located in different data centers. Kafka has
compression and batching features built in that can make optimal use of
limited cross-DC bandwidth.

If you go down this route, you can set up a local Kafka and Zookeeper
cluster in each "site". Each "site" will have its producers send data to
the local Kafka cluster. The Kafka cluster in the "site" hosting the
consolidation servers will replicate data from every other "site". The
consolidation servers then act as Kafka consumers, pulling data from the
local Kafka cluster and performing aggregate analysis on every site's data.

The advantage of this solution over having the producers talk directly to
the consolidation servers is essentially the decoupling between producers
and consumers. If the consumers can't keep up with the producers, the
decoupling provides a persistent buffer that prevents your queues from
overflowing and protects the consumers from being overloaded. Kafka, being
horizontally scalable, allows you to scale out if the throughput
requirements increase in the future; this might not be easy to do at the
consolidation servers.

In addition, an advantage of Kafka is that you can consume the same data
multiple times as you find more applications in the future wanting to
perform different analysis on the log data. One example of this is offline
analytics using Hive/Pig. This is a huge win over the other solution, which
requires you to store multiple copies of the same data, a cost that grows
linearly with the number of consumer applications, making it a very
expensive solution.

Thanks,
Neha

On Mon, Oct 22, 2012 at 5:17 AM, Sybrandy, Casey
<casey.sybra...@six3systems.com> wrote:
> With regards to security, you can always use stunnel to handle the encryption.
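Casey's stunnel suggestion could be sketched roughly as follows on the remote-site side. This is an illustrative fragment only: the port numbers, section name, and file path are placeholders, not from the thread, and a matching `client = yes` stunnel instance would run on the consolidation-server site.

```ini
; Hypothetical stunnel.conf at a remote "site": terminates TLS coming from
; the consolidation site and forwards plaintext to the local Kafka broker.
cert = /etc/stunnel/site.pem

[kafka-broker]
; TLS port exposed through the site firewall (placeholder)
accept = 9093
; plaintext connection to the local Kafka broker
connect = 127.0.0.1:9092
```

Note that this only encrypts a single broker endpoint; since Kafka clients discover broker addresses from the cluster itself, tunneling a multi-broker cluster this way takes one stunnel service per broker.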
>
> -----Original Message-----
> From: Jun Rao [mailto:jun...@gmail.com]
> Sent: Sunday, October 21, 2012 5:45 PM
> To: kafka-users@incubator.apache.org
> Subject: Re: Kafka versus classic central HTTP(s) services for logs transmission
>
> Jean,
>
> I understand your IT guys' concerns. It's true that Kafka is relatively new
> and is not as widely adopted as some other conventional solutions. The
> following are what I see as the main benefits of Kafka:
>
> a. Scalability: the system is designed to scale out.
> b. Throughput: Kafka supports a batch API and compression, which increase
> the throughput of both producers and consumers.
> c. Integration for both offline and near-line consumption: with Kafka, you
> can use a single system to load data into an offline system such as Hadoop
> as well as to consume the data in real time.
> d. Durability and availability: in the upcoming 0.8 release, Kafka will
> support intra-cluster replication, which provides both higher durability
> and availability at low cost.
>
> For your concern #2, in 0.8 the producer doesn't need Zookeeper any more.
> Instead, it relies on an RPC to get topic metadata from the brokers.
>
> We haven't looked into security-related features. However, if this is a
> common requirement, we can add them in the future.
>
> Hope this is helpful.
>
> Thanks,
>
> Jun
>
> On Sun, Oct 21, 2012 at 1:44 AM, Jean Bic <jean.b...@gmail.com> wrote:
>
>> Joe:
>>
>> Thanks for your answer, but we're trying to push a Kafka broker at each
>> site...
>> ... so your answer makes me realize why we're trying to push Kafka
>> over per-producer service calls: that would make a very large number
>> of service calls from each site (our log producers gather data every
>> 5 minutes, on average 100 items of about 128 bytes per machine, and
>> we're targeting from 250 to 4000 machines per "site").
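As a quick sanity check on the figures Jean quotes above (~100 items of ~128 bytes per machine, gathered every 5 minutes, with 250 to 4000 machines per site), a back-of-the-envelope sketch:

```python
# Back-of-the-envelope estimate using the figures from Jean's message.
ITEMS_PER_INTERVAL = 100   # log items per machine per gathering interval
BYTES_PER_ITEM = 128       # approximate size of one item
INTERVAL_SECONDS = 5 * 60  # data gathered every 5 minutes


def site_throughput_bps(machines: int) -> float:
    """Average log volume one site generates, in bytes per second."""
    return machines * ITEMS_PER_INTERVAL * BYTES_PER_ITEM / INTERVAL_SECONDS


for machines in (250, 4000):
    print(f"{machines} machines: {site_throughput_bps(machines):,.0f} B/s")
```

Even at 4000 machines this stays under 200 KB/s per site before compression, so the raw bandwidth is modest; the bigger contrast is connection count, i.e. thousands of short-lived HTTPS calls every interval versus a handful of long-lived Kafka connections.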
>>
>> I think that, with these numbers, we have a way to make the IT people
>> understand that the Kafka solution will avoid flooding the site's
>> firewall infrastructure (which is active for outbound connections).
>> Beyond this good point for Kafka in terms of the number of concurrent
>> connections, I am wondering if we could find other assets for the
>> Kafka solution...
>>
>> Jean
>>
>> -----Original Message-----
>> From: Joe Stein [mailto:crypt...@gmail.com]
>> Sent: Sunday, October 21, 2012 1:26 AM
>> To: kafka-users@incubator.apache.org
>> Subject: Re: Kafka versus classic central HTTP(s) services for logs transmission
>>
>> You could move the producer code to the "site" and expose that as a
>> REST interface.
>>
>> You can then benefit from the scale and consumer functionality that
>> comes with Kafka without the issues you are bringing up.
>>
>> On Oct 20, 2012, at 4:27 PM, Jean Bic <jean.b...@gmail.com> wrote:
>>
>> > Hello,
>> >
>> > We have started to build a solution to gather logs from many
>> > machines located in various "sites" to a so-called "Consolidation
>> > server", whose role is to persist the logs and generate alerts based
>> > on some criteria (patterns in logs, triggers on some values, etc.).
>> >
>> > We are challenged by our future users to clarify why Kafka is the
>> > best possible communication solution for this need. They argue that
>> > it would be better to choose a more classic HTTP(S)-based solution,
>> > with producers calling REST services on a pool of Node.js servers
>> > behind a load balancer.
>> >
>> > One of the main issues they see with Kafka is that it requires
>> > connections from the Consolidation Server to Kafka brokers and to
>> > Zookeeper daemons located in each "site", versus connections from
>> > log producers in all sites to the Consolidation servers.
>> > Here Kafka is seen as a burden for each site's IT team, requiring
>> > some special firewall setup, versus
>> > no firewall setup with the service-based solution:
>> >
>> > 1. Kafka requires each site's IT team to create firewall rules
>> > accepting incoming connections for a "non-standard" protocol from
>> > the "Collector server" site.
>> >
>> > 2. The IT team must expose all Zookeeper and broker machines/ports
>> > to the "Collector server" site.
>> >
>> > 3. Kafka has no built-in encryption for data, whereas a classic
>> > service-oriented solution can rely on HTTPS (reverse) proxies.
>> >
>> > 4. Kafka is not commonly known by IT people, who do not know how to
>> > scale it: when should they add broker machines versus when should
>> > they add Zookeeper machines?
>> >
>> > With the service-based solution, the IT teams of each site are free
>> > of scalability issues; only on the "Consolidation server" site does
>> > one have to add Node.js machines to scale up.
>> >
>> > I agree that these IT concerns can't be taken lightly.
>> >
>> > I need help from the Kafka community to find rock-solid arguments
>> > for using Kafka over a classic service-based solution.
>> >
>> > How would you "defend" Kafka against the above "attacks"?
>> >
>> > Regards,
>> >
>> > Jean
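For reference on the bandwidth side of this debate: the batching and compression that Neha and Jun mention are plain producer settings. A sketch using 0.8-era property names follows; the broker hosts are placeholders and the property names are version-dependent (newer clients use `bootstrap.servers`, `compression.type`, `batch.size`, and `linger.ms` instead).

```properties
# Illustrative async-producer settings (0.8-era names; version-dependent).
# Broker hosts below are placeholders.
metadata.broker.list=broker1:9092,broker2:9092
# Buffer messages and ship them in compressed batches:
producer.type=async
compression.codec=gzip
batch.num.messages=200
queue.buffering.max.ms=5000
```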