We had one problem that would pop up out of nowhere...
https://groups.google.com/a/cloudera.org/group/flume-user/browse_thread/thread/66c6aecec9d1869b/a3110d1dfb9e1d0b?lnk=gst&q=static.void#a3110d1dfb9e1d0b
Another serious issue arose when agents started to produce massive amounts
of data. For example, the logs produced by one machine were maybe
1 MB/minute, but when the agent was unable to communicate with any
collectors for whatever reason, it would fill up with GBs of data
sitting in one of Flume's subfolders (sent, sending, completed, etc.).
Are there any links on how to do real-time analysis using Kafka?
Thanks again
On 11/3/11 12:18 PM, Neha Narkhede wrote:
Mark,
> First and foremost we are currently using rsyslog to aggregate our logs from our
> application servers.
This is similar to the legacy system we had at LinkedIn, now
successfully replaced by Kafka.
> Although this strategy has been working for our bulk processing needs, it
> doesn't help us much with real-time analysis, something we would really like to
> introduce.
Kafka is designed to efficiently feed both real-time and offline data
pipelines. Being a pub-sub messaging system, it fits the needs of
real-time applications well. Its high throughput and built-in
consumer parallelism make it a good fit for feeding large
systems like Hadoop and data warehouses. At LinkedIn, we use it for
activity tracking as well as real-time RPC log analysis.
For more information, please visit our webpage -
http://incubator.apache.org/kafka/index.html. It has a detailed design
writeup and a quickstart for you to try it out.
> We've tried Flume but that didn't work out too well.
I'm interested in knowing what roadblocks you hit while trying Flume
out, just out of curiosity.
Thanks,
Neha
On Thu, Nov 3, 2011 at 11:58 AM, Mark<static.void....@gmail.com> wrote:
Neha, thanks for the response.
I'll try to explain our use case. First and foremost, we are currently using
rsyslog to aggregate our logs from our application servers. This is
accomplished using its TCP plugin, which sends logs to a cluster of logging
machines. At the end of the day we then import this into Hadoop. Although
this strategy has been working for our bulk processing needs, it doesn't help
us much with real-time analysis, something we would really like to introduce.
We've tried Flume but that didn't work out too well. So now we are in the
process of looking into alternative technologies that can help us with both
our bulk and realtime analysis needs.
Does it sound like Kafka would be a nice fit for our use case? Are there any
examples or documentation on real-time analysis with Kafka?
Thanks.
On 11/3/11 11:37 AM, Neha Narkhede wrote:
Mark,
For activity on the mailing list, take a look at these metrics -
http://mail-archives.apache.org/mod_mbox/incubator-kafka-dev/
http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/
For activity of the committers and the development -
https://issues.apache.org/jira/browse/KAFKA#selectedTab=com.atlassian.jira.plugin.system.project%3Aissues-panel
A full-fledged comparison can be quite lengthy. Would you mind
describing your use case? We can discuss the available alternatives and
how Kafka would fit in.
Kafka has been deployed in production at LinkedIn for over a year and
a half. I believe there are other smaller startups using it too, and
more in the pipeline.
Thanks,
Neha
On Thu, Nov 3, 2011 at 11:00 AM, Mark<static.void....@gmail.com> wrote:
I was wondering what the current state of Kafka is. Is it gaining much
traction? How active are the project, committers, and mailing lists? Are
there other, more popular alternatives out there? Any comparison would help.
Thanks for any input.