Eric/Jun,

Can you throw some light on how to handle Apache log rotation? AFAIK, even if we write custom code to tail a file, the file handle is lost on rotation, which can result in some loss of data.
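For reference, here is a minimal rotation-aware tail sketch (Python, purely illustrative; the path and poll interval are made up, and comparing inodes like this only catches rename-based rotation, not copytruncate):

import os
import time

def follow(path, poll_interval=1.0):
    """Yield lines appended to `path`, reopening it when log rotation
    swaps the file out from under us (detected via an inode change)."""
    f = open(path, 'r')
    inode = os.fstat(f.fileno()).st_ino
    f.seek(0, os.SEEK_END)              # start at the end, like `tail -f`
    while True:
        line = f.readline()
        if line:
            yield line
            continue
        # No new data; check whether the file we hold open is still the
        # one the path points at.
        try:
            if os.stat(path).st_ino != inode:
                f.close()
                f = open(path, 'r')     # rotated: reopen and read from the top
                inode = os.fstat(f.fileno()).st_ino
                continue
        except FileNotFoundError:
            pass                        # rotation in progress; try again shortly
        time.sleep(poll_interval)

for line in follow('/var/log/apache2/access.log'):  # example path
    print(line, end='')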
On Thu, Sep 29, 2011 at 11:35 AM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:
> Thanks a lot for the comparison, Eric. Really good to hear a perspective from
> a user of both.
>
> On Sep 29, 2011, at 1:25 PM, Eric Hauser wrote:
>
>> Jeremy,
>>
>> I've used both Flume and Kafka, and I can provide some info for comparison:
>>
>> Flume
>> - The current Flume release 0.9.4 has some pretty nasty bugs in it
>>   (most have been fixed in trunk).
>> - Flume is more complex to maintain operations-wise (IMO) than Kafka,
>>   since you have to set up masters and collectors (you don't necessarily
>>   need collectors if you aren't writing to HDFS).
>> - Flume has a well-defined pattern for doing what you want:
>>   http://www.cloudera.com/blog/2010/09/using-flume-to-collect-apache-2-web-server-logs/
>>
>> Kafka
>> - If you need multiple Kafka partitions for the logs, you will want to
>>   partition by host so the messages arrive in order for the same host.
>> - You can use the same piped technique as Flume to publish to Kafka,
>>   but you'll have to write a little code to publish and subscribe to the
>>   stream.
>> - Kafka does not provide any of the file rolling, compression, etc.
>>   that Flume provides.
>> - If you ever want to do anything more interesting with those log
>>   files than just send them to one location, publishing them to Kafka
>>   would allow you to add additional consumers later. Flume has a
>>   concept of fanout sinks, but I don't care for the way it works.
>>
>>
>> On Thu, Sep 29, 2011 at 1:48 PM, Jun Rao <jun...@gmail.com> wrote:
>>> Jeremy,
>>>
>>> Yes, Kafka will be a good fit for that.
>>>
>>> Thanks,
>>>
>>> Jun
>>>
>>> On Thu, Sep 29, 2011 at 10:12 AM, Jeremy Hanna
>>> <jeremy.hanna1...@gmail.com> wrote:
>>>
>>>> We have a number of web servers in EC2, and periodically we just blow them
>>>> away and create new ones. That makes keeping logs problematic. We're
>>>> looking for a way to stream the logs from those various sources directly
>>>> to a central log server - either just a single server or HDFS or something
>>>> like that.
>>>>
>>>> My question is whether Kafka is a good fit for that, or should I be looking
>>>> more along the lines of Flume or Scribe?
>>>>
>>>> Many thanks.
>>>>
>>>> Jeremy
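Following up on the partition-by-host and piped-publish points quoted above, a rough sketch of the producer side (this uses the kafka-python client purely as an assumption, since the thread doesn't name a client; the broker address, topic name, and log path are placeholders):

#!/usr/bin/env python3
# Read Apache log lines from stdin and publish them to Kafka, keyed by the
# local hostname so all of one host's lines land in the same partition and
# stay in order, per Eric's suggestion.
import socket
import sys

from kafka import KafkaProducer  # kafka-python, assumed for illustration

producer = KafkaProducer(bootstrap_servers='kafka-broker:9092')  # placeholder address
host_key = socket.gethostname().encode('utf-8')

for line in sys.stdin:
    producer.send('apache_logs',                      # placeholder topic
                  key=host_key,
                  value=line.rstrip('\n').encode('utf-8'))

producer.flush()

Apache can feed a script like this directly via piped logging, e.g. CustomLog "|/usr/local/bin/kafka_pipe.py" combined, which also sidesteps the rotation issue because Apache writes to the pipe rather than to a log file.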