I have played around with Kafka, and with a single producer (which I think is a flow in Flume terms) I can write about 200,000 messages a second at 200 bytes each. What it does is buffer writes in memory for 10 seconds and then flush to disk (this was on a dedicated box with 4 GB of RAM).

This is just writing and not consuming (or sinking, in Flume terms). Curious whether there is a reason why Flume does 70K events/second.
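A minimal sketch of the broker-side flush settings that produce that "buffer in memory, flush every ~10 seconds" behaviour, using current Kafka broker property names with illustrative values (the 0.7-era names and the exact values in my setup may have differed):

    # Sketch only: flush the log to disk roughly every 10 seconds
    # rather than per message. Values are illustrative assumptions,
    # not the settings used in the test above.
    log.flush.interval.messages=10000000
    log.flush.interval.ms=10000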
On Tue, May 8, 2012 at 1:36 PM, Mike Percy <mpe...@cloudera.com> wrote:

> The "flow" terminology is something I could have defined better. I've
> updated the config file fragment a bit in the example on the wiki to
> hopefully clarify what was done:
> https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Performance+Measurements#FlumeNGPerformanceMeasurements-SyslogPerformanceTest20120430
>
> Generally when folks use the term "flow" as it relates to Flume it means
> an independent Source, Channel, Sink combination, which do not interact
> with any other Sources, Channels, or Sinks on the same agent. So each flow
> in this case had its own SyslogTcpSource port, its own MemoryChannel, and
> its own HDFSEventSink output path.
>
> Mike
>
> On May 8, 2012, at 9:38 AM, Arvind Prabhakar wrote:
>
> Hi Ahmed,
>
> > 3. 6 flume agent servers that collected data in memory and flushed to
> > the 9 hadoop servers
>
> All tests were carried out with a single Flume Agent (which corresponds to
> one JVM process) on a single host. However, each measurement was made with
> a different number of flows passing through this agent.
>
> Thanks,
> Arvind
>
>
> On Tue, May 8, 2012 at 9:08 AM, S Ahmed <sahmed1...@gmail.com> wrote:
>
>> Just want to make sure I understand the setup:
>>
>> 1. 9 hadoop servers that were fed the data
>> 2. 1 server was used to generate the syslog data that was spread across
>> the 6 flume agent servers
>> 3. 6 flume agent servers that collected data in memory and flushed to
>> the 9 hadoop servers
>>
>> Is that right?
>>
>>
>> On Tue, May 8, 2012 at 1:49 AM, Jarek Jarcec Cecho <jar...@apache.org> wrote:
>>
>>> Thanks Mike,
>>> this is indeed very helpful!
>>>
>>> Jarcec
>>>
>>> On Mon, May 07, 2012 at 06:55:49PM -0700, Mike Percy wrote:
>>> > Hi folks,
>>> > Will McQueen and I have been doing some Flume NG stress and
>>> > performance testing, and we wanted to share some of our recent findings.
>>> > The focus of the most recent tests has been on the syslog TCP source,
>>> > memory channel, and HDFS sink.
>>> >
>>> > I wrote some software to generate load in syslog format over TCP and
>>> > to automate some of the analysis. The first thing we wanted to verify is
>>> > that no data was lost during these tests (a.k.a. correctness), with a close
>>> > second priority being of course throughput (performance). I used Pig and
>>> > AvroStorage from piggybank in the data integrity analysis, and committed
>>> > the compiled (0.11 trunk) piggybank jar so the load analysis scripts would
>>> > be relatively easy to use. It seems to be compatible with Pig 0.8.1. I am a
>>> > little wary of having to maintain that type of thing at the Apache org
>>> > level, so for now I have checked all the code in on Github under an ASL 2.0
>>> > license:
>>> >
>>> > https://github.com/mpercy/flume-load-gen
>>> >
>>> > I have created a Wiki page with the performance metrics we have come
>>> > up with so far. The executive summary is that at the time of this writing,
>>> > we have observed Flume NG on a single machine processing events at a
>>> > throughput rate of 70,000+ events/sec with no data loss.
>>> >
>>> > https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Performance+Measurements
>>> >
>>> > I have put more details on the wiki page itself. Please let me know if
>>> > you want me to add more detail. I'll be looking into improving the
>>> > performance of these components going forward; however, we wanted to post
>>> > these results to set a public performance baseline of Flume NG.
>>> >
>>> > If others have done performance testing, we would love to see your
>>> > results if you can post the details.
>>> >
>>> > Regards,
>>> > Mike
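To make the "flow" definition quoted above concrete, here is a minimal sketch of a single-flow Flume NG agent configuration in the standard properties format. This is not the actual fragment from the wiki page; the agent and component names, port, capacities, and HDFS path are placeholder assumptions. Each additional flow in the multi-flow tests would repeat this pattern with its own source port, channel, and sink output path.

    # One flow: its own syslog TCP source, memory channel, and HDFS sink.
    # All names and values below are illustrative assumptions.
    agent1.sources = syslog1
    agent1.channels = mem1
    agent1.sinks = hdfs1

    # Syslog TCP source listening on its own port
    agent1.sources.syslog1.type = syslogtcp
    agent1.sources.syslog1.host = 0.0.0.0
    agent1.sources.syslog1.port = 5140
    agent1.sources.syslog1.channels = mem1

    # In-memory channel buffering events between source and sink
    agent1.channels.mem1.type = memory
    agent1.channels.mem1.capacity = 100000
    agent1.channels.mem1.transactionCapacity = 1000

    # HDFS sink writing to this flow's own output path
    agent1.sinks.hdfs1.type = hdfs
    agent1.sinks.hdfs1.channel = mem1
    agent1.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/flow1
    agent1.sinks.hdfs1.hdfs.fileType = DataStream
    agent1.sinks.hdfs1.hdfs.batchSize = 1000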