Flume would likely do pretty well in that kind of scenario as well -- if you configured a huge transaction size (2,000,000 events, I guess) as well as a very large capacity on the memory channel. You would also need to increase the timeout on the HDFS sink and disable serialization. At that point, you would likely be limited more by the speed of your Source than anything else.
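Roughly speaking, the knobs I mean look something like the sketch below. The agent/channel/sink names and the HDFS path are placeholders, the values are illustrative rather than benchmarked, and I'm reading "disable serialization" as writing a raw DataStream instead of a SequenceFile:

  # Memory channel sized for a very large backlog, with a matching
  # transaction size (capacity >= transactionCapacity).
  agent1.channels.ch1.type = memory
  agent1.channels.ch1.capacity = 2000000
  agent1.channels.ch1.transactionCapacity = 2000000

  # HDFS sink: large batches, a generous call timeout, and raw
  # DataStream output (no SequenceFile serialization).
  agent1.sinks.sink1.type = hdfs
  agent1.sinks.sink1.channel = ch1
  agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events
  agent1.sinks.sink1.hdfs.batchSize = 2000000
  agent1.sinks.sink1.hdfs.callTimeout = 60000
  agent1.sinks.sink1.hdfs.fileType = DataStream
  agent1.sinks.sink1.hdfs.writeFormat = Text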
However, depending on your use case this kind of benchmark is somewhat academic, right? I guess it depends on what kind of reliability semantics you are looking for and what serialization and other support you want. If your OS crashes or reboots, then you've potentially lost millions of events in your memory channel if you are storing tons of events there. So performance and reliability will always be a tradeoff.

Even so, I will be the first to point out that we could do much better, and we are working on improving the performance of the various components in the system.

Mike

On May 8, 2012, at 11:35 AM, S Ahmed wrote:

> I have played around with Kafka, and with a single producer (which I think in
> Flume terms is a flow) I can write about 200,000 messages a second that are
> 200 bytes each. What it does is it writes in memory for 10 seconds and
> flushes to disk. (This was on a dedicated box with 4 GB of RAM.)
>
> This is just writing and not consuming (or sinking, in Flume terms).
>
> Curious if there was a reason why Flume does 70K/second.
>
> On Tue, May 8, 2012 at 1:36 PM, Mike Percy <mpe...@cloudera.com> wrote:
> The "flow" terminology is something I could have defined better. I've updated
> the config file fragment a bit in the example on the wiki to hopefully
> clarify what was done:
> https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Performance+Measurements#FlumeNGPerformanceMeasurements-SyslogPerformanceTest20120430
>
> Generally, when folks use the term "flow" as it relates to Flume, it means an
> independent Source, Channel, Sink combination which does not interact with any
> other Sources, Channels, or Sinks on the same agent. So each flow in this
> case had its own SyslogTcpSource port, its own MemoryChannel, and its own
> HDFSEventSink output path.
>
> Mike
>
> On May 8, 2012, at 9:38 AM, Arvind Prabhakar wrote:
>
>> Hi Ahmed,
>>
>> > 3. 6 flume agent servers that collected data in memory and flushed to the
>> > 9 hadoop servers
>>
>> All tests were carried out with a single Flume Agent (which corresponds to
>> one JVM process) on a single host. However, each measurement was made with a
>> different number of flows passing through this agent.
>>
>> Thanks,
>> Arvind
>>
>>
>> On Tue, May 8, 2012 at 9:08 AM, S Ahmed <sahmed1...@gmail.com> wrote:
>> Just want to make sure I understand the setup:
>>
>> 1. 9 hadoop servers that were fed the data
>> 2. 1 server was used to generate the syslog data that was spread across the
>> 6 flume agent servers
>> 3. 6 flume agent servers that collected data in memory and flushed to the 9
>> hadoop servers
>>
>> Is that right?
>>
>>
>> On Tue, May 8, 2012 at 1:49 AM, Jarek Jarcec Cecho <jar...@apache.org> wrote:
>> Thanks Mike,
>> this is indeed very helpful!
>>
>> Jarcec
>>
>> On Mon, May 07, 2012 at 06:55:49PM -0700, Mike Percy wrote:
>> > Hi folks,
>> > Will McQueen and I have been doing some Flume NG stress and performance
>> > testing, and we wanted to share some of our recent findings. The focus of
>> > the most recent tests has been on the syslog TCP source, memory channel,
>> > and HDFS sink.
>> >
>> > I wrote some software to generate load in syslog format over TCP and to
>> > automate some of the analysis. The first thing we wanted to verify is that
>> > no data was lost during these tests (a.k.a. correctness), with a close
>> > second priority being of course throughput (performance).
>> > I used Pig and AvroStorage from piggybank in the data integrity analysis,
>> > and committed the compiled (0.11 trunk) piggybank jar so the load analysis
>> > scripts would be relatively easy to use. It seems to be compatible with
>> > Pig 0.8.1. I am a little wary of having to maintain that type of thing at
>> > the Apache org level, so for now I have checked all the code in on GitHub
>> > under an ASL 2.0 license:
>> >
>> > https://github.com/mpercy/flume-load-gen
>> >
>> > I have created a wiki page with the performance metrics we have come up
>> > with so far. The executive summary is that, at the time of this writing,
>> > we have observed Flume NG on a single machine processing events at a
>> > throughput rate of 70,000+ events/sec with no data loss.
>> >
>> > https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Performance+Measurements
>> >
>> > I have put more details on the wiki page itself. Please let me know if you
>> > want me to add more detail. I'll be looking into improving the performance
>> > of these components going forward; however, we wanted to post these results
>> > to set a public performance baseline for Flume NG.
>> >
>> > If others have done performance testing, we would love to see your results
>> > if you can post the details.
>> >
>> > Regards,
>> > Mike
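For anyone trying to picture the "flow" terminology from the thread above, here is a minimal sketch of a single agent running two independent flows, each with its own syslog TCP source port, memory channel, and HDFS sink path. This is not the configuration used for the wiki measurements; the component names, ports, and paths are placeholders:

  agent1.sources = src1 src2
  agent1.channels = ch1 ch2
  agent1.sinks = sink1 sink2

  # Flow 1: syslog TCP source on port 5140 -> memory channel -> HDFS path .../flow1
  agent1.sources.src1.type = syslogtcp
  agent1.sources.src1.host = 0.0.0.0
  agent1.sources.src1.port = 5140
  agent1.sources.src1.channels = ch1
  agent1.channels.ch1.type = memory
  agent1.sinks.sink1.type = hdfs
  agent1.sinks.sink1.channel = ch1
  agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/flow1

  # Flow 2: same pattern on its own port, channel, and output path
  agent1.sources.src2.type = syslogtcp
  agent1.sources.src2.host = 0.0.0.0
  agent1.sources.src2.port = 5141
  agent1.sources.src2.channels = ch2
  agent1.channels.ch2.type = memory
  agent1.sinks.sink2.type = hdfs
  agent1.sinks.sink2.channel = ch2
  agent1.sinks.sink2.hdfs.path = hdfs://namenode/flume/flow2

Because the two flows share nothing except the agent JVM, events from one source never pass through the other flow's channel or sink.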