hi all,

Last week I tested Flume with ScribeSource (https://issues.apache.org/jira/browse/FLUME-1382) and the HDFS sink; the detailed conditions and deployment are listed below. Too many 'Full GC' pauses hurt the throughput, and a large number of events are promoted into the old generation. I have applied some tuning methods, but with little effect.
Could someone give me feedback or tips on reducing the GC pressure? Thanks for your attention.

PS: I am using Mike's report template at https://cwiki.apache.org/FLUME/flume-ng-performance-measurements.html

* * *

*Flume Performance Test 2012-07-25*

*Overview*
The Flume agent was run on its own physical machine in a single JVM. A separate client machine generated load against the Flume box in List<LogEntry> format. Flume stored data onto a 4-node HDFS cluster configured on its own separate hardware. No virtual machines were used in this test.

*Hardware specs*
CPU: Intel Xeon L5640, 2 x 6-core @ 2.27 GHz (12 physical cores)
Memory: 16 GB
OS: CentOS release 5.3 (Final)

*Flume configuration*
Java version: 1.6.0_20 (Java HotSpot 64-Bit Server VM)
JAVA OPTS: -Xms1024m -Xmx4096m -XX:PermSize=256m -XX:NewRatio=1 -XX:SurvivorRatio=5 -XX:InitialTenuringThreshold=15 -XX:MaxTenuringThreshold=31 -XX:PretenureSizeThreshold=4096
Num. agents: 1
Num. parallel flows: 5
Source: ScribeSource
Channel: MemoryChannel
Sink: HDFSEventSink
Selector: RandomSelector

*Config file*
# list sources, channels, sinks for the agent
agent.sources = seqGenSrc
agent.channels = mc1 mc2 mc3 mc4 mc5
agent.sinks = hdfsSin1 hdfsSin2 hdfsSin3 hdfsSin4 hdfsSin5

# define sources
agent.sources.seqGenSrc.type = org.apache.flume.source.scribe.ScribeSource
agent.sources.seqGenSrc.selector.type = io.flume.RandomSelector

# define sinks
agent.sinks.hdfsSin1.type = hdfs
agent.sinks.hdfsSin1.hdfs.path = /flume_test/data1/
agent.sinks.hdfsSin1.hdfs.rollInterval = 300
agent.sinks.hdfsSin1.hdfs.rollSize = 0
agent.sinks.hdfsSin1.hdfs.rollCount = 1000000
agent.sinks.hdfsSin1.hdfs.batchSize = 10000
agent.sinks.hdfsSin1.hdfs.fileType = DataStream
agent.sinks.hdfsSin1.hdfs.txnEventMax = 1000
# ... define sinks #2 #3 #4 #5 ...

# define channels
agent.channels.mc1.type = memory
agent.channels.mc1.capacity = 1000000
agent.channels.mc1.transactionCapacity = 1000
# ... define channels #2 #3 #4 #5 ...

# specify the channel each sink and source should use
agent.sources.seqGenSrc.channels = mc1 mc2 mc3 mc4 mc5
agent.sinks.hdfsSin1.channel = mc1
# ... specify sinks #2 #3 #4 #5 ...

*Hadoop configuration*
The HDFS sinks were connected to a 4-node Hadoop cluster running CDH3u1. Each HDFS sink wrote data to a different HDFS path.

*Visualization of test setup*
https://lh3.googleusercontent.com/dGumq1pu1Wr3Bj8WJmRHOoLWmUlGqxC4wW7_XCNO9R1wuh15LRXaKKxGoccpjBXtgqcdSVW-vtg
There are 10 Scribe clients, and each client sends 20 million LogEntry objects to the ScribeSource.

*Data description*
List<LogEntry> entries containing a string category and a ByteArray body. The ByteArray body size is 500 bytes. (A simplified client sketch is included below, after the Heap and GC notes.)

*Results*
Throughput:
  Average: Source 46.4 MB/s, Sink 45.2 MB/s
  Maximum: Source 67.1 MB/s, Sink 88.3 MB/s
CPU usage:
  Average: 196%, Maximum: 440%
GC:
  Young GC: 1636 times, Full GC: 384 times
No data loss was observed.

*Heap and GC*
By analyzing the JVM heap, we found many LogEntry objects in the old generation. We have tried several optimizations, but the results are not satisfactory so far. We will continue to track this limitation.
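For reference, the load generator is roughly equivalent to the simplified, single-threaded sketch below (the real clients are 10 separate processes sending 20 million entries each). The LogEntry and scribe.Client classes are the ones Thrift generates from the stock scribe.thrift IDL, so their package, plus the host, port and category names, are placeholders here; please treat this as a sketch rather than the exact client code.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

// LogEntry and scribe.Client are Thrift-generated from the stock scribe.thrift IDL;
// their package depends on the build, so that import is omitted here.
public class ScribeLoadClient {
  public static void main(String[] args) throws Exception {
    // ScribeSource speaks framed binary Thrift; host and port are placeholders
    TSocket socket = new TSocket("flume-agent-host", 1463);
    TFramedTransport transport = new TFramedTransport(socket);
    scribe.Client client = new scribe.Client(new TBinaryProtocol(transport));
    transport.open();

    // 500-byte body, matching the data description above
    // (a String stands in for the ByteArray body used in the real test)
    char[] payload = new char[500];
    Arrays.fill(payload, 'x');
    String body = new String(payload);

    final int batchSize = 100;
    List<LogEntry> batch = new ArrayList<LogEntry>(batchSize);
    for (int i = 0; i < batchSize; i++) {
      batch.add(new LogEntry("perf_test", body));   // category + body
    }

    // each real client sends 20 million entries in total
    long sent = 0;
    while (sent < 20000000L) {
      client.Log(batch);                            // Log(1: list<LogEntry> messages)
      sent += batch.size();
    }
    transport.close();
  }
}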
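On the GC side, the next variant I plan to try pins the heap size and moves the old generation to CMS, along the lines of the candidate options below. The exact values are guesses that still need to be validated on this workload, and the gc.log path is just a placeholder.

JAVA OPTS (candidate):
-Xms4096m -Xmx4096m -XX:PermSize=256m
-Xmn2048m -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=15
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log

Two assumptions behind this: with -Xms1024m well below -Xmx4096m the heap keeps resizing under load, so the candidate pins -Xms to -Xmx; and as far as I know the object age field in HotSpot is only 4 bits, so the current -XX:MaxTenuringThreshold=31 should behave no differently from 15. Corrections welcome if either assumption is wrong.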
Full GC log examples:

[Full GC [PSYoungGen: 1497984K->0K(1797568K)] [PSOldGen: 1720643K->1693741K(2097152K)] 3218627K->1693741K(3894720K) [PSPermGen: 14566K->14566K(262144K)], 5.0027700 secs] [Times: user=5.01 sys=0.00, real=5.00 secs]
[Full GC [PSYoungGen: 1497960K->0K(1797568K)] [PSOldGen: 1693805K->1752540K(2097152K)] 3191765K->1752540K(3894720K) [PSPermGen: 14571K->14571K(262144K)], 5.0732570 secs] [Times: user=5.07 sys=0.00, real=5.07 secs]
[Full GC [PSYoungGen: 1497984K->0K(1797568K)] [PSOldGen: 1752540K->1642553K(2097152K)] 3250524K->1642553K(3894720K) [PSPermGen: 14572K->14568K(262144K)], 5.0710730 secs] [Times: user=5.07 sys=0.01, real=5.08 secs]

-Regards
Denny Ye
