Hello all, I'm working on a POC using Flume to aggregate our log files to S3, which will later be imported into HDFS and consumed by Hive. Here is my problem: the web server I'm currently using for the POC is not pushing much traffic, maybe 3 to 5 requests per second. This is resulting in a huge number of small files. I have roll set to 900000, which I thought would generate a file every 15 minutes. Instead, files are uploaded to S3 anywhere from 5 to 50 seconds apart, and they are pretty small, too: 600 bytes. My goal is to have at most 4 to 6 files per hour.
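For reference, here is the arithmetic behind my expectation (the files_per_hour helper is just my own sanity check, nothing Flume-specific):

```python
# Expected number of rolled output files per hour for a given roll interval.
def files_per_hour(roll_millis):
    return 3600 * 1000 / roll_millis

# roll(900000) = 15-minute rolls, so 4 files per hour per output path.
print(files_per_hour(900000))  # 4.0

# But files are actually landing every 5-50 seconds, which implies an
# effective roll interval of roughly 5000-50000 ms:
print(files_per_hour(50000))   # 72.0
print(files_per_hour(5000))    # 720.0
```

So whatever is rolling is doing it one to two orders of magnitude more often than I configured.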
Web server:

    source: tailDir("/var/log/apache2/site1/", fileregex="access.log")
    sink:   value("sitecode", "site1") value("subsitecode", "subsite1")
            agentDFOSink("collector node", 35853)

Collector node:

    source: collectorSource(35853)
    sink:   collector(35853) {
              webLogDecorator()
              roll(900000) {
                escapedFormatDfs("s3n://<valid s3 bucket>/hive/weblogs_live/dt=%Y-%m-%d/sitecode=%{sitecode}/subsitecode=%{subsitecode}/",
                                 "file-%{rolltag}", seqfile("snappy"))
              }
            }

Here is what my config looks like:

    <property>
      <name>flume.collector.roll.millis</name>
      <value>900000</value>
      <description>The time (in milliseconds) between when hdfs files are
      closed and a new file is opened (rolled).</description>
    </property>

    <property>
      <name>flume.agent.logdir.retransmit</name>
      <value>2700000</value>
      <description>The time (in milliseconds) between when hdfs files are
      closed and a new file is opened (rolled).</description>
    </property>

    <property>
      <name>flume.agent.logdir.maxage</name>
      <value>450000</value>
      <description>Number of milliseconds before a local log file is
      considered closed and ready to forward.</description>
    </property>

I have to be missing something. What am I doing wrong?

J