Hello all, I'm working on a POC using Flume to aggregate our log files to S3, which will later be imported into HDFS and consumed by Hive. Here is my problem: the web server I'm currently using for the POC is not pushing much traffic, maybe 3 to 5 requests per second. This is resulting in a huge number of small files. I have roll set to 900000, which I thought would generate a file every 15 minutes. Instead, files are uploaded to S3 anywhere from 5 to 50 seconds apart, and they are pretty small, too: 600 bytes. My goal is to have at most 4 to 6 files per hour.
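For reference, here is the arithmetic behind my expectation (the files_per_hour helper is just my own sanity check, nothing Flume-specific):

```python
# Expected number of rolled output files per hour for a given roll interval.
def files_per_hour(roll_millis):
    return 3600 * 1000 / roll_millis

# roll(900000) = 15-minute rolls, so 4 files per hour per output path.
print(files_per_hour(900000))  # 4.0

# But files are actually landing every 5-50 seconds, which implies an
# effective roll interval of roughly 5000-50000 ms:
print(files_per_hour(50000))   # 72.0
print(files_per_hour(5000))    # 720.0
```

So whatever is rolling is doing it one to two orders of magnitude more often than I configured.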
Web server:

    source: tailDir("/var/log/apache2/site1/", fileregex="access.log")
    sink:   value("sitecode", "site1") value("subsitecode", "subsite1")
            agentDFOSink("collector node", 35853)

Collector node:

    source: collectorSource(35853)
    sink:   collector(35853) {
              webLogDecorator()
              roll(900000) {
                escapedFormatDfs("s3n://<valid s3 bucket>/hive/weblogs_live/dt=%Y-%m-%d/sitecode=%{sitecode}/subsitecode=%{subsitecode}/",
                                 "file-%{rolltag}", seqfile("snappy"))
              }
            }

Here is what my config looks like:

    <property>
      <name>flume.collector.roll.millis</name>
      <value>900000</value>
      <description>The time (in milliseconds) between when hdfs files are
      closed and a new file is opened (rolled).</description>
    </property>

    <property>
      <name>flume.agent.logdir.retransmit</name>
      <value>2700000</value>
      <description>The time (in milliseconds) between when hdfs files are
      closed and a new file is opened (rolled).</description>
    </property>

    <property>
      <name>flume.agent.logdir.maxage</name>
      <value>450000</value>
      <description>Number of milliseconds before a local log file is
      considered closed and ready to forward.</description>
    </property>

I have to be missing something. What am I doing wrong?

J