Opened a bug: https://issues.apache.org/jira/browse/FLUME-983
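
For anyone who wants to see the effect outside of Flume, here's a quick standalone sketch
against the plain Avro API (made-up class and schema names, not the AvroDataFileOutputFormat
code): syncing the DataFileWriter after every append, which is roughly what the per-event
flush() ends up doing, leaves the codec with one tiny block per record, while letting blocks
fill normally compresses far better.

// Standalone demo, not Flume code. Needs avro and snappy-java on the classpath.
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class SyncPerAppendDemo {
    // Made-up single-field schema standing in for a web log line.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"LogLine\","
      + "\"fields\":[{\"name\":\"line\",\"type\":\"string\"}]}");

    static void write(File out, boolean syncEveryAppend) throws IOException {
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(SCHEMA));
        writer.setCodec(CodecFactory.snappyCodec());
        writer.create(SCHEMA, out);
        for (int i = 0; i < 100000; i++) {
            GenericRecord r = new GenericData.Record(SCHEMA);
            r.put("line", "127.0.0.1 - - [14/Feb/2012:13:17:00] \"GET / HTTP/1.1\" 200 " + i);
            writer.append(r);
            if (syncEveryAppend) {
                writer.sync();   // forces a (tiny) compressed block per record
            }
        }
        writer.close();          // close() writes out the final block either way
    }

    public static void main(String[] args) throws IOException {
        File perRecord = new File("per-record-sync.avro");
        File batched   = new File("batched.avro");
        write(perRecord, true);
        write(batched, false);
        // The per-record-sync file should come out far larger: each record gets
        // its own block, sync marker, and near-useless compression.
        System.out.println("sync per append: " + perRecord.length() + " bytes");
        System.out.println("batched blocks:  " + batched.length() + " bytes");
    }
}
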
Steve

On Mon, Feb 20, 2012 at 3:56 PM, Steve Hoffman <ste...@goofy.net> wrote:
> I noticed this too.
> I was using a custom sink based off the AvroDataFileOutputFormat.
> I think the problem comes from the sink.flush() call, which calls
> sync() in DataFileWriter.
> This in turn calls writeBlock(), which runs the data through the
> compression codec.
> Not sure where the logic goes wrong, but it seems that calling flush()
> keeps the amount of compressible data small, so compression doesn't
> really do much (I saw files about 1/2 the size of the original).
>
> Just to check whether the data was as small as it could be, I ran the
> file through avro-tools tojson followed by fromjson (with the snappy
> codec and my schema, taken from getschema in avro-tools) and noticed
> the resulting file was about 1/10th the size of the original.
>
> So, in my own output format, I removed that flush() line (in my
> format(), after the sink.append()), which brought the size down to
> almost the same as my tojson/fromjson experiment.
>
> So it seems that flush() should be removed, but I don't know what that
> will do to DFO/E2E flows (not that a flush really flushes to HDFS the
> way it would persist data on a POSIX filesystem).
>
> Can anybody more familiar with the code say what removing this flush()
> would do to Flume event processing/acks? Is this a bug (do we need a
> jira)?
>
> Steve
>
> On Tue, Feb 14, 2012 at 1:17 PM, Jeremy Custenborder
> <jcustenbor...@gmail.com> wrote:
>> Hello All,
>>
>> I'm working on a POC using Flume to aggregate our log files to S3,
>> which will later be imported to HDFS and consumed by Hive. Here is my
>> problem. The web server I'm currently using for the POC is not pushing
>> much traffic, maybe 3 to 5 requests per second. This is resulting in a
>> huge number of small files on the web server. I have roll set to
>> 900000, which I thought would generate a file every 15 minutes.
>> Instead I'm getting files uploaded to S3 anywhere from 5 seconds to
>> 50 seconds apart, and they are pretty small too, 600 bytes or so.
>> My goal is to have at most 4-6 files per hour.
>>
>> Web server
>>
>> source: tailDir("/var/log/apache2/site1/", fileregex="access.log")
>> sink: value("sitecode", "site1") value("subsitecode", "subsite1")
>>   agentDFOSink("collector node", 35853)
>>
>> Collector node
>>
>> source: collectorSource(35853)
>> sink: collector(35853) { webLogDecorator() roll(900000) {
>>   escapedFormatDfs("s3n://<valid s3 bucket>/hive/weblogs_live/dt=%Y-%m-%d/sitecode=%{sitecode}/subsitecode=%{subsitecode}/",
>>   "file-%{rolltag}", seqfile("snappy")) } }
>>
>> Here is what my config looks like.
>>
>> <property>
>>   <name>flume.collector.roll.millis</name>
>>   <value>900000</value>
>>   <description>The time (in milliseconds)
>>     between when hdfs files are closed and a new file is opened
>>     (rolled).
>>   </description>
>> </property>
>> <property>
>>   <name>flume.agent.logdir.retransmit</name>
>>   <value>2700000</value>
>>   <description>The time (in milliseconds)
>>     between when hdfs files are closed and a new file is opened
>>     (rolled).
>>   </description>
>> </property>
>> <property>
>>   <name>flume.agent.logdir.maxage</name>
>>   <value>450000</value>
>>   <description>Number of milliseconds before a local log file is
>>     considered closed and ready to forward.
>>   </description>
>> </property>
>>
>> I have to be missing something. What am I doing wrong?
>>
>> J
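
For reference, here is a rough sketch of the output-format change described above
(hypothetical class and field names, and a made-up single-string-field schema; this is
not the actual AvroDataFileOutputFormat source, so check the real code before changing
anything): the per-event flush() on the Avro writer is simply dropped, so DataFileWriter
only hands the codec full blocks when the collector rolls or closes the file.

import java.io.IOException;
import java.io.OutputStream;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Hypothetical output format, loosely following the pattern discussed in the thread.
public class CustomAvroOutputFormat {
    private final Schema schema;                  // assumed to have a string field "body"
    private DataFileWriter<GenericRecord> sink;   // lazily created on the first event

    public CustomAvroOutputFormat(Schema schema) {
        this.schema = schema;
    }

    /** Called once per event with the stream the collector is currently rolling. */
    public void format(OutputStream out, byte[] eventBody) throws IOException {
        if (sink == null) {
            sink = new DataFileWriter<GenericRecord>(
                new GenericDatumWriter<GenericRecord>(schema));
            sink.setCodec(CodecFactory.snappyCodec());
            sink.create(schema, out);
        }
        GenericRecord record = new GenericData.Record(schema);
        record.put("body", new String(eventBody, "UTF-8"));
        sink.append(record);
        // sink.flush();  // removed: flushing here forces a tiny compressed block
        //                // per event and roughly halves the compression win
    }

    /** Called when the collector rolls the file; the final block is written here. */
    public void close() throws IOException {
        if (sink != null) {
            sink.close();
        }
    }
}

The open question from the thread still applies: without the per-event flush(), events sit
in the writer's in-memory block until it fills or the file rolls, which is why the effect on
DFO/E2E acking needs to be confirmed before treating this as a safe fix.
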