[
https://issues.apache.org/jira/browse/FLUME-983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ashish Paliwal resolved FLUME-983.
----------------------------------
Resolution: Won't Fix
Fix Version/s: v0.9.5
Won't fix. The 0.x branch is not maintained anymore.
> snappy compression via AvroDataFileOutputFormat suboptimal
> ----------------------------------------------------------
>
> Key: FLUME-983
> URL: https://issues.apache.org/jira/browse/FLUME-983
> Project: Flume
> Issue Type: Bug
> Components: Sinks+Sources
> Affects Versions: v0.9.4
> Environment: Cloudera CDH3u2 flume + hadoop
> Reporter: Steve Hoffman
> Priority: Critical
> Fix For: v0.9.5
>
>
> I used the AvroDataFileOutputFormat with the snappy compression option to
> write compressed avro files to HDFS via flume.
> The original file was 106,514,936 bytes of json. The output is written to
> HDFS as raw (no flume wrapper).
> The file size I got using the snappy compression option was 47,520,735 bytes,
> which is about half the original size. Looking at the file directly, it didn't
> look like it had been compressed very much.
> So I used avro-tools tojson to convert my final flume-written output back to
> json, which resulted in a file size of 79,773,371 bytes (so this is basically
> the starting size of the data being compressed). Then I used avro-tools
> fromjson, giving it the same schema that getschema returned, plus the snappy
> compression option. The resulting file was 11,904,857 bytes (which seemed
> much better).
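> For reference, this is roughly what the avro-tools fromjson step with the
> snappy codec boils down to (a sketch, not the actual tool source; the file
> names schema.avsc, records.json and out.avro are just placeholders):
>
> import java.io.EOFException;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.InputStream;
>
> import org.apache.avro.Schema;
> import org.apache.avro.file.CodecFactory;
> import org.apache.avro.file.DataFileWriter;
> import org.apache.avro.generic.GenericDatumReader;
> import org.apache.avro.generic.GenericDatumWriter;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.avro.io.Decoder;
> import org.apache.avro.io.DecoderFactory;
>
> public class JsonToSnappyAvro {
>   public static void main(String[] args) throws Exception {
>     Schema schema = new Schema.Parser().parse(new File("schema.avsc"));
>     GenericDatumReader<GenericRecord> jsonReader =
>         new GenericDatumReader<GenericRecord>(schema);
>
>     DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
>         new GenericDatumWriter<GenericRecord>(schema));
>     writer.setCodec(CodecFactory.snappyCodec());  // compress a block at a time
>     writer.create(schema, new File("out.avro"));
>
>     InputStream in = new FileInputStream("records.json");
>     Decoder decoder = DecoderFactory.get().jsonDecoder(schema, in);
>     while (true) {
>       GenericRecord datum;
>       try {
>         datum = jsonReader.read(null, decoder);
>       } catch (EOFException end) {
>         break;  // no more JSON records
>       }
>       // append() only buffers; the writer compresses and writes a block once
>       // its internal buffer fills, so snappy sees large buffers of data
>       writer.append(datum);
>     }
>     writer.close();  // writes the final block and the file footer
>     in.close();
>   }
> }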
> So I asked myself why the data written via flume record by record wasn't
> compressed as well. Looking at the raw file written to HDFS, the header clearly
> showed 'snappy', but the data itself looked only minimally encoded/compressed.
> I looked at the source and was struck by a call to sink.flush() after the
> sink.append() in AvroDataFileOutputFormat.format().
> It appears that calling flush() after every append() was the root cause of the
> not-so-great compression.
> To test this theory, I recompiled the sink with the flush() line commented
> out. The resulting test wrote a file of 11,870,573 bytes for the same sample
> data (pretty much matching the version created on the command line with
> avro-tools).
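> To illustrate the effect outside of flume, here is a small standalone sketch
> (not the Flume source; the toy schema and the record count are made up) that
> compares the two write paths: flushing after every append() versus letting
> the DataFileWriter buffer full blocks before snappy-compressing them:
>
> import java.io.File;
>
> import org.apache.avro.Schema;
> import org.apache.avro.file.CodecFactory;
> import org.apache.avro.file.DataFileWriter;
> import org.apache.avro.generic.GenericData;
> import org.apache.avro.generic.GenericDatumWriter;
> import org.apache.avro.generic.GenericRecord;
>
> public class FlushPerRecordDemo {
>   static final Schema SCHEMA = new Schema.Parser().parse(
>       "{\"type\":\"record\",\"name\":\"Event\",\"fields\":"
>       + "[{\"name\":\"body\",\"type\":\"string\"}]}");
>
>   public static void main(String[] args) throws Exception {
>     write(new File("per-record-flush.avro"), true);   // what the sink does now
>     write(new File("block-buffered.avro"), false);    // flush() line removed
>   }
>
>   static void write(File out, boolean flushEachRecord) throws Exception {
>     DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
>         new GenericDatumWriter<GenericRecord>(SCHEMA));
>     writer.setCodec(CodecFactory.snappyCodec());
>     writer.create(SCHEMA, out);
>     for (int i = 0; i < 100000; i++) {
>       GenericRecord r = new GenericData.Record(SCHEMA);
>       r.put("body", "{\"seq\":" + i + ",\"msg\":\"some repetitive json payload\"}");
>       writer.append(r);
>       if (flushEachRecord) {
>         // flush() writes out whatever is sitting in the current block buffer,
>         // so every record becomes its own tiny snappy-compressed block
>         writer.flush();
>       }
>     }
>     writer.close();
>     System.out.println(out + ": " + out.length() + " bytes");
>   }
> }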
> I'm filing this because I think this may be a bug that wastes a lot of space
> for users trying to use snappy compression (or any compression for that
> matter). I'm not really sure what the impact of removing this flush() call is
> either (since the file doesn't really exist in HDFS until it is closed).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)