[ https://issues.apache.org/jira/browse/FLUME-983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashish Paliwal resolved FLUME-983.
----------------------------------
       Resolution: Won't Fix
    Fix Version/s: v0.9.5

Won't fix. The 0.x branch is no longer maintained.

> snappy compression via AvroDataFileOutputFormat suboptimal
> ----------------------------------------------------------
>
>                 Key: FLUME-983
>                 URL: https://issues.apache.org/jira/browse/FLUME-983
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: v0.9.4
>         Environment: Cloudera CDH3u2 flume + hadoop
>            Reporter: Steve Hoffman
>            Priority: Critical
>             Fix For: v0.9.5
>
>
> I used the AvroDataFileOutputFormat with the snappy compression option to 
> write compressed avro files to HDFS via flume.
> The original file was 106,514,936 bytes of json.  The output is written to 
> HDFS as raw (no flume wrapper).
> The file size I got using the snappy compression option was 47,520,735 bytes, 
> which is about half the original size.  Looking at the file directly, it didn't 
> look like it had been compressed much.
> So I used avro-tools tojson to convert my final flume-written output back to 
> JSON, which resulted in a file of 79,773,371 bytes (so this is basically the 
> starting size of the data being compressed).  Then I used avro-tools fromjson, 
> giving it the same schema that getschema returned, plus the snappy compression 
> option.  The resulting file was 11,904,857 bytes, which seemed much better.
> So I asked myself why the data written via flume record by record wasn't 
> compressed as much.  Looking at the raw file written to HDFS clearly showed 
> 'snappy' in the header, yet the data looked minimally encoded/compressed.
> I looked at the source and was struck by a call to sink.flush() after the 
> sink.append() in AvroDataFileOutputFormat.format().
> It appears that this flush() call after every append was the root cause of the 
> poor compression.
> To test this theory, I recompiled the sink with the flush() line commented 
> out.  The resulting test wrote a file of 11,870,573 bytes for the same sample 
> data (pretty much matching the version created on the command line with 
> avro-tools).
> I'm filing this because I think this may be a bug that wastes a lot of space 
> for users trying to use snappy compression (or any compression, for that 
> matter).  I'm also not sure what the impact of removing this flush() call is 
> (since the file doesn't really exist in HDFS until it is closed).
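For reference, below is a minimal standalone sketch (not the actual Flume sink
code) of the behaviour described in the report: flushing an Avro DataFileWriter
after every append() closes the current block, so Snappy compresses one record
at a time instead of a full buffered block. The class name, schema, record
count, and output file names are made up for illustration.

    import java.io.File;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    /**
     * Hypothetical repro sketch: writes the same records twice, once with a
     * flush() after every append() (mirroring the sink behaviour described in
     * the report) and once letting the writer buffer a full block.
     */
    public class FlushPerRecordDemo {

        private static final Schema SCHEMA = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":"
                + "[{\"name\":\"body\",\"type\":\"string\"}]}");

        static long write(File out, boolean flushEveryRecord) throws IOException {
            try (DataFileWriter<GenericRecord> writer =
                         new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
                writer.setCodec(CodecFactory.snappyCodec());
                writer.create(SCHEMA, out);
                for (int i = 0; i < 100_000; i++) {
                    GenericRecord rec = new GenericData.Record(SCHEMA);
                    rec.put("body", "{\"level\":\"INFO\",\"msg\":\"sample log line " + i + "\"}");
                    writer.append(rec);
                    if (flushEveryRecord) {
                        // Mirrors the flush() after append(): the pending block
                        // (here, a single record) is written out immediately.
                        writer.flush();
                    }
                }
            }
            return out.length();
        }

        public static void main(String[] args) throws IOException {
            long flushed  = write(new File("flush-per-record.avro"), true);
            long buffered = write(new File("flush-at-close.avro"), false);
            System.out.println("flush per record : " + flushed + " bytes");
            System.out.println("flush at close   : " + buffered + " bytes");
        }
    }

Comparing the two printed sizes should show the same pattern the reporter
observed: per-record flushing produces a noticeably larger file than letting
the writer accumulate a full block before it is compressed and written.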



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
