[ https://issues.apache.org/jira/browse/FLUME-983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Hoffman updated FLUME-983:
--------------------------------
Description:
I used the AvroDataFileOutputFormat with the Snappy compression option to write
compressed Avro files to HDFS via Flume. The original file was 106,514,936 bytes
of JSON. The output is written to HDFS as raw (no Flume wrapper).

The file size I got using the Snappy compression option was 47,520,735 bytes,
which is about half the original size. Looking at the file directly, it didn't
look like it had been compressed very much.
So I used avro-tools tojson to convert my final Flume-written output back to
JSON, which resulted in a file of 79,773,371 bytes (so this is basically the
starting size of the data being compressed). Then I used avro-tools fromjson,
giving it the same schema that getschema returned and the Snappy compression
option. The resulting file was 11,904,857 bytes, which seemed much better.
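For comparison, the same re-encode can be done directly against the Avro Java
API, skipping the JSON round trip: read the Flume-written file and rewrite it
with the Snappy codec, letting DataFileWriter batch records into blocks on its
own. This is just my own sketch (class name and arguments are placeholders, and
it assumes avro plus snappy-java on the classpath), not anything shipped with
Flume or avro-tools:

{code:java}
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class ReencodeSnappy {
  public static void main(String[] args) throws Exception {
    File in = new File(args[0]);   // e.g. the Flume-written .avro file
    File out = new File(args[1]);  // re-encoded output

    DataFileReader<GenericRecord> reader =
        new DataFileReader<GenericRecord>(in, new GenericDatumReader<GenericRecord>());
    Schema schema = reader.getSchema();

    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.setCodec(CodecFactory.snappyCodec());
    writer.create(schema, out);

    // No flush()/sync() per record: blocks are cut only when the internal
    // buffer fills (or on close), so Snappy sees large blocks.
    for (GenericRecord record : reader) {
      writer.append(record);
    }
    writer.close();
    reader.close();
  }
}
{code}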
So I asked myself why the data written via Flume record by record wasn't
compressed as well. Looking at the raw file written to HDFS clearly showed
'snappy' in the header, and the data looked minimally encoded/compressed.

I looked at the source and was struck by a call to sink.flush() after the
sink.append() in AvroDataFileOutputFormat.format(). It appears that calling
flush() after every append was the root cause of the not-so-great compression:
Avro data files are compressed one block at a time, and DataFileWriter.flush()
ends the current block, so flushing after every record leaves each record in
its own tiny block and gives Snappy almost nothing to work with.
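The effect is easy to demonstrate outside of Flume. Here is a self-contained
sketch of my own (schema, record contents, and file names are made up) that
writes the same records twice with the Snappy codec, once flushing after every
append the way the sink does, and once letting DataFileWriter cut blocks on its
own; the first file comes out far larger:

{code:java}
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class FlushPerRecordDemo {
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Line\",\"fields\":"
          + "[{\"name\":\"text\",\"type\":\"string\"}]}");

  static long write(File out, boolean flushEveryRecord) throws Exception {
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(SCHEMA));
    writer.setCodec(CodecFactory.snappyCodec());
    writer.create(SCHEMA, out);
    for (int i = 0; i < 100000; i++) {
      GenericRecord r = new GenericData.Record(SCHEMA);
      r.put("text", "some highly repetitive log line number " + i);
      writer.append(r);
      if (flushEveryRecord) {
        writer.flush(); // ends the current block: one tiny Snappy block per record
      }
    }
    writer.close();
    return out.length();
  }

  public static void main(String[] args) throws Exception {
    System.out.println("flush per record: " + write(new File("per-record.avro"), true));
    System.out.println("batched blocks:   " + write(new File("batched.avro"), false));
  }
}
{code}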
To test this theory, I recompiled the sink with the flush() line commented out.
The resulting test wrote a file of 11,870,573 bytes for the same sample data,
pretty much matching the version created on the command line with avro-tools.
I'm filing this because I think this bug may be wasting a lot of space for
users trying to use Snappy compression (or any compression, for that matter).
I'm also not sure what the impact of removing the flush() call would be, since
the file doesn't really exist in HDFS until it is closed.
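If simply deleting the flush() turns out to be too aggressive (nothing would
reach the HDFS stream until the file is closed or rolled), a possible middle
ground, sketched below with made-up names and an illustrative threshold, would
be to flush every N appends instead of after every one, so blocks stay large
while data still gets pushed out periodically:

{code:java}
import java.io.IOException;

import org.apache.avro.file.DataFileWriter;

// Hypothetical wrapper (not part of Flume): flush the Avro writer every N
// appends instead of after every append. Blocks then hold up to N records,
// so Snappy has something to compress, while data still reaches the
// underlying HDFS stream periodically before the file is closed.
public class BatchingAvroAppender<T> {
  private final DataFileWriter<T> writer;
  private final int flushEveryN; // e.g. 10000; tune for durability vs. block size
  private long sinceLastFlush = 0;

  public BatchingAvroAppender(DataFileWriter<T> writer, int flushEveryN) {
    this.writer = writer;
    this.flushEveryN = flushEveryN;
  }

  public void append(T record) throws IOException {
    writer.append(record);
    if (++sinceLastFlush >= flushEveryN) {
      writer.flush(); // ends the current block and flushes the stream
      sinceLastFlush = 0;
    }
  }

  public void close() throws IOException {
    writer.close(); // final block is written and compressed here
  }
}
{code}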
> snappy compression via AvroDataFileOutputFormat suboptimal
> ----------------------------------------------------------
>
> Key: FLUME-983
> URL: https://issues.apache.org/jira/browse/FLUME-983
> Project: Flume
> Issue Type: Bug
> Components: Sinks+Sources
> Affects Versions: v0.9.4
> Environment: Cloudera CDH3u2 flume + hadoop
> Reporter: Steve Hoffman
> Priority: Critical
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira