[ 
https://issues.apache.org/jira/browse/AVRO-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15388004#comment-15388004
 ] 

Piotr Wikieł edited comment on AVRO-1862 at 7/21/16 4:49 PM:
-------------------------------------------------------------

[~mike.hurley] let me explain how I use it. I have a tool that concatenates 
multiple small avro files (produced by a Kafka-HDFS pipeline) into one big 
file. It speeds up reading a Hive partition containing those files by about 
40%. The output compression can be set as a parameter. 

The Kafka-HDFS ingestion tool we use is Camus. It stores avro files with the 
deflate codec, which is (in one of a few scenarios) also our output 
compression. We also don't want to run concatenation on files in directories 
that are already concatenated while, at the same time, supporting late 
records. The simplest way is to rely on the extension. Camus stores files with 
the {{.avro}} extension, so we must use a different one (we calculate the size 
of files with the {{.deflate.avro}} extension and if it is > 0, we run 
concatenation). 
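
As a rough illustration of that check (not the actual camus-compressor code), the sketch below sums file sizes by extension. The function name, the sample listing, and the use of plain (name, size) pairs instead of a real HDFS listing are all assumptions made for this example.

```python
# Hypothetical sketch: classify files in one directory by extension.
# A real tool would obtain names and sizes from an HDFS listing.

CONCATENATED_SUFFIX = ".deflate.avro"  # extension used for concatenated output
AVRO_SUFFIX = ".avro"                  # extension Camus uses for its files


def bytes_with_suffix(files, suffix, exclude=None):
    """Sum the sizes of files whose name ends with `suffix`.

    `exclude` is an optional longer suffix to skip, needed because
    "x.deflate.avro" also ends with ".avro".
    """
    total = 0
    for name, size in files:
        if exclude is not None and name.endswith(exclude):
            continue
        if name.endswith(suffix):
            total += size
    return total


# Hypothetical directory listing: (file name, size in bytes).
listing = [
    ("part-0.avro", 1024),
    ("part-1.avro", 2048),
    ("merged.deflate.avro", 4096),
]

concatenated_bytes = bytes_with_suffix(listing, CONCATENATED_SUFFIX)
camus_bytes = bytes_with_suffix(listing, AVRO_SUFFIX, exclude=CONCATENATED_SUFFIX)
```

Note that a plain `.avro` suffix test also matches `.deflate.avro` files, which is why the two groups have to be separated explicitly.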

I know there are many ways to achieve this goal, but I also think that a 
backward-compatible, disabled-by-default feature that does not change the 
surrounding code could be accepted without harm to the project, because it 
could be useful to others, not only to me. That said, I have no problem with 
the patch being rejected if you convince me ;)

If you want to see the code of the tool I mentioned, here it is: 
https://github.com/allegro/camus-compressor

I would be grateful if you could tell me how to run continuous integration for 
my patch. This is my first one and I don't see it described in the "How to 
contribute" section of the project wiki. 

Cheers! :)


> AvroOutputFormat saves compressed avro files without respecting codec's 
> default extension
> -----------------------------------------------------------------------------------------
>
>                 Key: AVRO-1862
>                 URL: https://issues.apache.org/jira/browse/AVRO-1862
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.8.1
>            Reporter: Piotr Wikieł
>            Priority: Minor
>              Labels: patch
>             Fix For: 1.8.2
>
>         Attachments: AVRO-1862-1.patch, AVRO-1862.patch
>
>
> A common pattern in naming compressed files is to give them an extension 
> derived from the compression codec, for example: {{.gz}}, {{.zip}}, {{.bz2}}. 
> {{AvroOutputFormat}} currently does not respect this convention. 
> I've adapted some code from Hadoop's {{TextOutputFormat}} in a 
> backward-compatible manner, adding the following {{JobConf}} property:
> {{avro.mapred.output.extension.from-codec}} ({{boolean}}, default: {{false}}) 
> - when set to {{true}}, the extension will be changed according to the above 
> rule.
> EDIT: Please take a look at the first comment for an update. Extensions such 
> as {{.gz.avro}} or {{.snappy.avro}} will be used when the above property is 
> set to true.
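
To illustrate the naming rule from the description, here is a small Python sketch (not the patch itself, which targets the Java {{AvroOutputFormat}}). The function name `output_extension` and its parameters are hypothetical; only the property name, its {{false}} default, and the resulting {{.gz.avro}} / {{.snappy.avro}} style extensions come from this issue.

```python
# Hypothetical model of the behaviour controlled by the
# avro.mapred.output.extension.from-codec property described above.

AVRO_EXT = ".avro"


def output_extension(codec_suffix: str, from_codec: bool = False) -> str:
    """Return the file extension for an output file.

    codec_suffix: the codec's default extension, e.g. ".gz" or ".snappy"
                  (an empty string means no codec-specific suffix).
    from_codec:   stands in for the boolean property; it defaults to False,
                  so existing jobs keep the plain ".avro" extension.
    """
    if from_codec and codec_suffix:
        return codec_suffix + AVRO_EXT
    return AVRO_EXT


default_name = "part-00000" + output_extension(".snappy")
opt_in_name = "part-00000" + output_extension(".snappy", from_codec=True)
```

With the flag left at its default, the name stays `part-00000.avro`. With the flag enabled, the same call produces `part-00000.snappy.avro`.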



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
