[
https://issues.apache.org/jira/browse/AVRO-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15388004#comment-15388004
]
Piotr Wikieł edited comment on AVRO-1862 at 7/21/16 4:34 PM:
-------------------------------------------------------------
[~mike.hurley] let me explain how I use it. I have a tool that
concatenates multiple small Avro files (produced by a Kafka-HDFS pipeline) into
one big file. This speeds up reading a Hive partition containing those files by
about 40%. The output compression can be set as a parameter.
The Kafka-HDFS ingestion tool we use is Camus. It stores Avro files with the
deflate codec, which is (in one of a few scenarios) also our output compression.
We also don't want to run concatenation on files in already concatenated
directories while, at the same time, supporting late records. The simplest way
is to rely on the extension. Camus stores files with the {{.avro}} extension, so
we must use a different one (we calculate the size of files with the
{{.deflate.avro}} extension and if it is > 0, we run concatenation).
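The size check described above could be sketched roughly as follows. This is only an illustration: it uses {{java.nio.file}} on a local directory rather than HDFS, and the class name {{SuffixSizeSketch}} and method {{totalSize}} are hypothetical, not taken from camus-compressor. What to do with the result is left to the caller.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper, not part of Avro or camus-compressor.
final class SuffixSizeSketch {

    // Sums the sizes (in bytes) of the regular files directly inside `dir`
    // whose file name ends with `suffix`. Subdirectories are not visited.
    static long totalSize(Path dir, String suffix) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(suffix))
                .mapToLong(p -> {
                    try {
                        return Files.size(p);
                    } catch (IOException e) {
                        // Streams cannot throw checked exceptions directly.
                        throw new UncheckedIOException(e);
                    }
                })
                .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Path.of(args.length > 0 ? args[0] : ".");
        System.out.println(totalSize(dir, ".deflate.avro"));
    }
}
```

Note that a plain suffix match on {{.avro}} would also match {{.deflate.avro}} files, so the longer suffix has to be checked explicitly when the two kinds of files need to be told apart.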
I know that there are many ways to achieve this goal, but I also think that a
backward-compatible, disabled-by-default feature that does not change the
surrounding code could be accepted without harm to the project, because it
could be useful not only to me. But I have no problem with the patch being
rejected if you convince me in some way ;)
If you want to see the code of the tool I've mentioned, here it is:
https://github.com/allegro/camus-compressor
Cheers! :)
> AvroOutputFormat saves compressed avro files without respecting codec's
> default extension
> -----------------------------------------------------------------------------------------
>
> Key: AVRO-1862
> URL: https://issues.apache.org/jira/browse/AVRO-1862
> Project: Avro
> Issue Type: Improvement
> Components: java
> Reporter: Piotr Wikieł
> Priority: Minor
> Attachments: AVRO-1862-1.patch, AVRO-1862.patch
>
>
> A common pattern in naming compressed files is to give them an extension
> derived from the compression codec, for example: {{.gz}}, {{.zip}}, {{.bz2}}.
> {{AvroOutputFormat}} currently does not respect this convention.
> I've adapted some code from Hadoop's {{TextOutputFormat}} in a
> backward-compatible manner, adding the following {{JobConf}} property:
> {{avro.mapred.output.extension.from-codec}} ({{boolean}}, default: {{false}})
> - when set to {{true}}, the extension will be changed according to the above rule.
> EDIT: Please take a look at the first comment for an update. The file
> extension will be, for example, {{.gz.avro}} or {{.snappy.avro}} when the above
> property is set to true.
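
The naming rule from the EDIT note can be illustrated with a small, self-contained sketch. It is not the code from the attached patches: the class {{CodecExtensionSketch}} and method {{outputExtension}} are hypothetical, and it assumes the caller already knows the codec's own extension (for example {{snappy}} or {{gz}}) and the value of the boolean property.

```java
// Hypothetical illustration of the naming rule, not the AVRO-1862 patch.
final class CodecExtensionSketch {

    static final String AVRO_EXT = ".avro";

    // Returns the extension to append to an output file name.
    // `codecExtension` is the codec's own extension, with or without a
    // leading dot; `fromCodec` stands in for the value of the
    // avro.mapred.output.extension.from-codec property.
    static String outputExtension(String codecExtension, boolean fromCodec) {
        if (!fromCodec || codecExtension == null || codecExtension.isEmpty()) {
            // Default, backward-compatible behaviour: plain ".avro".
            return AVRO_EXT;
        }
        String ext = codecExtension.startsWith(".")
            ? codecExtension
            : "." + codecExtension;
        return ext + AVRO_EXT;
    }

    public static void main(String[] args) {
        System.out.println("part-00000" + outputExtension("snappy", true));
        System.out.println("part-00000" + outputExtension("snappy", false));
    }
}
```

With the flag off, the result stays {{.avro}}, which is what keeps the feature backward compatible.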
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)