[ 
https://issues.apache.org/jira/browse/AVRO-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924710#action_12924710
 ] 

Scott Carey commented on AVRO-684:
----------------------------------

Yes, this would be useful.

Most of the machinery for this is already in the DataFileWriter class.  It is 
not exposed in a command-line tool though.

I currently use this machinery to take a large list of small Avro files and 
merge them into one larger Avro file with a set compression type and level.

In addition to the compression level, there is the concept of forcing a 
re-encode.  By default, the current code re-encodes a block only when required, 
that is, when the source and target codecs differ: it will decode deflate to 
null, or encode null to deflate.  It will not re-encode deflate:1 to deflate:3 
unless you pass the flag that forces a re-encode.  If a block is already 
compatible, it just copies the raw bytes of the block, which is very fast.
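
To make the default behaviour concrete, here is a rough sketch of the per-block 
decision as described above.  It is not the DataFileWriter code; the class, 
enum, and method names (RecodecSketch, CodecSpec, Action, decide) are made up 
for illustration, and it assumes codec specs are written as "null", "deflate", 
or "deflate:N".

```java
// Illustrative sketch only: hypothetical names, not part of the Avro API.
final class RecodecSketch {

    /** Parsed form of a codec spec such as "null", "deflate", or "deflate:5". */
    static final class CodecSpec {
        final String name;
        final int level; // -1 when the spec carries no explicit level

        CodecSpec(String name, int level) {
            this.name = name;
            this.level = level;
        }

        /** Splits "name:level" at the first colon; a missing level becomes -1. */
        static CodecSpec parse(String spec) {
            int colon = spec.indexOf(':');
            if (colon < 0) {
                return new CodecSpec(spec, -1);
            }
            String name = spec.substring(0, colon);
            int level = Integer.parseInt(spec.substring(colon + 1));
            return new CodecSpec(name, level);
        }
    }

    /** What to do with one block of the source file. */
    enum Action { COPY_RAW_BLOCK, REENCODE }

    /**
     * Re-encode when forced, or when the codec names differ
     * (deflate to null, null to deflate). A block whose codec name already
     * matches is copied as raw bytes, even if the requested level differs.
     */
    static Action decide(CodecSpec source, CodecSpec target, boolean forceReencode) {
        if (forceReencode) {
            return Action.REENCODE;
        }
        if (!source.name.equals(target.name)) {
            return Action.REENCODE;
        }
        return Action.COPY_RAW_BLOCK;
    }
}
```

With this sketch, decide(parse("deflate:1"), parse("deflate:3"), false) yields 
COPY_RAW_BLOCK, while the same call with the force flag set to true yields 
REENCODE.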

This tool should also support concatenation of files and creation of one larger 
file from a collection of smaller ones (of the same schema) with the requested 
encoding.  Maybe something like this:

{noformat}
$ avro-tools append_to -f outfile.avro -c deflate:5 infile.avro [infile2.avro ...]
{noformat}

This would create outfile.avro with codec deflate:5 from multiple source files.


> Java tool for altering the codec of an Avro data file stream.
> -------------------------------------------------------------
>
>                 Key: AVRO-684
>                 URL: https://issues.apache.org/jira/browse/AVRO-684
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Patrick Linehan
>
> An example is worth a thousand words:
>   cat infile.avro | avro-tools recodec deflate - - > outfile.avro
> The above example would create a new file, "outfile.avro", with the same 
> contents as "infile.avro".  However, the codec of "outfile.avro" would be 
> "deflate", regardless of the codec of "infile.avro".
> Proposed features:
> * The tool should preserve any metadata present in the input file.
> * Supported codecs will be "deflate" and "null".
> * Optionally add support for specifying the deflation level, perhaps with 
> syntax as follows:  "deflate:N" where N is the deflation level, e.g. 
> "deflate:4".
> Does this proposal sound reasonable?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
