[jira] [Commented] (AVRO-991) Allow combining multiple Avro files within a stream. (no files on disk)

Scott Carey (Commented) (JIRA) Mon, 16 Jan 2012 16:32:04 -0800

    [ 
https://issues.apache.org/jira/browse/AVRO-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187335#comment-13187335
 ]


Scott Carey commented on AVRO-991:
----------------------------------

{quote}
A file starts with ASCII 'O'. Interpreted as a variable-length zig-zag encoded 
long, this is -40, which is an invalid item count. So a DataFileStream would 
need to, when the item count is -40, try to read a file header, and if its 
schema is compatible, update its sync and codec and keep reading.{quote}
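
The -40 figure in the quote can be checked by hand. A minimal sketch, assuming only Avro's variable-length zig-zag encoding for longs (the helper name `zigzag_decode` is made up for illustration and is not part of any Avro API):

```python
def zigzag_decode(n):
    """Map an unsigned zig-zag value back to a signed integer."""
    return (n >> 1) ^ -(n & 1)

# 'O' is 0x4F (79). Its high bit is clear, so it forms a complete
# one-byte varint and no continuation bytes follow.
first_byte = ord("O")
item_count = zigzag_decode(first_byte)
print(item_count)  # -40
```

A reader that decodes -40 where it expects a block's item count therefore has a cheap signal that it may be looking at the start of another file header rather than a data block.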

* If reading sequentially this will work, but it means that the resulting 
concatenated file cannot be split.

I think the first thing we need to do is add a tool to avro-tools that does 
the equivalent of 'cat file1.avro file2.avro > combined-file.avro'. If the 
schemas are equal, this is extremely fast: blocks can be copied, with new sync 
markers written between them. This requires no format change. The same tool 
could be extended to 'recodec' (change the compression codec) or to change the 
sync interval size. It could also convert compatible schemas if need be.
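
The block-copy idea can be sketched in a few dozen lines. This is only an illustration under stated assumptions, not the proposed avro-tools implementation: it parses the container layout directly (magic bytes, metadata map, 16-byte sync marker, then blocks of item count, byte size, data and sync), it treats schemas as equal only when their JSON text is byte-identical, and every function name here (`read_long`, `write_long`, `read_header`, `write_header`, `read_blocks`, `concat`) is hypothetical.

```python
MAGIC = b"Obj\x01"
SYNC_SIZE = 16

def read_long(buf):
    """Read one zig-zag varint; return None at a clean end of stream."""
    acc, shift = 0, 0
    while True:
        b = buf.read(1)
        if not b:
            if shift == 0:
                return None
            raise EOFError("truncated varint")
        acc |= (b[0] & 0x7F) << shift
        if not b[0] & 0x80:
            return (acc >> 1) ^ -(acc & 1)
        shift += 7

def write_long(n):
    """Encode a signed 64-bit integer as a zig-zag varint."""
    acc = (n << 1) ^ (n >> 63)
    out = bytearray()
    while acc & ~0x7F:
        out.append((acc & 0x7F) | 0x80)
        acc >>= 7
    out.append(acc)
    return bytes(out)

def read_header(buf):
    """Return (metadata dict, sync marker) from the start of a file."""
    if buf.read(4) != MAGIC:
        raise ValueError("not an Avro container file")
    meta = {}
    while True:
        count = read_long(buf)
        if count is None:
            raise EOFError("truncated header")
        if count == 0:
            break
        if count < 0:          # negative count is followed by a byte size
            read_long(buf)
            count = -count
        for _ in range(count):
            key = buf.read(read_long(buf)).decode("utf-8")
            meta[key] = buf.read(read_long(buf))
    return meta, buf.read(SYNC_SIZE)

def write_header(out, meta, sync):
    out.write(MAGIC)
    if meta:
        out.write(write_long(len(meta)))
        for key, value in meta.items():
            kb = key.encode("utf-8")
            out.write(write_long(len(kb)) + kb + write_long(len(value)) + value)
    out.write(write_long(0) + sync)

def read_blocks(buf, sync):
    """Yield (item count, raw block bytes) without decoding any records."""
    while True:
        count = read_long(buf)
        if count is None:
            return
        data = buf.read(read_long(buf))
        if buf.read(SYNC_SIZE) != sync:
            raise ValueError("bad sync marker")
        yield count, data

def concat(inputs, out):
    """Copy blocks from same-schema, same-codec inputs into one output."""
    first_meta = out_sync = None
    for buf in inputs:
        meta, sync = read_header(buf)
        ident = (meta.get("avro.schema"), meta.get("avro.codec", b"null"))
        if first_meta is None:
            first_meta, out_sync = ident, sync
            write_header(out, meta, sync)   # reuse the first file's header
        elif ident != first_meta:
            raise ValueError("schema or codec differs; cannot block-copy")
        for count, data in read_blocks(buf, sync):
            # The block body is copied untouched; only the trailing sync
            # marker is replaced with the output file's marker.
            out.write(write_long(count) + write_long(len(data)) + data + out_sync)
```

Given binary file objects, usage would look like `concat([open("file1.avro", "rb"), open("file2.avro", "rb")], open("combined-file.avro", "wb"))`. Because every block in the output ends with the same sync marker, the result stays splittable, which is the advantage over raw stream concatenation.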

In most cases, when I have a few hundred files that I want to lump together 
into fewer, I'd be happy if the result were one file per schema. IMO all we 
need is tool support for easy concatenation of same-schema files, with some 
metadata preservation.
                
> Allow combining multiple Avro files within a stream. (no files on disk)
> -----------------------------------------------------------------------
>
>                 Key: AVRO-991
>                 URL: https://issues.apache.org/jira/browse/AVRO-991
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.1
>            Reporter: Frank Grimes
>
> It would be nice to be able to do as follows:
>   cat file1.avro file2.avro | java -jar avro-tools.jar streamcombine > combined-file.avro
> or similarly
>   hadoop dfs -cat hdfs://hadoop/file1.avro hdfs://hadoop/file2.avro | java -jar avro-tools.jar streamcombine | hdfs -put - hdfs://hadoop/combined-file.avro
> See the following thread for details: 
> http://mail-archives.apache.org/mod_mbox/avro-user/201201.mbox/%[email protected]%3e

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
