[ https://issues.apache.org/jira/browse/AVRO-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187335#comment-13187335 ]
Scott Carey commented on AVRO-991:
----------------------------------
{quote}
A file starts with ASCII 'O'. Interpreted as a variable-length zig-zag encoded
long, this is -40, which is an invalid item count. So a DataFileStream would
need to, when the item count is -40, try to read a file header, and if its
schema is compatible, update its sync and codec and keep reading.{quote}
* If reading sequentially this will work, but it means that the resulting
concatenated file cannot be split.
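To make the -40 in the quoted text concrete, here is a minimal Python sketch of decoding a variable-length zig-zag long from the first byte of a file. The function name decode_zigzag_long is my own for illustration and is not part of any Avro API.

```python
from typing import Tuple


def decode_zigzag_long(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one variable-length zig-zag long starting at data[pos].

    Returns (value, position of the next unread byte).
    Hypothetical helper for illustration, not an Avro library function.
    """
    shift = 0
    raw = 0
    while True:
        byte = data[pos]
        pos += 1
        # The low 7 bits carry payload, least-significant group first.
        raw |= (byte & 0x7F) << shift
        # A clear high bit marks the last byte of the varint.
        if byte & 0x80 == 0:
            break
        shift += 7
    # Undo the zig-zag mapping: even -> non-negative, odd -> negative.
    value = (raw >> 1) ^ -(raw & 1)
    return value, pos


# An Avro data file begins with the magic bytes 'O', 'b', 'j', 1.
print(decode_zigzag_long(b"Obj\x01"))  # (-40, 1)
```

'O' is 0x4F (79), its high bit is clear so it is a one-byte varint, and 79 un-zig-zags to -40.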
I think the first thing we need to do is add a tool to avro-tools that does
the equivalent of 'cat file1.avro file2.avro > combined-file.avro'. If the
schemas are equal, this is extremely fast (blocks can be copied and new sync
markers written between them). This requires no format change. The same tool
can be extended to 'recodec' the data or change the sync interval size. It can
also convert between compatible schemas if need be.
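As a rough illustration of the block-copy idea, here is a Python sketch that works directly on the container format: read each header, then copy every block unchanged and follow it with the output file's sync marker. This is not the proposed avro-tools command. All function names are hypothetical, and it assumes the inputs have byte-identical "avro.schema" metadata and the same codec, which is stricter than schema equality. It reuses the first file's sync marker for the output rather than generating a fresh one.

```python
import io
from typing import BinaryIO, Dict, Iterable, Iterator, Tuple

MAGIC = b"Obj\x01"
SYNC_SIZE = 16


def read_long(stream: BinaryIO) -> int:
    """Read one variable-length zig-zag long; EOFError at end of stream."""
    shift, raw = 0, 0
    while True:
        b = stream.read(1)
        if not b:
            raise EOFError
        raw |= (b[0] & 0x7F) << shift
        if b[0] & 0x80 == 0:
            return (raw >> 1) ^ -(raw & 1)
        shift += 7


def write_long(value: int) -> bytes:
    """Encode a 64-bit signed integer as a variable-length zig-zag long."""
    raw = (value << 1) ^ (value >> 63)
    out = bytearray()
    while raw & ~0x7F:
        out.append((raw & 0x7F) | 0x80)
        raw >>= 7
    out.append(raw)
    return bytes(out)


def read_header(stream: BinaryIO) -> Tuple[Dict[str, bytes], bytes]:
    """Return (metadata map, sync marker) from a container file header."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro data file")
    meta: Dict[str, bytes] = {}
    while True:
        count = read_long(stream)
        if count == 0:
            break
        if count < 0:  # negative count is followed by the block's byte size
            read_long(stream)
            count = -count
        for _ in range(count):
            key = stream.read(read_long(stream)).decode("utf-8")
            meta[key] = stream.read(read_long(stream))
    return meta, stream.read(SYNC_SIZE)


def read_blocks(stream: BinaryIO, sync: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (object count, serialized block bytes) without decoding them."""
    while True:
        try:
            count = read_long(stream)
        except EOFError:
            return
        data = stream.read(read_long(stream))
        if stream.read(SYNC_SIZE) != sync:
            raise ValueError("sync marker mismatch")
        yield count, data


def concat(inputs: Iterable[BinaryIO], output: BinaryIO) -> None:
    """Copy all blocks from inputs into one file with a single header."""
    first_meta, out_sync = None, b""
    for src in inputs:
        meta, sync = read_header(src)
        if first_meta is None:
            first_meta, out_sync = meta, sync
            output.write(MAGIC)
            if meta:
                output.write(write_long(len(meta)))
                for key, value in meta.items():
                    k = key.encode("utf-8")
                    output.write(write_long(len(k)) + k)
                    output.write(write_long(len(value)) + value)
            output.write(write_long(0) + out_sync)
        elif (meta.get("avro.schema") != first_meta.get("avro.schema")
              or meta.get("avro.codec", b"null")
              != first_meta.get("avro.codec", b"null")):
            raise ValueError("schema or codec differs; blocks cannot be copied")
        for count, data in read_blocks(src, sync):
            # Block payload is copied untouched; only the sync marker changes.
            output.write(write_long(count) + write_long(len(data)) + data + out_sync)
```

Because the payload bytes are never decoded, the cost is essentially that of a file copy, and the result has one sync marker throughout, so it stays splittable.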
I find that in most cases, when I have a few hundred files that I want to lump
together into fewer, I'd be happy if the result were one file per schema. IMO
all we need is tool support for easy concatenation of same-schema files with
some metadata preservation.
> Allow combining multiple Avro files within a stream. (no files on disk)
> -----------------------------------------------------------------------
>
> Key: AVRO-991
> URL: https://issues.apache.org/jira/browse/AVRO-991
> Project: Avro
> Issue Type: Improvement
> Components: java
> Affects Versions: 1.6.1
> Reporter: Frank Grimes
>
> It would be nice to be able to do as follows:
> cat file1.avro file2.avro | java -jar avro-tools.jar streamcombine > combined-file.avro
> or similarly
>
> hadoop dfs -cat hdfs://hadoop/file1.avro hdfs://hadoop/file2.avro | java -jar avro-tools.jar streamcombine | hdfs -put - hdfs://hadoop/combined-file.avro
> See the following thread for details:
> http://mail-archives.apache.org/mod_mbox/avro-user/201201.mbox/%[email protected]%3e