[ 
https://issues.apache.org/jira/browse/AVRO-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187351#comment-13187351
 ] 

Scott Carey commented on AVRO-991:
----------------------------------

{quote}
For the record, the thinking behind the varied sync marker is that it makes 
collisions less likely. In theory this is not true, but in practice my concern 
was that, once a value was fixed and known, there'd be a significantly higher 
probability that someone would include it in some data. Perhaps that's not 
correct, though.{quote}

If the sync marker were known to have a few properties, it would reduce the 
collision rate with typical Avro data written with the 'null codec':
* It could contain a sequence of bytes that cannot be interpreted as UTF-8 
(e.g. insufficient or too many continuation bytes).
* It could contain a sequence of bytes that cannot be interpreted as an Avro 
encoded int or long (e.g. 10 consecutive bytes with the MSB set).
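
A minimal sketch of how a candidate marker could be checked against the two 
properties above. The helper names and the candidate value are hypothetical, 
not part of Avro. It assumes Avro's variable-length encoding, in which a set 
MSB means "another byte follows" and a 64-bit long needs at most 10 bytes, so 
a run of 10 MSB-set bytes cannot occur inside a valid int or long.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def longest_msb_run(data: bytes) -> int:
    """Length of the longest run of consecutive bytes with the MSB set."""
    longest = current = 0
    for b in data:
        current = current + 1 if b & 0x80 else 0
        longest = max(longest, current)
    return longest


def cannot_be_varint(data: bytes) -> bool:
    """A valid 64-bit varint has at most 9 continuation (MSB-set) bytes,
    so a run of 10 or more can never sit inside a single int or long."""
    return longest_msb_run(data) >= 10


# Hypothetical 16-byte marker: ten 0xFF bytes, then six filler bytes.
# 0xFF never appears in well-formed UTF-8, and all ten bytes have the MSB set.
candidate = bytes([0xFF] * 10) + bytes(6)

print(is_valid_utf8(candidate))     # False
print(cannot_be_varint(candidate))  # True
```

Note that this only checks the marker itself; it says nothing about how often 
such a sequence shows up in bytes or fixed fields, which can hold anything.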

In order to achieve the above you lose some randomness, and we may have to 
compensate with a couple extra bytes.
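
As a rough illustration of that trade-off, the sketch below counts the random 
bits left in a marker. It assumes the current 16-byte random marker and a 
purely hypothetical layout: one byte pinned to 0xFF (never valid in UTF-8) and 
nine more bytes with only the MSB forced, giving the 10-byte MSB run. The 
function name and the layout are illustrative only.

```python
def random_bits(total_bytes: int, msb_forced_bytes: int, fixed_bytes: int) -> int:
    """Count the random bits left in a marker.

    fixed_bytes      -- bytes pinned to a constant (0 random bits each)
    msb_forced_bytes -- bytes with only the top bit forced (7 random bits each)
    remaining bytes  -- fully random (8 random bits each)
    """
    free_bytes = total_bytes - msb_forced_bytes - fixed_bytes
    if free_bytes < 0:
        raise ValueError("constrained bytes exceed marker length")
    return free_bytes * 8 + msb_forced_bytes * 7


print(random_bits(16, 0, 0))  # 128: today's fully random 16-byte marker
print(random_bits(16, 9, 1))  # 111: same length with the constraints applied
print(random_bits(18, 9, 1))  # 127: two extra bytes nearly restore the original
print(random_bits(19, 9, 1))  # 135: three extra bytes exceed it
```

Under these assumptions the constraints cost 17 bits, so two or three extra 
bytes would be enough to make up the difference.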

For each codec, there may be a byte sequence that is impossible in the encoded 
data.  Each codec could have its own sync marker.  Files with incompatible 
codecs could not be concatenated together anyway.

                
> Allow combining multiple Avro files within a stream. (no files on disk)
> -----------------------------------------------------------------------
>
>                 Key: AVRO-991
>                 URL: https://issues.apache.org/jira/browse/AVRO-991
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.1
>            Reporter: Frank Grimes
>
> It would be nice to be able to do as follows:
>   cat file1.avro file2.avro | java -jar avro-tools.jar streamcombine > combined-file.avro
> or similarly
>   hadoop dfs -cat hdfs://hadoop/file1.avro hdfs://hadoop/file2.avro | java -jar avro-tools.jar streamcombine | hdfs -put - hdfs://hadoop/combined-file.avro
> See the following thread for details: 
> http://mail-archives.apache.org/mod_mbox/avro-user/201201.mbox/%[email protected]%3e

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
