[ 
https://issues.apache.org/jira/browse/AVRO-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482009#comment-17482009
 ] 

Ryan Skraba commented on AVRO-3307:
-----------------------------------

Hello!  If I understand correctly, the issue is that strictly according to the 
spec, the number {{5}} can be encoded as inefficiently as you like:

{code}
byte[] fiveSerialized = { 0x0A };
byte[] fiveSerializedTwoBytes = new byte[] {(byte) 0x8A, 0x00};
byte[] fiveSerializedNBytes = new byte[] {(byte) 0x8A, (byte) 0x80, (byte) 0x80, .... , (byte) 0x80, 0x00};
{code}

I think shortest-form encoding is already an implicit and important assumption, 
and it would be OK to add it to the spec as a requirement for serialization.
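
To make the encodings above concrete, here is a minimal sketch of zig-zag 
varint handling. The class and method names are hypothetical and this is not 
the code of any Avro SDK; the decoder is deliberately lenient and, as an 
assumption of the sketch, puts no limit on the number of padding bytes.

```java
// Hypothetical illustration only; not taken from any Avro SDK.
final class ZigZagVarintSketch {

    // Lenient decode: accepts padded (non-shortest) encodings of a long.
    static long decodeLong(byte[] buf) {
        long raw = 0;
        int shift = 0;
        for (int i = 0; i < buf.length; i++) {
            int b = buf[i] & 0xFF;
            // Java shifts wrap modulo 64, so ignore payload bits past bit 63.
            if (shift < 64) {
                raw |= (long) (b & 0x7F) << shift;
            }
            shift += 7;
            if ((b & 0x80) == 0) {
                // Continuation bit clear: last byte. Undo the zig-zag mapping.
                return (raw >>> 1) ^ -(raw & 1);
            }
        }
        throw new IllegalArgumentException("truncated varint");
    }

    // Shortest-form encode: stops as soon as the remaining bits are all zero.
    static byte[] encodeLong(long n) {
        long raw = (n << 1) ^ (n >> 63); // zig-zag: small magnitudes stay small
        byte[] tmp = new byte[10];       // 64 bits need at most 10 groups of 7
        int len = 0;
        while ((raw & ~0x7FL) != 0) {
            tmp[len++] = (byte) ((raw & 0x7F) | 0x80);
            raw >>>= 7;
        }
        tmp[len++] = (byte) raw;
        return java.util.Arrays.copyOf(tmp, len);
    }

    public static void main(String[] args) {
        System.out.println(decodeLong(new byte[] {0x0A}));
        System.out.println(decodeLong(new byte[] {(byte) 0x8A, 0x00}));
        System.out.println(encodeLong(5).length);
    }
}
```

With this sketch, both the one-byte and the two-byte form decode to 5, while 
the encoder only ever produces the one-byte form.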

[~emkornfield] What do you think about leaving deserialization flexible?  That 
is, readers can accept these inefficient variable-length encodings if they're 
ever encountered, but writers must not generate them.  I'm reasonably certain 
that no SDK actually generates them, and shortest-form output is important for 
deterministic serialization.
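
As a rough sketch of what the stricter alternative would have to detect: a 
complete varint is padded exactly when it is longer than one byte and its 
final byte is zero.  The helper below is hypothetical (not an Avro API) and 
assumes the array holds exactly one varint.  A strict reader would reject 
input for which it returns false; a lenient reader would skip the check.

```java
// Hypothetical illustration only; not taken from any Avro SDK.
final class ShortestFormCheck {

    // Returns true if the single varint in 'varint' uses the fewest bytes.
    static boolean isShortestForm(byte[] varint) {
        if (varint.length == 0) {
            throw new IllegalArgumentException("empty input");
        }
        int last = varint.length - 1;
        // Every byte except the last must carry the continuation bit.
        for (int i = 0; i < last; i++) {
            if ((varint[i] & 0x80) == 0) {
                throw new IllegalArgumentException("more than one varint");
            }
        }
        if ((varint[last] & 0x80) != 0) {
            throw new IllegalArgumentException("truncated varint");
        }
        // A zero final byte after at least one other byte adds no information.
        return last == 0 || varint[last] != 0;
    }
}
```

Under these assumptions, {{0x0A}} passes the check and {{0x8A 0x00}} does not, 
even though both carry the same value.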

> Specify that varints should be encoded shortest possible way
> ------------------------------------------------------------
>
>                 Key: AVRO-3307
>                 URL: https://issues.apache.org/jira/browse/AVRO-3307
>             Project: Apache Avro
>          Issue Type: Wish
>          Components: spec
>    Affects Versions: 1.11.0
>            Reporter: Askar Safin
>            Priority: Minor
>
> The spec is underspecified. It doesn't say whether non-canonical varint 
> serializations are allowed (i.e. whether it is okay to serialize the number 
> "5" as two bytes). I propose to explicitly forbid such serializations, i.e. 
> to require readers to fail when reading such a serialization. This will 
> ensure (at least for simple schemas) that equal values serialize to equal 
> binary strings.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
