[ 
https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2048:
------------------------------
    Description: 
According to the 
[specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]:

bq. a string is encoded as a *long* followed by that many bytes of UTF-8 
encoded character data.

However, the decoder does not currently adhere to this:

{code:title=org.apache.avro.io.BinaryDecoder}
  @Override
  public Utf8 readString(Utf8 old) throws IOException {
    int length = readInt();
    Utf8 result = (old != null ? old : new Utf8());
    result.setByteLength(length);
    if (0 != length) {
      doReadBytes(result.getBytes(), 0, length);
    }
    return result;
  }
{code}

The first thing the code does here is read an *int* value, not a *long*.  
Because ints and longs share the same variable-length zig-zag encoding, this 
works whenever the encoded length fits in an int. However, there may be 
edge cases where the serializer writes a larger length value, whether 
erroneously or maliciously. Let us detect such cases gracefully and adhere 
more closely to the spec.
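
For illustration only, here is a minimal standalone sketch of the kind of 
check being proposed. It is not the contents of the attached patch and does 
not use Avro's own BinaryDecoder; the class name LongLengthStringReader, the 
maxBytes parameter, and the error messages are all hypothetical. It reads the 
length prefix as a zig-zag variable-length long and validates it before 
narrowing to an int.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Avro.
class LongLengthStringReader {

  /** Reads a variable-length zig-zag encoded long (at most 10 bytes). */
  static long readLong(InputStream in) throws IOException {
    long raw = 0;
    for (int shift = 0; shift < 70; shift += 7) {
      int b = in.read();
      if (b < 0) {
        throw new EOFException("Truncated variable-length value");
      }
      raw |= (long) (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        // Undo the zig-zag mapping: 0, 1, 2, 3, ... -> 0, -1, 1, -2, ...
        return (raw >>> 1) ^ -(raw & 1);
      }
    }
    throw new IOException("Variable-length value exceeds 10 bytes");
  }

  /**
   * Reads a length-prefixed UTF-8 string. The prefix is read as a long and
   * rejected if it is negative or larger than maxBytes (an assumed,
   * caller-chosen limit), so nothing is allocated for a bogus length.
   */
  static String readString(InputStream in, int maxBytes) throws IOException {
    long length = readLong(in);
    if (length < 0 || length > maxBytes) {
      throw new IOException("Malformed data: invalid string length " + length);
    }
    byte[] bytes = new byte[(int) length]; // safe: 0 <= length <= maxBytes
    int off = 0;
    while (off < bytes.length) {
      int n = in.read(bytes, off, bytes.length - off);
      if (n < 0) {
        throw new EOFException("Stream ended before string was fully read");
      }
      off += n;
    }
    return new String(bytes, StandardCharsets.UTF_8);
  }
}
```

With this sketch, the bytes 0x06 'f' 'o' 'o' decode to "foo", while a length 
prefix of 0x80 0x80 0x80 0x80 0x10 (which encodes 2147483648, one more than 
Integer.MAX_VALUE) is rejected with an IOException instead of being silently 
narrowed.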

  was:
According to the 
[specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]:

bq. a string is encoded as a *long* followed by that many bytes of UTF-8 
encoded character data.

However, that is currently not being adhered to:

{code:title=org.apache.avro.io.BinaryDecoder}
  @Override
  public Utf8 readString(Utf8 old) throws IOException {
    int length = readInt();
    Utf8 result = (old != null ? old : new Utf8());
    result.setByteLength(length);
    if (0 != length) {
      doReadBytes(result.getBytes(), 0, length);
    }
    return result;
  }
{code}

The first thing the code does here is to load an *int* value, not a *long*.  
Because of the variable length nature of the size, this will mostly work.  
However, there may be edge-cases where this is broken and the serializer is 
putting in large values erroneously or nefariously. Let us gracefully handle to 
detect such scenarios and more closely adhere to the spec.


> Avro Binary Decoding - Gracefully Handle Long Strings
> -----------------------------------------------------
>
>                 Key: AVRO-2048
>                 URL: https://issues.apache.org/jira/browse/AVRO-2048
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.7.7, 1.8.2
>            Reporter: BELUGA BEHR
>            Priority: Minor
>         Attachments: AVRO-2048.1.patch


