[jira] [Commented] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings
[ https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102826#comment-16102826 ]

Gabor Szadovszky commented on AVRO-2048:
----------------------------------------

Thanks, [~belugabehr], I'll wait another day for additional comments before committing. +1

> Avro Binary Decoding - Gracefully Handle Long Strings
> -----------------------------------------------------
>
>                 Key: AVRO-2048
>                 URL: https://issues.apache.org/jira/browse/AVRO-2048
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.7.7, 1.8.2
>            Reporter: BELUGA BEHR
>            Assignee: BELUGA BEHR
>            Priority: Minor
>         Attachments: AVRO-2048.1.patch, AVRO-2048.2.patch, AVRO-2048.3.patch
>
>
> According to the [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]:
> bq. a string is encoded as a *long* followed by that many bytes of UTF-8 encoded character data.
> However, that is currently not being adhered to:
> {code:title=org.apache.avro.io.BinaryDecoder}
>   @Override
>   public Utf8 readString(Utf8 old) throws IOException {
>     int length = readInt();
>     Utf8 result = (old != null ? old : new Utf8());
>     result.setByteLength(length);
>     if (0 != length) {
>       doReadBytes(result.getBytes(), 0, length);
>     }
>     return result;
>   }
> {code}
> The first thing the code does here is to load an *int* value, not a *long*. Because of the variable length nature of the size, this will mostly work. However, there may be edge-cases where the serializer is putting in large length values erroneously or nefariously. Let us gracefully detect such scenarios and more closely adhere to the spec.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
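The idea in the issue description (read the length as a *long*, then reject values that cannot be used as a Java array size) might look roughly like the sketch below. This is not the attached patch: the class name {{StringLengthSketch}}, both method names, the {{maxLength}} parameter and the exception types are assumptions made for illustration, and byte arrays are used instead of the decoder's internal buffer to keep the example self-contained.

```java
class StringLengthSketch {

  /** Decodes one Avro long (variable-length, zig-zag) from the start of buf. */
  static long decodeLong(byte[] buf) {
    long raw = 0L;
    // A 64-bit value needs at most ten 7-bit groups.
    for (int i = 0; i < 10 && i < buf.length; i++) {
      raw |= (long) (buf[i] & 0x7F) << (7 * i);
      if ((buf[i] & 0x80) == 0) {
        // Undo zig-zag: even raw values are non-negative, odd ones negative.
        return (raw >>> 1) ^ -(raw & 1L);
      }
    }
    throw new IllegalArgumentException("Truncated or invalid long encoding");
  }

  /** Hypothetical guard: narrows a decoded length to int or fails loudly. */
  static int checkedLength(long length, int maxLength) {
    if (length < 0L) {
      throw new IllegalArgumentException("Malformed data: negative length " + length);
    }
    if (length > maxLength) {
      throw new UnsupportedOperationException(
          "Cannot read strings longer than " + maxLength + " bytes, got " + length);
    }
    return (int) length; // safe: 0 <= length <= maxLength <= Integer.MAX_VALUE
  }

  public static void main(String[] args) {
    byte[] three = {0x06}; // zig-zag encoding of 3
    System.out.println(checkedLength(decodeLong(three), Integer.MAX_VALUE - 8)); // prints 3
  }
}
```

With a guard of this shape, an oversized or negative length is reported before any buffer is resized, rather than surfacing later as an allocation failure.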
[jira] [Commented] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings
[ https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101786#comment-16101786 ]

BELUGA BEHR commented on AVRO-2048:
-----------------------------------

[~gszadovszky] I borrowed the same implementation from [here|http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/tip/src/share/classes/java/util/ArrayList.java#l190]
[jira] [Commented] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings
[ https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101657#comment-16101657 ]

Gabor Szadovszky commented on AVRO-2048:
----------------------------------------

[~belugabehr], the actual string size (or any array size) usually cannot be {{Integer.MAX_VALUE}} as "_Some VMs reserve some header words in an array._". See e.g. {{java.util.ArrayList.MAX_ARRAY_SIZE}}.
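For reference, the limit mentioned in the comment above can be sketched as follows. In the OpenJDK 7/8 sources, {{java.util.ArrayList.MAX_ARRAY_SIZE}} is a private constant equal to {{Integer.MAX_VALUE - 8}}, so code outside the JDK has to define its own copy. The class name {{ArrayCapSketch}} and the helper {{fitsInArray}} are hypothetical.

```java
class ArrayCapSketch {

  // Same value as the private constant java.util.ArrayList.MAX_ARRAY_SIZE
  // in OpenJDK 7/8; some VMs reserve header words in an array, so asking
  // for Integer.MAX_VALUE elements can fail even with enough heap.
  static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

  /** Hypothetical helper: true if a decoded length is a usable array size. */
  static boolean fitsInArray(long requested) {
    return requested >= 0L && requested <= MAX_ARRAY_SIZE;
  }

  public static void main(String[] args) {
    System.out.println(fitsInArray(1024L));
    System.out.println(fitsInArray(Integer.MAX_VALUE));
  }
}
```

Comparing against this cap rather than {{Integer.MAX_VALUE}} trades away eight possible lengths for a check that is conservative across VMs.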
[jira] [Commented] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings
[ https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097362#comment-16097362 ]

BELUGA BEHR commented on AVRO-2048:
-----------------------------------

I can't explain why this is, but it seems to be a tad faster with this patch:

{code}
# Avro Master
StringRead: 3291 ms 12.152 432.835 1780910
StringRead: 3290 ms 12.155 432.949 1780910
StringRead: 3287 ms 12.166 433.320 1780910

# Avro Master + Patch
StringRead: 3270 ms 12.229 435.574 1780910
StringRead: 3288 ms 12.163 433.241 1780910
StringRead: 3271 ms 12.227 435.521 1780910
{code}
[jira] [Commented] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings
[ https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095807#comment-16095807 ]

Suraj Acharya commented on AVRO-2048:
-------------------------------------

Seems like the code is not passing checkstyle. The artifacts are present here: https://builds.apache.org/job/PreCommit-AVRO-Build-TEST/32/artifact/component/patchprocess/test--lang_java.txt

The value is {{0l}}. Change it to {{0L}} and it should pass.

{code}
[INFO]
[INFO] --- maven-checkstyle-plugin:2.17:check (checkstyle-check) @ avro ---
[INFO] Starting audit...
/testptch/avro/lang/java/avro/src/main/java/org/apache/avro/io/BinaryDecoder.java:263:19: error: Should use uppercase 'L'.
/testptch/avro/lang/java/avro/src/main/java/org/apache/avro/io/BinaryDecoder.java:268:10: error: Should use uppercase 'L'.
Audit done.
[INFO] There are 2 errors reported by Checkstyle 6.11.2 with checkstyle.xml ruleset.
[ERROR] src/main/java/org/apache/avro/io/BinaryDecoder.java:[263,19] (misc) UpperEll: Should use uppercase 'L'.
[ERROR] src/main/java/org/apache/avro/io/BinaryDecoder.java:[268,10] (misc) UpperEll: Should use uppercase 'L'.
{code}
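As a small illustration of the {{UpperEll}} finding above: both suffixes produce the same {{long}} value, and the rule exists only because a lowercase {{l}} is easy to misread as the digit {{1}}. The class and constant names below are made up for the example and are not taken from the patch.

```java
class UpperEllSketch {

  // Legal Java, but flagged by Checkstyle's UpperEll check:
  // "10l" is easily misread as the int literal 101.
  static final long LOWERCASE_SUFFIX = 10l;

  // Preferred spelling; the value is identical.
  static final long UPPERCASE_SUFFIX = 10L;

  public static void main(String[] args) {
    System.out.println(LOWERCASE_SUFFIX == UPPERCASE_SUFFIX);
  }
}
```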
[jira] [Commented] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings
[ https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095803#comment-16095803 ]

Hadoop QA commented on AVRO-2048:
---------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 13s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} buildtest {color} | {color:green} 0m 0s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} buildtest {color} | {color:red} 1m 36s{color} | {color:red} java in the patch failed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 1m 52s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.13.1 Server=1.13.1 Image:yetus/avro:793178a |
| JIRA Issue | AVRO-2048 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12877398/AVRO-2048.1.patch |
| Optional Tests | buildtest javac |
| uname | Linux 8139c24a523e 3.13.0-117-generic #164-Ubuntu SMP Fri Apr 7 11:05:26 UTC 2017 x86_64 GNU/Linux |
| Build tool | build |
| git revision | master / 793178a |
| Default Java | 1.7.0_111 |
| buildtest | https://builds.apache.org/job/PreCommit-AVRO-Build-TEST/32/artifact/patchprocess/test--lang_java.txt |
| modules | C: lang/java U: lang/java |
| Console output | https://builds.apache.org/job/PreCommit-AVRO-Build-TEST/32/console |
| Powered by | Apache Yetus 0.4.0 http://yetus.apache.org |

This message was automatically generated.