[jira] [Assigned] (AVRO-2310) GenericData Array Add Use JDK Copy Facilities
[ https://issues.apache.org/jira/browse/AVRO-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR reassigned AVRO-2310: - Assignee: BELUGA BEHR > GenericData Array Add Use JDK Copy Facilities > - > > Key: AVRO-2310 > URL: https://issues.apache.org/jira/browse/AVRO-2310 > Project: Apache Avro > Issue Type: Improvement >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > > Use {{Arrays.copyOf}} to increase readability -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (AVRO-2310) GenericData Array Add Use JDK Copy Facilities
[ https://issues.apache.org/jira/browse/AVRO-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2310: -- Summary: GenericData Array Add Use JDK Copy Facilities (was: GenericData Array Add Use Latest Facilities) > GenericData Array Add Use JDK Copy Facilities > - > > Key: AVRO-2310 > URL: https://issues.apache.org/jira/browse/AVRO-2310 > Project: Apache Avro > Issue Type: Improvement >Reporter: BELUGA BEHR >Priority: Trivial > > Use {{Arrays.copyOf}} to increase readability -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AVRO-2310) GenericData Array Add Use Latest Facilities
BELUGA BEHR created AVRO-2310: - Summary: GenericData Array Add Use Latest Facilities Key: AVRO-2310 URL: https://issues.apache.org/jira/browse/AVRO-2310 Project: Apache Avro Issue Type: Improvement Reporter: BELUGA BEHR Use {{Arrays.copyOf}} to increase readability -- This message was sent by Atlassian JIRA (v7.6.3#76005)
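The proposal in AVRO-2310 amounts to replacing a hand-rolled allocate-and-copy with a single JDK call. A minimal illustrative sketch (this is not the actual GenericData.Array patch; the array and variable names are made up):

```java
import java.util.Arrays;

public class CopyOfSketch {
    public static void main(String[] args) {
        Object[] elements = {"a", "b", "c"};
        int newCapacity = 6;

        // Hand-rolled growth of the kind the issue proposes replacing:
        Object[] manual = new Object[newCapacity];
        System.arraycopy(elements, 0, manual, 0, elements.length);

        // The more readable JDK equivalent: allocates and copies in one call,
        // padding the tail with nulls.
        Object[] grown = Arrays.copyOf(elements, newCapacity);

        System.out.println(Arrays.equals(manual, grown)); // true
    }
}
```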
[jira] [Updated] (AVRO-2309) Conversions.java - Use JDK Array Manipulations
[ https://issues.apache.org/jira/browse/AVRO-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2309: -- Summary: Conversions.java - Use JDK Array Manipulations (was: Conversions.java - Use Fill Method) > Conversions.java - Use JDK Array Manipulations > -- > > Key: AVRO-2309 > URL: https://issues.apache.org/jira/browse/AVRO-2309 > Project: Apache Avro > Issue Type: Improvement >Reporter: BELUGA BEHR >Priority: Trivial > > {code:java|title=Conversions.java} > for (int i = 0; i < bytes.length; i += 1) { > if (i < offset) { > bytes[i] = fillByte; > } else { > bytes[i] = unscaled[i - offset]; > } > } > {code} > Improve readability with {{Arrays.fill}} method to fill the first {{offset}} > bytes and then copy over the remaining. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
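The rewrite AVRO-2309 suggests can be sketched as follows, reusing the names from the quoted loop ({{bytes}}, {{offset}}, {{fillByte}}, {{unscaled}}); the surrounding values are made-up test data, not Conversions.java itself:

```java
import java.util.Arrays;

public class FillAndCopySketch {
    public static void main(String[] args) {
        byte[] unscaled = {0x01, 0x02};
        int offset = 3;
        byte fillByte = (byte) 0xFF;
        byte[] bytes = new byte[offset + unscaled.length];

        // Equivalent of the quoted per-element loop, as two bulk operations:
        Arrays.fill(bytes, 0, offset, fillByte);                       // pad the first `offset` bytes
        System.arraycopy(unscaled, 0, bytes, offset, unscaled.length); // then copy the remainder

        System.out.println(Arrays.toString(bytes)); // [-1, -1, -1, 1, 2]
    }
}
```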
[jira] [Created] (AVRO-2308) Use JDK1.7 StandardCharsets
BELUGA BEHR created AVRO-2308: - Summary: Use JDK1.7 StandardCharsets Key: AVRO-2308 URL: https://issues.apache.org/jira/browse/AVRO-2308 Project: Apache Avro Issue Type: Improvement Reporter: BELUGA BEHR Use Java 1.7 [StandardCharsets|https://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html]. Every JDK must now include support for several common charsets. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
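The benefit of AVRO-2308 in practice: the pre-JDK-7 API takes a charset by name and forces callers to handle a checked exception that can never occur for the standard charsets, while the constants cannot fail. A small illustrative sketch, not taken from the Avro codebase:

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class CharsetSketch {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "avro";

        // Pre-JDK-7 style: charset passed by name; the checked exception can
        // never actually happen for UTF-8, but callers must declare or catch it.
        byte[] byName = s.getBytes("UTF-8");

        // JDK 7+ style: guaranteed-present constant, no checked exception.
        byte[] byConstant = s.getBytes(StandardCharsets.UTF_8);

        System.out.println(java.util.Arrays.equals(byName, byConstant)); // true
    }
}
```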
[jira] [Commented] (AVRO-2061) Improve Invalid File Format Error Message
[ https://issues.apache.org/jira/browse/AVRO-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16118953#comment-16118953 ] BELUGA BEHR commented on AVRO-2061: --- [~nkollar] [~cutting] There ya go! Patch submitted. :) Thanks. > Improve Invalid File Format Error Message > - > > Key: AVRO-2061 > URL: https://issues.apache.org/jira/browse/AVRO-2061 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.9.0, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2061.1.patch, AVRO-2061.2.patch > > > {code:title=org.apache.avro.file.DataFileStream} > try { > vin.readFixed(magic); // read magic > } catch (IOException e) { > throw new IOException("Not a data file.", e); > } > if (!Arrays.equals(DataFileConstants.MAGIC, magic)) > throw new IOException("Not a data file."); > {code} > Please consider improving the error message here. I just saw a MapReduce job > fail with an IOException with the message "Not a data file." There was > definitely data files in the input directory, however, they were not Avro > files. It would have been much more helpful if it told me that. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
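One possible shape for the improved check, using Avro's real file header (the ASCII bytes {{Obj}} followed by the version byte 1); the class name and the exact message wording here are hypothetical, not the contents of the attached patches:

```java
import java.io.IOException;
import java.util.Arrays;

public class MagicCheckSketch {
    // Avro's actual file header: the ASCII bytes "Obj" followed by version 1.
    static final byte[] MAGIC = {'O', 'b', 'j', 1};

    static void validateMagic(byte[] magic) throws IOException {
        if (!Arrays.equals(MAGIC, magic)) {
            // A more descriptive message along the lines the issue asks for:
            // say what was read and what an Avro data file should start with.
            throw new IOException("Not an Avro data file: read magic "
                + Arrays.toString(magic) + " but expected " + Arrays.toString(MAGIC));
        }
    }

    public static void main(String[] args) throws IOException {
        validateMagic(new byte[]{'O', 'b', 'j', 1}); // valid header, no exception
        try {
            validateMagic(new byte[]{'P', 'K', 3, 4}); // e.g. a ZIP file header
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```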
[jira] [Updated] (AVRO-2061) Improve Invalid File Format Error Message
[ https://issues.apache.org/jira/browse/AVRO-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2061: -- Attachment: AVRO-2061.2.patch > Improve Invalid File Format Error Message > - > > Key: AVRO-2061 > URL: https://issues.apache.org/jira/browse/AVRO-2061 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.9.0, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2061.1.patch, AVRO-2061.2.patch > > > {code:title=org.apache.avro.file.DataFileStream} > try { > vin.readFixed(magic); // read magic > } catch (IOException e) { > throw new IOException("Not a data file.", e); > } > if (!Arrays.equals(DataFileConstants.MAGIC, magic)) > throw new IOException("Not a data file."); > {code} > Please consider improving the error message here. I just saw a MapReduce job > fail with an IOException with the message "Not a data file." There was > definitely data files in the input directory, however, they were not Avro > files. It would have been much more helpful if it told me that. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (AVRO-1786) Strange IndexOutofBoundException in GenericDatumReader.readString
[ https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16114574#comment-16114574 ] BELUGA BEHR edited comment on AVRO-1786 at 8/4/17 4:17 PM: --- In regards to the original case, it is interesting to note that the Reducer failed while [deserializing the key|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/ReduceContextImpl.java#L142]. However, the Map output key is of type {{mergeJob.setMapOutputKeyClass(Contact2MergePartitionKey.class)}} which is not an {{AvroKey}} class. So, I'm not sure why the Avro deserializer is in play here at all. was (Author: belugabehr): In regards to the original case, it is interesting to note that the Reducer failed while [deserializing the key|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/ReduceContextImpl.java#L142]. However, the Map output key is of type {{mergeJob.setMapOutputKeyClass(Contact2MergePartitionKey.class)}} which is not an {{AvroKey}} class. > Strange IndexOutofBoundException in GenericDatumReader.readString > - > > Key: AVRO-1786 > URL: https://issues.apache.org/jira/browse/AVRO-1786 > Project: Avro > Issue Type: Bug > Components: java >Affects Versions: 1.7.4, 1.7.7 > Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64 > Use IBM JVM: > IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References > 20140515_199835 (JIT enabled, AOT enabled) >Reporter: Yong Zhang >Priority: Minor > > Our production cluster is CENTOS 6.5 (2.6.32-358.14.1.el6.x86_64), running > IBM BigInsight V3.0.0.2. 
In Apache term, it is Hadoop 2.2.0 with MRV1(no > yarn), and comes with AVRO 1.7.4, running with IBM J9 VM (build 2.7, JRE > 1.7.0 Linux amd64-64 Compressed References 20140515_199835 (JIT enabled, AOT > enabled). Not sure if the JDK matters, but it is NOT Oracle JVM. > We have a ETL implemented in a chain of MR jobs. In one MR job, it is going > to merge 2 sets of AVRO data. Dataset1 is in HDFS location A, and Dataset2 is > in HDFS location B, and both contains the AVRO records binding to the same > AVRO schema. The record contains an unique id field, and a timestamp field. > The MR job is to merge the records based on the ID, and use the later > timestamp record to replace previous timestamp record, and omit the final > AVRO record out. Very straightforward. > Now we faced a problem that one reducer keeps failing with the following > stacktrace on JobTracker: > {code} > java.lang.IndexOutOfBoundsException > at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:191) > at java.io.DataInputStream.read(DataInputStream.java:160) > at > org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:184) > at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263) > at > org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107) > at > org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:143) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:125) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:121) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154) > at > org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139) > 
at > org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:108) > at > org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:48) > at > org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:142) > at > org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:117) > at > org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:297) > at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:165) > at > org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420) > at org.apache.hadoop.mapred.Child$4.run(Child.java:255) > at
[jira] [Commented] (AVRO-1786) Strange IndexOutofBoundException in GenericDatumReader.readString
[ https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16114574#comment-16114574 ] BELUGA BEHR commented on AVRO-1786: --- In regards to the original case, it is interesting to note that the Reducer failed while [deserializing the key|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/ReduceContextImpl.java#L142]. However, the Map output key is of type {{mergeJob.setMapOutputKeyClass(Contact2MergePartitionKey.class}} which is not an Avro class. > Strange IndexOutofBoundException in GenericDatumReader.readString > - > > Key: AVRO-1786 > URL: https://issues.apache.org/jira/browse/AVRO-1786 > Project: Avro > Issue Type: Bug > Components: java >Affects Versions: 1.7.4, 1.7.7 > Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64 > Use IBM JVM: > IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References > 20140515_199835 (JIT enabled, AOT enabled) >Reporter: Yong Zhang >Priority: Minor > > Our production cluster is CENTOS 6.5 (2.6.32-358.14.1.el6.x86_64), running > IBM BigInsight V3.0.0.2. In Apache term, it is Hadoop 2.2.0 with MRV1(no > yarn), and comes with AVRO 1.7.4, running with IBM J9 VM (build 2.7, JRE > 1.7.0 Linux amd64-64 Compressed References 20140515_199835 (JIT enabled, AOT > enabled). Not sure if the JDK matters, but it is NOT Oracle JVM. > We have a ETL implemented in a chain of MR jobs. In one MR job, it is going > to merge 2 sets of AVRO data. Dataset1 is in HDFS location A, and Dataset2 is > in HDFS location B, and both contains the AVRO records binding to the same > AVRO schema. The record contains an unique id field, and a timestamp field. > The MR job is to merge the records based on the ID, and use the later > timestamp record to replace previous timestamp record, and omit the final > AVRO record out. 
Very straightforward. > Now we faced a problem that one reducer keeps failing with the following > stacktrace on JobTracker: > {code} > java.lang.IndexOutOfBoundsException > at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:191) > at java.io.DataInputStream.read(DataInputStream.java:160) > at > org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:184) > at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263) > at > org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107) > at > org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:143) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:125) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:121) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154) > at > org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139) > at > org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:108) > at > org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:48) > at > org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:142) > at > org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:117) > at > org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:297) > at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:165) > at > org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420) > at org.apache.hadoop.mapred.Child$4.run(Child.java:255) > at > 
java.security.AccessController.doPrivileged(AccessController.java:366) > at javax.security.auth.Subject.doAs(Subject.java:572) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1502) > at org.apache.hadoop.mapred.Child.main(Child.java:249) > {code} > Here are my Mapper and Reducer methods: > Mapper: > public void map(AvroKey key, NullWritable value, Context > context) throws IOException, InterruptedException > Reducer: > protected void reduce(CustomPartitionKeyClass key, > Iterable values, Context context) throws > IOException, InterruptedException > What
[jira] [Comment Edited] (AVRO-1786) Strange IndexOutofBoundException in GenericDatumReader.readString
[ https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16114574#comment-16114574 ] BELUGA BEHR edited comment on AVRO-1786 at 8/4/17 4:16 PM: --- In regards to the original case, it is interesting to note that the Reducer failed while [deserializing the key|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/ReduceContextImpl.java#L142]. However, the Map output key is of type {{mergeJob.setMapOutputKeyClass(Contact2MergePartitionKey.class}} which is not an {{AvroKey}} class. was (Author: belugabehr): In regards to the original case, it is interesting to note that the Reducer failed while [deserializing the key|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/ReduceContextImpl.java#L142]. However, the Map output key is of type {{mergeJob.setMapOutputKeyClass(Contact2MergePartitionKey.class}} which is not an Avro class. > Strange IndexOutofBoundException in GenericDatumReader.readString > - > > Key: AVRO-1786 > URL: https://issues.apache.org/jira/browse/AVRO-1786 > Project: Avro > Issue Type: Bug > Components: java >Affects Versions: 1.7.4, 1.7.7 > Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64 > Use IBM JVM: > IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References > 20140515_199835 (JIT enabled, AOT enabled) >Reporter: Yong Zhang >Priority: Minor > > Our production cluster is CENTOS 6.5 (2.6.32-358.14.1.el6.x86_64), running > IBM BigInsight V3.0.0.2. In Apache term, it is Hadoop 2.2.0 with MRV1(no > yarn), and comes with AVRO 1.7.4, running with IBM J9 VM (build 2.7, JRE > 1.7.0 Linux amd64-64 Compressed References 20140515_199835 (JIT enabled, AOT > enabled). 
Not sure if the JDK matters, but it is NOT Oracle JVM. > We have a ETL implemented in a chain of MR jobs. In one MR job, it is going > to merge 2 sets of AVRO data. Dataset1 is in HDFS location A, and Dataset2 is > in HDFS location B, and both contains the AVRO records binding to the same > AVRO schema. The record contains an unique id field, and a timestamp field. > The MR job is to merge the records based on the ID, and use the later > timestamp record to replace previous timestamp record, and omit the final > AVRO record out. Very straightforward. > Now we faced a problem that one reducer keeps failing with the following > stacktrace on JobTracker: > {code} > java.lang.IndexOutOfBoundsException > at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:191) > at java.io.DataInputStream.read(DataInputStream.java:160) > at > org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:184) > at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263) > at > org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107) > at > org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:143) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:125) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:121) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154) > at > org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139) > at > org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:108) > at > org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:48) > at > 
org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:142) > at > org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:117) > at > org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:297) > at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:165) > at > org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420) > at org.apache.hadoop.mapred.Child$4.run(Child.java:255) > at > java.security.AccessController.doPrivileged(AccessController.java:366) >
[jira] [Comment Edited] (AVRO-1786) Strange IndexOutofBoundException in GenericDatumReader.readString
[ https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16114574#comment-16114574 ] BELUGA BEHR edited comment on AVRO-1786 at 8/4/17 4:16 PM: --- In regards to the original case, it is interesting to note that the Reducer failed while [deserializing the key|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/ReduceContextImpl.java#L142]. However, the Map output key is of type {{mergeJob.setMapOutputKeyClass(Contact2MergePartitionKey.class)}} which is not an {{AvroKey}} class. was (Author: belugabehr): In regards to the original case, it is interesting to note that the Reducer failed while [deserializing the key|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/ReduceContextImpl.java#L142]. However, the Map output key is of type {{mergeJob.setMapOutputKeyClass(Contact2MergePartitionKey.class}} which is not an {{AvroKey}} class. > Strange IndexOutofBoundException in GenericDatumReader.readString > - > > Key: AVRO-1786 > URL: https://issues.apache.org/jira/browse/AVRO-1786 > Project: Avro > Issue Type: Bug > Components: java >Affects Versions: 1.7.4, 1.7.7 > Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64 > Use IBM JVM: > IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References > 20140515_199835 (JIT enabled, AOT enabled) >Reporter: Yong Zhang >Priority: Minor > > Our production cluster is CENTOS 6.5 (2.6.32-358.14.1.el6.x86_64), running > IBM BigInsight V3.0.0.2. In Apache term, it is Hadoop 2.2.0 with MRV1(no > yarn), and comes with AVRO 1.7.4, running with IBM J9 VM (build 2.7, JRE > 1.7.0 Linux amd64-64 Compressed References 20140515_199835 (JIT enabled, AOT > enabled). 
Not sure if the JDK matters, but it is NOT Oracle JVM. > We have a ETL implemented in a chain of MR jobs. In one MR job, it is going > to merge 2 sets of AVRO data. Dataset1 is in HDFS location A, and Dataset2 is > in HDFS location B, and both contains the AVRO records binding to the same > AVRO schema. The record contains an unique id field, and a timestamp field. > The MR job is to merge the records based on the ID, and use the later > timestamp record to replace previous timestamp record, and omit the final > AVRO record out. Very straightforward. > Now we faced a problem that one reducer keeps failing with the following > stacktrace on JobTracker: > {code} > java.lang.IndexOutOfBoundsException > at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:191) > at java.io.DataInputStream.read(DataInputStream.java:160) > at > org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:184) > at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263) > at > org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107) > at > org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:143) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:125) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:121) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154) > at > org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139) > at > org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:108) > at > org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:48) > at > 
org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:142) > at > org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:117) > at > org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:297) > at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:165) > at > org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420) > at org.apache.hadoop.mapred.Child$4.run(Child.java:255) > at >
[jira] [Assigned] (AVRO-2061) Improve Invalid File Format Error Message
[ https://issues.apache.org/jira/browse/AVRO-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR reassigned AVRO-2061: - Assignee: BELUGA BEHR > Improve Invalid File Format Error Message > - > > Key: AVRO-2061 > URL: https://issues.apache.org/jira/browse/AVRO-2061 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.9.0, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2061.1.patch > > > {code:title=org.apache.avro.file.DataFileStream} > try { > vin.readFixed(magic); // read magic > } catch (IOException e) { > throw new IOException("Not a data file.", e); > } > if (!Arrays.equals(DataFileConstants.MAGIC, magic)) > throw new IOException("Not a data file."); > {code} > Please consider improving the error message here. I just saw a MapReduce job > fail with an IOException with the message "Not a data file." There was > definitely data files in the input directory, however, they were not Avro > files. It would have been much more helpful if it told me that. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2061) Improve Invalid File Format Error Message
[ https://issues.apache.org/jira/browse/AVRO-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2061: -- Status: Patch Available (was: Open) > Improve Invalid File Format Error Message > - > > Key: AVRO-2061 > URL: https://issues.apache.org/jira/browse/AVRO-2061 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.8.2, 1.9.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2061.1.patch > > > {code:title=org.apache.avro.file.DataFileStream} > try { > vin.readFixed(magic); // read magic > } catch (IOException e) { > throw new IOException("Not a data file.", e); > } > if (!Arrays.equals(DataFileConstants.MAGIC, magic)) > throw new IOException("Not a data file."); > {code} > Please consider improving the error message here. I just saw a MapReduce job > fail with an IOException with the message "Not a data file." There was > definitely data files in the input directory, however, they were not Avro > files. It would have been much more helpful if it told me that. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2061) Improve Invalid File Format Error Message
[ https://issues.apache.org/jira/browse/AVRO-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2061: -- Attachment: AVRO-2061.1.patch > Improve Invalid File Format Error Message > - > > Key: AVRO-2061 > URL: https://issues.apache.org/jira/browse/AVRO-2061 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.9.0, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2061.1.patch > > > {code:title=org.apache.avro.file.DataFileStream} > try { > vin.readFixed(magic); // read magic > } catch (IOException e) { > throw new IOException("Not a data file.", e); > } > if (!Arrays.equals(DataFileConstants.MAGIC, magic)) > throw new IOException("Not a data file."); > {code} > Please consider improving the error message here. I just saw a MapReduce job > fail with an IOException with the message "Not a data file." There was > definitely data files in the input directory, however, they were not Avro > files. It would have been much more helpful if it told me that. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AVRO-1786) Strange IndexOutofBoundException in GenericDatumReader.readString
[ https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112014#comment-16112014 ] BELUGA BEHR commented on AVRO-1786: --- [~cutting] [~rdblue] I've been tearing through the code of Avro and MapReduce to find a chink in the armor, but no luck. Any thoughts on where I can focus my search? > Strange IndexOutofBoundException in GenericDatumReader.readString > - > > Key: AVRO-1786 > URL: https://issues.apache.org/jira/browse/AVRO-1786 > Project: Avro > Issue Type: Bug > Components: java >Affects Versions: 1.7.4, 1.7.7 > Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64 > Use IBM JVM: > IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References > 20140515_199835 (JIT enabled, AOT enabled) >Reporter: Yong Zhang >Priority: Minor > > Our production cluster is CENTOS 6.5 (2.6.32-358.14.1.el6.x86_64), running > IBM BigInsight V3.0.0.2. In Apache term, it is Hadoop 2.2.0 with MRV1(no > yarn), and comes with AVRO 1.7.4, running with IBM J9 VM (build 2.7, JRE > 1.7.0 Linux amd64-64 Compressed References 20140515_199835 (JIT enabled, AOT > enabled). Not sure if the JDK matters, but it is NOT Oracle JVM. > We have a ETL implemented in a chain of MR jobs. In one MR job, it is going > to merge 2 sets of AVRO data. Dataset1 is in HDFS location A, and Dataset2 is > in HDFS location B, and both contains the AVRO records binding to the same > AVRO schema. The record contains an unique id field, and a timestamp field. > The MR job is to merge the records based on the ID, and use the later > timestamp record to replace previous timestamp record, and omit the final > AVRO record out. Very straightforward. 
> Now we faced a problem that one reducer keeps failing with the following > stacktrace on JobTracker: > {code} > java.lang.IndexOutOfBoundsException > at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:191) > at java.io.DataInputStream.read(DataInputStream.java:160) > at > org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:184) > at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263) > at > org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107) > at > org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:143) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:125) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:121) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154) > at > org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139) > at > org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:108) > at > org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:48) > at > org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:142) > at > org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:117) > at > org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:297) > at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:165) > at > org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420) > at org.apache.hadoop.mapred.Child$4.run(Child.java:255) > at > 
java.security.AccessController.doPrivileged(AccessController.java:366) > at javax.security.auth.Subject.doAs(Subject.java:572) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1502) > at org.apache.hadoop.mapred.Child.main(Child.java:249) > {code} > Here are my Mapper and Reducer methods: > Mapper: > public void map(AvroKey key, NullWritable value, Context > context) throws IOException, InterruptedException > Reducer: > protected void reduce(CustomPartitionKeyClass key, > Iterable values, Context context) throws > IOException, InterruptedException > What bothers me are the following facts: > 1) All the mappers finish without error > 2) Most of the reducers finish without error, but one reducer keeps failing > with the above error. > 3) It looks like it is caused by the data? But keep in mind that all the avro > records passed the mapper side, but failed in one reducer. >
[jira] [Comment Edited] (AVRO-1786) Strange IndexOutofBoundException in GenericDatumReader.readString
[ https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110083#comment-16110083 ] BELUGA BEHR edited comment on AVRO-1786 at 8/2/17 1:10 AM: --- I created a Fuzzer to test random combinations of values to try to reproduce this issue, but no luck. During my investigation, I hit [HADOOP-11678] because I forgot to set the Serializer to use {{org.apache.avro.hadoop.io.AvroSerialization}} and I was instead getting the default Hadoop Avro Serialization class of {{org.apache.hadoop.io.serializer.avro.AvroSerialization}}, which is very broken because it erroneously buffers data. However, during the time I was not setting the Serialization class correctly, I was getting some failures with stack traces that are very similar to the one reported here. So, it makes me think that there may be a buffer that isn't being flushed during serialization. Once I corrected the mistake, my Fuzzer ran for several hours and did not encounter any issues serializing or deserializing the random data. I was serializing the data directly to an [IFile|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/IFile.java] Writer and then using the IFile Reader to deserialize the data back into objects. We are observing the same behaviors: # All Mappers finish without error # Most of the reducers finish without error, but one reducer keeps failing with the above error. Interestingly, our Map Input and Output are the same classes. That is to say, the Mappers filter and normalize the data; they don't create it. The {{AvroKey}} and {{AvroValue}} parameters passed to the Mapper are then returned by the Mapper. There are no {{AvroKey}} or {{AvroValue}} classes instantiated in the Mapper. They are passed directly through the Map method. 
This demonstrates that the entire data set can be read by the Avro serialization framework, but something goes wrong during the intermediate file stage. This may be an issue with how the Hadoop MapReduce framework interacts with Avro's library. was (Author: belugabehr): I created a Fuzzer to test random combinations of values to try to reproduce this issue, but no luck. During my investigation, I hit [HADOOP-11678] because I forgot to set the Serializer to use {{org.apache.avro.hadoop.io.AvroSerialization}} and I was instead getting the default Hadoop Avro Serialization class of {{org.apache.hadoop.io.serializer.avro.AvroSerialization}} which is very broken because it's erroneously buffering data. However, during the time I was not setting the Serialization class correctly, I was getting some very similar errors that is recorded here. So, it makes me think that there may be a buffer that isn't being flushed during serialization. Once i corrected the mistake, my Fuzzer ran for several hours and did not encounter any issues serializing or deserializing the random data. I was serializing the data directly to an [IFile|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/IFile.java] Reader and then using the IFile Reader to deserialize the back into objects. We are observing the same behaviors: # All Mappers finish without error # Most of the reducers finish without error, but one reducer keeps failing with the above error. Interestingly, our Map Input and Output are the same classes. That is to say, the Mappers filter and normalize the data, they don't create it. The {{AvroKey}} and {{AvroValue}} parameters passed to the Mapper are then returned by the Mapper. There are no {{AvroKey}} or {{AvroValue}} classes instantiated in the Mapper. They are passed directly through the Map method. 
This demonstrates that the entire data set can be read by the Avro serialization framework, but something goes wrong during the intermediate file stage. This may be an issue with how the Hadoop MapReduce framework interacts with Avro's library. > Strange IndexOutofBoundException in GenericDatumReader.readString > - > > Key: AVRO-1786 > URL: https://issues.apache.org/jira/browse/AVRO-1786 > Project: Avro > Issue Type: Bug > Components: java >Affects Versions: 1.7.4, 1.7.7 > Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64 > Use IBM JVM: > IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References > 20140515_199835 (JIT enabled, AOT enabled) >Reporter: Yong Zhang >Priority: Minor > > Our production cluster is CENTOS 6.5 (2.6.32-358.14.1.el6.x86_64), running > IBM BigInsight
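The round-trip fuzzing approach described in the comment above can be sketched as a stand-alone check. This is not the Avro library or the actual Fuzzer: the code below merely mirrors Avro's documented wire format for strings (zig-zag varint length followed by UTF-8 bytes), and all class and method names are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Random;

// Simplified stand-in for the fuzzer: round-trips random strings through an
// Avro-style encoding (zig-zag varint length + UTF-8 bytes) and verifies that
// what is deserialized matches what was serialized.
public class StringRoundTripFuzzer {

  // Zig-zag varint encoding, as Avro uses for long/int values.
  static void writeLong(long n, ByteArrayOutputStream out) {
    long z = (n << 1) ^ (n >> 63); // zig-zag: small magnitudes -> small codes
    while ((z & ~0x7FL) != 0) {
      out.write((int) ((z & 0x7F) | 0x80)); // 7 data bits + continuation bit
      z >>>= 7;
    }
    out.write((int) z);
  }

  static long readLong(InputStream in) throws IOException {
    long z = 0;
    int shift = 0, b;
    do {
      b = in.read();
      if (b < 0) throw new IOException("EOF inside varint");
      z |= (long) (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return (z >>> 1) ^ -(z & 1); // undo zig-zag
  }

  static byte[] writeString(String s) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    writeLong(utf8.length, out);
    out.writeBytes(utf8);
    return out.toByteArray();
  }

  static String readString(byte[] data) throws IOException {
    ByteArrayInputStream in = new ByteArrayInputStream(data);
    int length = (int) readLong(in);
    byte[] utf8 = in.readNBytes(length);
    if (utf8.length != length) throw new IOException("Truncated string");
    return new String(utf8, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    Random rnd = new Random(42);
    for (int i = 0; i < 10_000; i++) {
      int len = rnd.nextInt(64);
      StringBuilder sb = new StringBuilder();
      for (int j = 0; j < len; j++) sb.append((char) ('a' + rnd.nextInt(26)));
      String s = sb.toString();
      if (!s.equals(readString(writeString(s))))
        throw new AssertionError("round-trip failed: " + s);
    }
  }
}
```

A direct byte-for-byte round trip like this never fails here, which is consistent with the comment's conclusion that the corruption appears only in the MapReduce intermediate-file stage, not in Avro's own encode/decode path.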
[jira] [Commented] (AVRO-1786) Strange IndexOutofBoundException in GenericDatumReader.readString
[ https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110083#comment-16110083 ] BELUGA BEHR commented on AVRO-1786: --- I created a Fuzzer to test random combinations of values to try to reproduce this issue, but no luck. During my investigation, I hit [HADOOP-11678] because I forgot to set the Serializer to use {{org.apache.avro.hadoop.io.AvroSerialization}} and I was instead getting the default Hadoop Avro Serialization class of {{org.apache.hadoop.io.serializer.avro.AvroSerialization}}, which is very broken because it erroneously buffers data. However, during the time I was not setting the Serialization class correctly, I was getting some errors very similar to the ones recorded here. So, it makes me think that there may be a buffer that isn't being flushed during serialization. Once I corrected the mistake, my Fuzzer ran for several hours and did not encounter any issues serializing or deserializing the random data. I was serializing the data directly to an [IFile|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/IFile.java] Writer and then using the IFile Reader to deserialize the data back into objects. We are observing the same behaviors: # All Mappers finish without error # Most of the reducers finish without error, but one reducer keeps failing with the above error. Interestingly, our Map Input and Output are the same classes. That is to say, the Mappers filter and normalize the data; they don't create it. The {{AvroKey}} and {{AvroValue}} parameters passed to the Mapper are then returned by the Mapper. There are no {{AvroKey}} or {{AvroValue}} classes instantiated in the Mapper. They are passed directly through the Map method. 
This demonstrates that the entire data set can be read by the Avro serialization framework, but something goes wrong during the intermediate file stage. This may be an issue with how the Hadoop MapReduce framework interacts with Avro's library. > Strange IndexOutofBoundException in GenericDatumReader.readString > - > > Key: AVRO-1786 > URL: https://issues.apache.org/jira/browse/AVRO-1786 > Project: Avro > Issue Type: Bug > Components: java >Affects Versions: 1.7.4, 1.7.7 > Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64 > Use IBM JVM: > IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References > 20140515_199835 (JIT enabled, AOT enabled) >Reporter: Yong Zhang >Priority: Minor > > Our production cluster is CENTOS 6.5 (2.6.32-358.14.1.el6.x86_64), running > IBM BigInsight V3.0.0.2. In Apache term, it is Hadoop 2.2.0 with MRV1(no > yarn), and comes with AVRO 1.7.4, running with IBM J9 VM (build 2.7, JRE > 1.7.0 Linux amd64-64 Compressed References 20140515_199835 (JIT enabled, AOT > enabled). Not sure if the JDK matters, but it is NOT Oracle JVM. > We have a ETL implemented in a chain of MR jobs. In one MR job, it is going > to merge 2 sets of AVRO data. Dataset1 is in HDFS location A, and Dataset2 is > in HDFS location B, and both contains the AVRO records binding to the same > AVRO schema. The record contains an unique id field, and a timestamp field. > The MR job is to merge the records based on the ID, and use the later > timestamp record to replace previous timestamp record, and omit the final > AVRO record out. Very straightforward. 
> Now we faced a problem that one reducer keeps failing with the following > stacktrace on JobTracker: > {code} > java.lang.IndexOutOfBoundsException > at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:191) > at java.io.DataInputStream.read(DataInputStream.java:160) > at > org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:184) > at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263) > at > org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107) > at > org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:143) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:125) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:121) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154) > at > org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148) > at >
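The merge logic the reporter describes (for each unique id, keep the record with the later timestamp) can be sketched as below. The class and field names are hypothetical; the actual job's schema and key class are not shown in the issue.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical "latest wins" merge, mirroring the ETL step described in the
// issue: records sharing one id are reduced to the one with the newest
// timestamp.
public class LatestWinsMerge {

  static final class Record {
    final String id;        // unique id field
    final long timestamp;   // later timestamp replaces earlier
    final String payload;   // stand-in for the rest of the Avro record

    Record(String id, long timestamp, String payload) {
      this.id = id;
      this.timestamp = timestamp;
      this.payload = payload;
    }
  }

  // Equivalent of the reduce() body: all values share one id; emit the newest.
  static Record reduce(List<Record> values) {
    Record latest = null;
    for (Record r : values) {
      if (latest == null || r.timestamp > latest.timestamp) {
        latest = r;
      }
    }
    return latest;
  }

  public static void main(String[] args) {
    Record merged = reduce(Arrays.asList(
        new Record("k1", 100L, "old"),
        new Record("k1", 200L, "new")));
    System.out.println(merged.payload);
  }
}
```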
[jira] [Updated] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
[ https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2056: -- Status: Patch Available (was: In Progress) > DirectBinaryEncoder Creates Buffer For Each Call To writeDouble > --- > > Key: AVRO-2056 > URL: https://issues.apache.org/jira/browse/AVRO-2056 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.8.2, 1.7.7 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2056.1.patch > > > Each call to {{writeDouble}} creates a new buffer and promptly throws it away, even though the class has a re-usable buffer that is already used in other methods such as {{writeFloat}}. Remove this extra buffer. > {code:title=org.apache.avro.io.DirectBinaryEncoder} > // the buffer is used for writing floats, doubles, and large longs. > private final byte[] buf = new byte[12]; > @Override > public void writeFloat(float f) throws IOException { > int len = BinaryData.encodeFloat(f, buf, 0); > out.write(buf, 0, len); > } > @Override > public void writeDouble(double d) throws IOException { > byte[] buf = new byte[8]; > int len = BinaryData.encodeDouble(d, buf, 0); > out.write(buf, 0, len); > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
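A minimal sketch of the proposed fix: reuse the encoder's shared 12-byte buffer in {{writeDouble}} instead of allocating a fresh {{byte[8]}} per call. {{BinaryData.encodeDouble}} is replaced here by {{ByteBuffer}} so the example runs without the Avro jar; per the Avro spec, a double is written as the 8 little-endian bytes of its IEEE-754 long bits. Class names are illustrative, not the actual patch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative encoder fragment: one shared buffer for floats, doubles, and
// large longs, instead of a throwaway allocation inside writeDouble.
public class SharedBufferEncoder {
  private final OutputStream out;
  // The buffer is reused across calls, as writeFloat already does.
  private final byte[] buf = new byte[12];

  SharedBufferEncoder(OutputStream out) {
    this.out = out;
  }

  public void writeDouble(double d) throws IOException {
    // Was: byte[] buf = new byte[8];  -- a fresh allocation per call.
    ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN).putDouble(d);
    out.write(buf, 0, 8); // Avro doubles are exactly 8 little-endian bytes
  }

  // Helper for demonstration: encode one double to a byte array.
  public static byte[] encode(double d) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    new SharedBufferEncoder(bos).writeDouble(d);
    return bos.toByteArray();
  }
}
```

Since the shared buffer is an instance field and {{DirectBinaryEncoder}} is not thread-safe anyway, reusing it introduces no new concurrency hazard.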
[jira] [Updated] (AVRO-2054) Use StringBuilder instead of StringBuffer
[ https://issues.apache.org/jira/browse/AVRO-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2054: -- Status: Patch Available (was: In Progress) > Use StringBuilder instead of StringBuffer > - > > Key: AVRO-2054 > URL: https://issues.apache.org/jira/browse/AVRO-2054 > Project: Avro > Issue Type: Improvement >Affects Versions: 1.8.2, 1.7.7 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2054.1.patch, AVRO-2054.2.patch > > > Use the un-synchronized StringBuilder instead of StringBuffer. Use _char_ > values instead of Strings. > {code:title=org.apache.trevni.MetaData} > @Override public String toString() { > StringBuffer buffer = new StringBuffer(); > buffer.append("{ "); > for (Map.Entry<String, byte[]> e : entrySet()) { > buffer.append(e.getKey()); > buffer.append("="); > try { > buffer.append(new String(e.getValue(), "ISO-8859-1")); > } catch (java.io.UnsupportedEncodingException error) { > throw new TrevniRuntimeException(error); > } > buffer.append(" "); > } > buffer.append("}"); > return buffer.toString(); > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
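The suggested rewrite might look like the following. It is simplified to a static helper over a {{Map<String, byte[]>}} so it runs standalone (the real method lives on {{org.apache.trevni.MetaData}}); using {{StandardCharsets.ISO_8859_1}} instead of the charset-name string also removes the impossible {{UnsupportedEncodingException}} branch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Sketch of the proposed MetaData.toString(): un-synchronized StringBuilder
// and single-char appends instead of one-character Strings.
public class MetaDataToString {
  static String format(Map<String, byte[]> entries) {
    StringBuilder buffer = new StringBuilder();
    buffer.append("{ ");
    for (Map.Entry<String, byte[]> e : entries.entrySet()) {
      buffer.append(e.getKey());
      buffer.append('=');                                   // char, not "="
      // Charset overload cannot throw UnsupportedEncodingException.
      buffer.append(new String(e.getValue(), StandardCharsets.ISO_8859_1));
      buffer.append(' ');
    }
    buffer.append('}');
    return buffer.toString();
  }
}
```

The output is identical to the original; only the synchronization overhead and the dead catch block go away.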
[jira] [Updated] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings
[ https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2048: -- Status: Patch Available (was: In Progress) > Avro Binary Decoding - Gracefully Handle Long Strings > - > > Key: AVRO-2048 > URL: https://issues.apache.org/jira/browse/AVRO-2048 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.8.2, 1.7.7 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2048.1.patch, AVRO-2048.2.patch, AVRO-2048.3.patch > > > According to the > [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]: > bq. a string is encoded as a *long* followed by that many bytes of UTF-8 > encoded character data. > However, that is currently not being adhered to: > {code:title=org.apache.avro.io.BinaryDecoder} > @Override > public Utf8 readString(Utf8 old) throws IOException { > int length = readInt(); > Utf8 result = (old != null ? old : new Utf8()); > result.setByteLength(length); > if (0 != length) { > doReadBytes(result.getBytes(), 0, length); > } > return result; > } > {code} > The first thing the code does here is to load an *int* value, not a *long*. > Because of the variable length nature of the size, this will mostly work. > However, there may be edge-cases where the serializer is putting in large > length values erroneously or nefariously. Let us gracefully detect such > scenarios and more closely adhere to the spec. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
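A sketch of the graceful handling proposed in the ticket: read the length as a *long* per the spec, then reject values that are negative or too large to index a Java array before allocating, instead of silently narrowing through {{readInt()}}. The varint decoding mirrors Avro's zig-zag format; this is illustrative code, not the actual {{BinaryDecoder}} patch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Illustrative decoder fragment: validate the string length before trusting
// it, so erroneous or nefarious length values fail cleanly.
public class SafeStringReader {

  // Zig-zag varint decode, as Avro uses for long values.
  static long readLong(InputStream in) throws IOException {
    long z = 0;
    int shift = 0, b;
    do {
      b = in.read();
      if (b < 0) throw new IOException("EOF inside varint");
      z |= (long) (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return (z >>> 1) ^ -(z & 1);
  }

  static String readString(byte[] data) throws IOException {
    InputStream in = new ByteArrayInputStream(data);
    long length = readLong(in);                // long, as the spec requires
    if (length < 0 || length > Integer.MAX_VALUE - 8) {
      // Fail fast instead of letting a narrowed int corrupt the read.
      throw new IOException("Invalid string length: " + length);
    }
    byte[] utf8 = in.readNBytes((int) length);
    if (utf8.length != length) throw new IOException("Truncated string");
    return new String(utf8, StandardCharsets.UTF_8);
  }
}
```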
[jira] [Updated] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer
[ https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2049: -- Status: Patch Available (was: In Progress) > Remove Superfluous Configuration From AvroSerializer > > > Key: AVRO-2049 > URL: https://issues.apache.org/jira/browse/AVRO-2049 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.8.2, 1.7.7 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch > > > In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the > Avro block size is configured with a hard-coded value and there is a request > to benchmark different buffer sizes. > {code:title=org.apache.avro.hadoop.io.AvroSerializer} > /** >* The block size for the Avro encoder. >* >* This number was copied from the AvroSerialization of > org.apache.avro.mapred in Avro 1.5.1. >* >* TODO(gwu): Do some benchmarking with different numbers here to see if it > is important. >*/ > private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512; > /** An factory for creating Avro datum encoders. */ > private static EncoderFactory mEncoderFactory > = new > EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES); > {code} > However, there is no need to benchmark, this setting is superfluous and is > ignored with the current implementation. > {code:title=org.apache.avro.hadoop.io.AvroSerializer} > @Override > public void open(OutputStream outputStream) throws IOException { > mOutputStream = outputStream; > mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder); > } > {code} > {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting. 
> This setting is only relevant for calls to > {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}} > which considers the configured "Block Size" for doing binary encoding of > blocked Array types as laid out in the > [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex]. > It can simply be removed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2054) Use StringBuilder instead of StringBuffer
[ https://issues.apache.org/jira/browse/AVRO-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2054: -- Status: In Progress (was: Patch Available) > Use StringBuilder instead of StringBuffer > - > > Key: AVRO-2054 > URL: https://issues.apache.org/jira/browse/AVRO-2054 > Project: Avro > Issue Type: Improvement >Affects Versions: 1.8.2, 1.7.7 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2054.1.patch, AVRO-2054.2.patch > > > Use the un-synchronized StringBuilder instead of StringBuffer. Use _char_ > values instead of Strings. > {code:title=org.apache.trevni.MetaData} > @Override public String toString() { > StringBuffer buffer = new StringBuffer(); > buffer.append("{ "); > for (Map.Entry<String, byte[]> e : entrySet()) { > buffer.append(e.getKey()); > buffer.append("="); > try { > buffer.append(new String(e.getValue(), "ISO-8859-1")); > } catch (java.io.UnsupportedEncodingException error) { > throw new TrevniRuntimeException(error); > } > buffer.append(" "); > } > buffer.append("}"); > return buffer.toString(); > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer
[ https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2049: -- Status: In Progress (was: Patch Available) > Remove Superfluous Configuration From AvroSerializer > > > Key: AVRO-2049 > URL: https://issues.apache.org/jira/browse/AVRO-2049 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.8.2, 1.7.7 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch > > > In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the > Avro block size is configured with a hard-coded value and there is a request > to benchmark different buffer sizes. > {code:title=org.apache.avro.hadoop.io.AvroSerializer} > /** >* The block size for the Avro encoder. >* >* This number was copied from the AvroSerialization of > org.apache.avro.mapred in Avro 1.5.1. >* >* TODO(gwu): Do some benchmarking with different numbers here to see if it > is important. >*/ > private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512; > /** An factory for creating Avro datum encoders. */ > private static EncoderFactory mEncoderFactory > = new > EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES); > {code} > However, there is no need to benchmark, this setting is superfluous and is > ignored with the current implementation. > {code:title=org.apache.avro.hadoop.io.AvroSerializer} > @Override > public void open(OutputStream outputStream) throws IOException { > mOutputStream = outputStream; > mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder); > } > {code} > {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting. 
> This setting is only relevant for calls to > {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}} > which considers the configured "Block Size" for doing binary encoding of > blocked Array types as laid out in the > [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex]. > It can simply be removed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer
[ https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2049: -- Status: Patch Available (was: In Progress) > Remove Superfluous Configuration From AvroSerializer > > > Key: AVRO-2049 > URL: https://issues.apache.org/jira/browse/AVRO-2049 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.8.2, 1.7.7 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch > > > In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the > Avro block size is configured with a hard-coded value and there is a request > to benchmark different buffer sizes. > {code:title=org.apache.avro.hadoop.io.AvroSerializer} > /** >* The block size for the Avro encoder. >* >* This number was copied from the AvroSerialization of > org.apache.avro.mapred in Avro 1.5.1. >* >* TODO(gwu): Do some benchmarking with different numbers here to see if it > is important. >*/ > private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512; > /** An factory for creating Avro datum encoders. */ > private static EncoderFactory mEncoderFactory > = new > EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES); > {code} > However, there is no need to benchmark, this setting is superfluous and is > ignored with the current implementation. > {code:title=org.apache.avro.hadoop.io.AvroSerializer} > @Override > public void open(OutputStream outputStream) throws IOException { > mOutputStream = outputStream; > mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder); > } > {code} > {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting. 
> This setting is only relevant for calls to > {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}} > which considers the configured "Block Size" for doing binary encoding of > blocked Array types as laid out in the > [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex]. > It can simply be removed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer
[ https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101827#comment-16101827 ] BELUGA BEHR commented on AVRO-2049: --- [~nkollar] Maybe so, we can reuse encoder. I'm not sure how often "open" is actually called though. New ticket? > Remove Superfluous Configuration From AvroSerializer > > > Key: AVRO-2049 > URL: https://issues.apache.org/jira/browse/AVRO-2049 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch > > > In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the > Avro block size is configured with a hard-coded value and there is a request > to benchmark different buffer sizes. > {code:title=org.apache.avro.hadoop.io.AvroSerializer} > /** >* The block size for the Avro encoder. >* >* This number was copied from the AvroSerialization of > org.apache.avro.mapred in Avro 1.5.1. >* >* TODO(gwu): Do some benchmarking with different numbers here to see if it > is important. >*/ > private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512; > /** An factory for creating Avro datum encoders. */ > private static EncoderFactory mEncoderFactory > = new > EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES); > {code} > However, there is no need to benchmark, this setting is superfluous and is > ignored with the current implementation. > {code:title=org.apache.avro.hadoop.io.AvroSerializer} > @Override > public void open(OutputStream outputStream) throws IOException { > mOutputStream = outputStream; > mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder); > } > {code} > {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting. 
> This setting is only relevant for calls to > {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}} > which considers the configured "Block Size" for doing binary encoding of > blocked Array types as laid out in the > [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex]. > It can simply be removed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
[ https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2056: -- Status: In Progress (was: Patch Available) > DirectBinaryEncoder Creates Buffer For Each Call To writeDouble > --- > > Key: AVRO-2056 > URL: https://issues.apache.org/jira/browse/AVRO-2056 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.8.2, 1.7.7 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2056.1.patch > > > Each call to {{writeDouble}} creates a new buffer and promptly throws it away > even though the class has a re-usable buffer and is used in other methods > such as {{writeFloat}}. Remove this extra buffer. > {code:title=org.apache.avro.io.DirectBinaryEncoder} > // the buffer is used for writing floats, doubles, and large longs. > private final byte[] buf = new byte[12]; > @Override > public void writeFloat(float f) throws IOException { > int len = BinaryData.encodeFloat(f, buf, 0); > out.write(buf, 0, len); > } > @Override > public void writeDouble(double d) throws IOException { > byte[] buf = new byte[8]; > int len = BinaryData.encodeDouble(d, buf, 0); > out.write(buf, 0, len); > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
[ https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101805#comment-16101805 ] BELUGA BEHR commented on AVRO-2056: --- [~gszadovszky] I'm sorry, but I do not understand your request to "add the Perf change." I did not change the Perf test to accomplish the testing other than un-comment the direct binary encoder: {code} Encoder e = encoder_factory.binaryEncoder(out, null); //Encoder e = encoder_factory.directBinaryEncoder(out, null); {code} > DirectBinaryEncoder Creates Buffer For Each Call To writeDouble > --- > > Key: AVRO-2056 > URL: https://issues.apache.org/jira/browse/AVRO-2056 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2056.1.patch > > > Each call to {{writeDouble}} creates a new buffer and promptly throws it away > even though the class has a re-usable buffer and is used in other methods > such as {{writeFloat}}. Remove this extra buffer. > {code:title=org.apache.avro.io.DirectBinaryEncoder} > // the buffer is used for writing floats, doubles, and large longs. > private final byte[] buf = new byte[12]; > @Override > public void writeFloat(float f) throws IOException { > int len = BinaryData.encodeFloat(f, buf, 0); > out.write(buf, 0, len); > } > @Override > public void writeDouble(double d) throws IOException { > byte[] buf = new byte[8]; > int len = BinaryData.encodeDouble(d, buf, 0); > out.write(buf, 0, len); > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer
[ https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101798#comment-16101798 ] BELUGA BEHR commented on AVRO-2049: --- [~gszadovszky] Good catch. Thanks. Perhaps we should create a new ticket to remove the magic number. > Remove Superfluous Configuration From AvroSerializer > > > Key: AVRO-2049 > URL: https://issues.apache.org/jira/browse/AVRO-2049 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch > > > In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the > Avro block size is configured with a hard-coded value and there is a request > to benchmark different buffer sizes. > {code:title=org.apache.avro.hadoop.io.AvroSerializer} > /** >* The block size for the Avro encoder. >* >* This number was copied from the AvroSerialization of > org.apache.avro.mapred in Avro 1.5.1. >* >* TODO(gwu): Do some benchmarking with different numbers here to see if it > is important. >*/ > private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512; > /** An factory for creating Avro datum encoders. */ > private static EncoderFactory mEncoderFactory > = new > EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES); > {code} > However, there is no need to benchmark, this setting is superfluous and is > ignored with the current implementation. > {code:title=org.apache.avro.hadoop.io.AvroSerializer} > @Override > public void open(OutputStream outputStream) throws IOException { > mOutputStream = outputStream; > mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder); > } > {code} > {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting. 
> This setting is only relevant for calls to > {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}} > which considers the configured "Block Size" for doing binary encoding of > blocked Array types as laid out in the > [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex]. > It can simply be removed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer
[ https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2049: -- Attachment: AVRO-2049.3.patch > Remove Superfluous Configuration From AvroSerializer > > > Key: AVRO-2049 > URL: https://issues.apache.org/jira/browse/AVRO-2049 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch > > > In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the > Avro block size is configured with a hard-coded value and there is a request > to benchmark different buffer sizes. > {code:title=org.apache.avro.hadoop.io.AvroSerializer} > /** >* The block size for the Avro encoder. >* >* This number was copied from the AvroSerialization of > org.apache.avro.mapred in Avro 1.5.1. >* >* TODO(gwu): Do some benchmarking with different numbers here to see if it > is important. >*/ > private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512; > /** An factory for creating Avro datum encoders. */ > private static EncoderFactory mEncoderFactory > = new > EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES); > {code} > However, there is no need to benchmark, this setting is superfluous and is > ignored with the current implementation. > {code:title=org.apache.avro.hadoop.io.AvroSerializer} > @Override > public void open(OutputStream outputStream) throws IOException { > mOutputStream = outputStream; > mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder); > } > {code} > {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting. 
> This setting is only relevant for calls to > {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}} > which considers the configured "Block Size" for doing binary encoding of > blocked Array types as laid out in the > [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex]. > It can simply be removed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer
[ https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2049: -- Status: In Progress (was: Patch Available) > Remove Superfluous Configuration From AvroSerializer > > > Key: AVRO-2049 > URL: https://issues.apache.org/jira/browse/AVRO-2049 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.8.2, 1.7.7 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch > > > In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the > Avro block size is configured with a hard-coded value and there is a request > to benchmark different buffer sizes. > {code:title=org.apache.avro.hadoop.io.AvroSerializer} > /** >* The block size for the Avro encoder. >* >* This number was copied from the AvroSerialization of > org.apache.avro.mapred in Avro 1.5.1. >* >* TODO(gwu): Do some benchmarking with different numbers here to see if it > is important. >*/ > private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512; > /** An factory for creating Avro datum encoders. */ > private static EncoderFactory mEncoderFactory > = new > EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES); > {code} > However, there is no need to benchmark, this setting is superfluous and is > ignored with the current implementation. > {code:title=org.apache.avro.hadoop.io.AvroSerializer} > @Override > public void open(OutputStream outputStream) throws IOException { > mOutputStream = outputStream; > mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder); > } > {code} > {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting. 
> This setting is only relevant for calls to > {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}} > which considers the configured "Block Size" for doing binary encoding of > blocked Array types as laid out in the > [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex]. > It can simply be removed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings
[ https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101786#comment-16101786 ] BELUGA BEHR commented on AVRO-2048: --- [~gszadovszky] I borrowed same implementation from [here|http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/tip/src/share/classes/java/util/ArrayList.java#l190] > Avro Binary Decoding - Gracefully Handle Long Strings > - > > Key: AVRO-2048 > URL: https://issues.apache.org/jira/browse/AVRO-2048 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2048.1.patch, AVRO-2048.2.patch, AVRO-2048.3.patch > > > According to the > [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]: > bq. a string is encoded as a *long* followed by that many bytes of UTF-8 > encoded character data. > However, that is currently not being adhered to: > {code:title=org.apache.avro.io.BinaryDecoder} > @Override > public Utf8 readString(Utf8 old) throws IOException { > int length = readInt(); > Utf8 result = (old != null ? old : new Utf8()); > result.setByteLength(length); > if (0 != length) { > doReadBytes(result.getBytes(), 0, length); > } > return result; > } > {code} > The first thing the code does here is to load an *int* value, not a *long*. > Because of the variable length nature of the size, this will mostly work. > However, there may be edge-cases where the serializer is putting in large > length values erroneously or nefariously. Let us gracefully detect such > scenarios and more closely adhere to the spec. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
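The graceful handling the ticket asks for amounts to: decode the length as a zigzag-varint *long* (per the spec), then reject values that cannot be a Java array length instead of silently narrowing to `int`. A minimal stdlib sketch of that pattern follows; the names `readZigZagLong` and `checkedStringLength` are mine, not Avro's, and the real `BinaryDecoder` reads from its own internal buffer rather than an `InputStream`.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class VarIntSketch {
  // Avro longs are zigzag-encoded, 7 bits per byte, least-significant group first.
  static long readZigZagLong(InputStream in) throws IOException {
    long n = 0;
    int shift = 0;
    int b;
    do {
      if (shift >= 64) throw new IOException("varint too long");
      b = in.read();
      if (b < 0) throw new EOFException("truncated varint");
      n |= (long) (b & 0x7f) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return (n >>> 1) ^ -(n & 1); // undo zigzag
  }

  // Validate before narrowing, instead of letting a bad length wrap around.
  static int checkedStringLength(InputStream in) throws IOException {
    long length = readZigZagLong(in);
    if (length < 0 || length > Integer.MAX_VALUE) {
      throw new IOException("string length out of range: " + length);
    }
    return (int) length;
  }
}
```

A nefarious or corrupt length such as a negative value now fails with a descriptive exception at the decode site rather than surfacing later as an `ArrayIndexOutOfBoundsException` or a huge allocation.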
[jira] [Updated] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings
[ https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2048: -- Attachment: AVRO-2048.3.patch > Avro Binary Decoding - Gracefully Handle Long Strings > - > > Key: AVRO-2048 > URL: https://issues.apache.org/jira/browse/AVRO-2048 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2048.1.patch, AVRO-2048.2.patch, AVRO-2048.3.patch > > > According to the > [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]: > bq. a string is encoded as a *long* followed by that many bytes of UTF-8 > encoded character data. > However, that is currently not being adhered to: > {code:title=org.apache.avro.io.BinaryDecoder} > @Override > public Utf8 readString(Utf8 old) throws IOException { > int length = readInt(); > Utf8 result = (old != null ? old : new Utf8()); > result.setByteLength(length); > if (0 != length) { > doReadBytes(result.getBytes(), 0, length); > } > return result; > } > {code} > The first thing the code does here is to load an *int* value, not a *long*. > Because of the variable length nature of the size, this will mostly work. > However, there may be edge-cases where the serializer is putting in large > length values erroneously or nefariously. Let us gracefully detect such > scenarios and more closely adhere to the spec. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
[ https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16096959#comment-16096959 ] BELUGA BEHR edited comment on AVRO-2056 at 7/22/17 3:20 PM: With the patch applied, writing Doubles is between 1% and 2% faster and has the added benefit of cutting down on garbage collection {code} # Avro Master + Patch DoubleWrite: 2762 ms 72.407 579.258 200 DoubleWrite: 2786 ms 71.787 574.293 200 DoubleWrite: 2755 ms 72.570 580.561 200 # Avro Master DoubleWrite: 2822 ms 70.871 566.965 200 DoubleWrite: 2830 ms 70.667 565.336 200 DoubleWrite: 2807 ms 71.230 569.842 200 {code} was (Author: belugabehr): With the patch applied, writing Doubles is between 1% and 2% faster and has the added benefit of cutting down on garbage collection {code} # Buffer Removed DoubleWrite: 2762 ms 72.407 579.258 200 DoubleWrite: 2786 ms 71.787 574.293 200 DoubleWrite: 2755 ms 72.570 580.561 200 # Buffer Present DoubleWrite: 2822 ms 70.871 566.965 200 DoubleWrite: 2830 ms 70.667 565.336 200 DoubleWrite: 2807 ms 71.230 569.842 200 {code} > DirectBinaryEncoder Creates Buffer For Each Call To writeDouble > --- > > Key: AVRO-2056 > URL: https://issues.apache.org/jira/browse/AVRO-2056 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2056.1.patch > > > Each call to {{writeDouble}} creates a new buffer and promptly throws it away > even though the class has a re-usable buffer and is used in other methods > such as {{writeFloat}}. Remove this extra buffer. > {code:title=org.apache.avro.io.DirectBinaryEncoder} > // the buffer is used for writing floats, doubles, and large longs. 
> private final byte[] buf = new byte[12]; > @Override > public void writeFloat(float f) throws IOException { > int len = BinaryData.encodeFloat(f, buf, 0); > out.write(buf, 0, len); > } > @Override > public void writeDouble(double d) throws IOException { > byte[] buf = new byte[8]; > int len = BinaryData.encodeDouble(d, buf, 0); > out.write(buf, 0, len); > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings
[ https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097362#comment-16097362 ] BELUGA BEHR commented on AVRO-2048: --- I can't explain why this is, but it seems to be a tad faster with this patch: {code} # Avro Master StringRead: 3291 ms 12.152 432.835 1780910 StringRead: 3290 ms 12.155 432.949 1780910 StringRead: 3287 ms 12.166 433.320 1780910 # Avro Master + Patch StringRead: 3270 ms 12.229 435.574 1780910 StringRead: 3288 ms 12.163 433.241 1780910 StringRead: 3271 ms 12.227 435.521 1780910 {code} > Avro Binary Decoding - Gracefully Handle Long Strings > - > > Key: AVRO-2048 > URL: https://issues.apache.org/jira/browse/AVRO-2048 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2048.1.patch, AVRO-2048.2.patch > > > According to the > [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]: > bq. a string is encoded as a *long* followed by that many bytes of UTF-8 > encoded character data. > However, that is currently not being adhered to: > {code:title=org.apache.avro.io.BinaryDecoder} > @Override > public Utf8 readString(Utf8 old) throws IOException { > int length = readInt(); > Utf8 result = (old != null ? old : new Utf8()); > result.setByteLength(length); > if (0 != length) { > doReadBytes(result.getBytes(), 0, length); > } > return result; > } > {code} > The first thing the code does here is to load an *int* value, not a *long*. > Because of the variable length nature of the size, this will mostly work. > However, there may be edge-cases where the serializer is putting in large > length values erroneously or nefariously. Let us gracefully detect such > scenarios and more closely adhere to the spec. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (AVRO-2052) Remove org.apache.avro.file.DataFileWriter Double Buffering
[ https://issues.apache.org/jira/browse/AVRO-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR resolved AVRO-2052. --- Resolution: Not A Problem [~cutting] I just ran the Perf test. Indeed, the {{directBinaryEncoder}} is slower than the {{binaryEncoder}} even when writing into a buffered OutputStream. Closing the ticket. > Remove org.apache.avro.file.DataFileWriter Double Buffering > --- > > Key: AVRO-2052 > URL: https://issues.apache.org/jira/browse/AVRO-2052 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2052.1.patch > > > {code:title=org.apache.avro.file.DataFileWriter} > private void init(OutputStream outs) throws IOException { > this.underlyingStream = outs; > this.out = new BufferedFileOutputStream(outs); > EncoderFactory efactory = new EncoderFactory(); > this.vout = efactory.binaryEncoder(out, null); > dout.setSchema(schema); > buffer = new NonCopyingByteArrayOutputStream( > Math.min((int)(syncInterval * 1.25), Integer.MAX_VALUE/2 -1)); > this.bufOut = efactory.binaryEncoder(buffer, null); > if (this.codec == null) { > this.codec = CodecFactory.nullCodec().createInstance(); > } > this.isOpen = true; > } > {code} > It's clear here that both streams are writing to a buffered destination, {{ > BufferedFileOutputStream}} and {{ByteArrayOutputStream}} therefore there is > no reason to need a buffered encoder and instead, write directly to the > buffered streams with {{directBinaryEncoder}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
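One plausible reading of the benchmark result above (not stated in the ticket, so treat it as an assumption): an encoder with its own internal buffer pays one stream call per flush, while a "direct" encoder pays a virtual `write` call per datum even when the destination is buffered. A stdlib illustration of the two call patterns, with illustrative names:

```java
import java.io.IOException;
import java.io.OutputStream;

final class BatchingSketch {
  // "Direct" style: one stream call per byte written.
  static void writeDirect(OutputStream out, byte[] data) throws IOException {
    for (byte b : data) {
      out.write(b);
    }
  }

  // "Buffered encoder" style: accumulate into a local scratch array and
  // hand bytes to the stream in large chunks.
  static void writeBatched(OutputStream out, byte[] data) throws IOException {
    byte[] scratch = new byte[64];
    int pos = 0;
    for (byte b : data) {
      if (pos == scratch.length) {
        out.write(scratch, 0, pos); // one call flushes 64 bytes
        pos = 0;
      }
      scratch[pos++] = b;
    }
    out.write(scratch, 0, pos); // flush the tail
  }
}
```

Both produce identical output; the difference is purely in how many times the underlying stream is invoked, which is consistent with `binaryEncoder` beating `directBinaryEncoder` even into an already-buffered `OutputStream`.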
[jira] [Commented] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
[ https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16096959#comment-16096959 ] BELUGA BEHR commented on AVRO-2056: --- With the patch applied, writing Doubles is between 1% and 2% faster and has the added benefit of cutting down on garbage collection {code} # Buffer Removed DoubleWrite: 2762 ms 72.407 579.258 200 DoubleWrite: 2786 ms 71.787 574.293 200 DoubleWrite: 2755 ms 72.570 580.561 200 # Buffer Present DoubleWrite: 2822 ms 70.871 566.965 200 DoubleWrite: 2830 ms 70.667 565.336 200 DoubleWrite: 2807 ms 71.230 569.842 200 {code} > DirectBinaryEncoder Creates Buffer For Each Call To writeDouble > --- > > Key: AVRO-2056 > URL: https://issues.apache.org/jira/browse/AVRO-2056 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2056.1.patch > > > Each call to {{writeDouble}} creates a new buffer and promptly throws it away > even though the class has a re-usable buffer and is used in other methods > such as {{writeFloat}}. Remove this extra buffer. > {code:title=org.apache.avro.io.DirectBinaryEncoder} > // the buffer is used for writing floats, doubles, and large longs. > private final byte[] buf = new byte[12]; > @Override > public void writeFloat(float f) throws IOException { > int len = BinaryData.encodeFloat(f, buf, 0); > out.write(buf, 0, len); > } > @Override > public void writeDouble(double d) throws IOException { > byte[] buf = new byte[8]; > int len = BinaryData.encodeDouble(d, buf, 0); > out.write(buf, 0, len); > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings
[ https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2048: -- Attachment: AVRO-2048.2.patch Fixed checkstyle failures > Avro Binary Decoding - Gracefully Handle Long Strings > - > > Key: AVRO-2048 > URL: https://issues.apache.org/jira/browse/AVRO-2048 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2048.1.patch, AVRO-2048.2.patch > > > According to the > [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]: > bq. a string is encoded as a *long* followed by that many bytes of UTF-8 > encoded character data. > However, that is currently not being adhered to: > {code:title=org.apache.avro.io.BinaryDecoder} > @Override > public Utf8 readString(Utf8 old) throws IOException { > int length = readInt(); > Utf8 result = (old != null ? old : new Utf8()); > result.setByteLength(length); > if (0 != length) { > doReadBytes(result.getBytes(), 0, length); > } > return result; > } > {code} > The first thing the code does here is to load an *int* value, not a *long*. > Because of the variable length nature of the size, this will mostly work. > However, there may be edge-cases where the serializer is putting in large > length values erroneously or nefariously. Let us gracefully detect such > scenarios and more closely adhere to the spec. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2054) Use StringBuilder instead of StringBuffer
[ https://issues.apache.org/jira/browse/AVRO-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2054: -- Attachment: AVRO-2054.2.patch Renamed the automatic variable from _buffer_ to _builder_. > Use StringBuilder instead of StringBuffer > - > > Key: AVRO-2054 > URL: https://issues.apache.org/jira/browse/AVRO-2054 > Project: Avro > Issue Type: Improvement >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2054.1.patch, AVRO-2054.2.patch > > > Use the un-synchronized StringBuilder instead of StringBuffer. Use _char_ > values instead of Strings. > {code:title=org.apache.trevni.MetaData} > @Override public String toString() { > StringBuffer buffer = new StringBuffer(); > buffer.append("{ "); > for (Map.Entrye : entrySet()) { > buffer.append(e.getKey()); > buffer.append("="); > try { > buffer.append(new String(e.getValue(), "ISO-8859-1")); > } catch (java.io.UnsupportedEncodingException error) { > throw new TrevniRuntimeException(error); > } > buffer.append(" "); > } > buffer.append("}"); > return buffer.toString(); > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
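A rough reconstruction of what the patched method might look like. Hedged heavily: the real class is `org.apache.trevni.MetaData`, which is itself a byte-array-valued map; here a plain `Map<String, byte[]>` parameter stands in, and `StandardCharsets.ISO_8859_1` replaces the charset-name string, which also removes the checked `UnsupportedEncodingException` and its `TrevniRuntimeException` wrapper entirely.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

final class MetaDataSketch {
  static String render(Map<String, byte[]> entries) {
    StringBuilder builder = new StringBuilder(); // un-synchronized, unlike StringBuffer
    builder.append("{ ");
    for (Map.Entry<String, byte[]> e : entries.entrySet()) {
      builder.append(e.getKey());
      builder.append('='); // char append, not a one-character String
      builder.append(new String(e.getValue(), StandardCharsets.ISO_8859_1));
      builder.append(' ');
    }
    builder.append('}');
    return builder.toString();
  }
}
```

Since the builder is a method-local variable, `StringBuffer`'s synchronization buys nothing here, which is the whole case for the swap.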
[jira] [Issue Comment Deleted] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
[ https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2056: -- Comment: was deleted (was: Renamed the automatic variable from _buffer_ to _builder_.) > DirectBinaryEncoder Creates Buffer For Each Call To writeDouble > --- > > Key: AVRO-2056 > URL: https://issues.apache.org/jira/browse/AVRO-2056 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2056.1.patch > > > Each call to {{writeDouble}} creates a new buffer and promptly throws it away > even though the class has a re-usable buffer and is used in other methods > such as {{writeFloat}}. Remove this extra buffer. > {code:title=org.apache.avro.io.DirectBinaryEncoder} > // the buffer is used for writing floats, doubles, and large longs. > private final byte[] buf = new byte[12]; > @Override > public void writeFloat(float f) throws IOException { > int len = BinaryData.encodeFloat(f, buf, 0); > out.write(buf, 0, len); > } > @Override > public void writeDouble(double d) throws IOException { > byte[] buf = new byte[8]; > int len = BinaryData.encodeDouble(d, buf, 0); > out.write(buf, 0, len); > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
[ https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2056: -- Attachment: AVRO-2054.2.patch Renamed the automatic variable from _buffer_ to _builder_. > DirectBinaryEncoder Creates Buffer For Each Call To writeDouble > --- > > Key: AVRO-2056 > URL: https://issues.apache.org/jira/browse/AVRO-2056 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2056.1.patch > > > Each call to {{writeDouble}} creates a new buffer and promptly throws it away > even though the class has a re-usable buffer and is used in other methods > such as {{writeFloat}}. Remove this extra buffer. > {code:title=org.apache.avro.io.DirectBinaryEncoder} > // the buffer is used for writing floats, doubles, and large longs. > private final byte[] buf = new byte[12]; > @Override > public void writeFloat(float f) throws IOException { > int len = BinaryData.encodeFloat(f, buf, 0); > out.write(buf, 0, len); > } > @Override > public void writeDouble(double d) throws IOException { > byte[] buf = new byte[8]; > int len = BinaryData.encodeDouble(d, buf, 0); > out.write(buf, 0, len); > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
[ https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2056: -- Attachment: (was: AVRO-2054.2.patch) > DirectBinaryEncoder Creates Buffer For Each Call To writeDouble > --- > > Key: AVRO-2056 > URL: https://issues.apache.org/jira/browse/AVRO-2056 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2056.1.patch > > > Each call to {{writeDouble}} creates a new buffer and promptly throws it away > even though the class has a re-usable buffer and is used in other methods > such as {{writeFloat}}. Remove this extra buffer. > {code:title=org.apache.avro.io.DirectBinaryEncoder} > // the buffer is used for writing floats, doubles, and large longs. > private final byte[] buf = new byte[12]; > @Override > public void writeFloat(float f) throws IOException { > int len = BinaryData.encodeFloat(f, buf, 0); > out.write(buf, 0, len); > } > @Override > public void writeDouble(double d) throws IOException { > byte[] buf = new byte[8]; > int len = BinaryData.encodeDouble(d, buf, 0); > out.write(buf, 0, len); > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
[ https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2056: -- Affects Version/s: 1.7.7 1.8.2 > DirectBinaryEncoder Creates Buffer For Each Call To writeDouble > --- > > Key: AVRO-2056 > URL: https://issues.apache.org/jira/browse/AVRO-2056 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2056.1.patch > > > Each call to {{writeDouble}} creates a new buffer and promptly throws it away > even though the class has a re-usable buffer and is used in other methods > such as {{writeFloat}}. Remove this extra buffer. > {code:title=org.apache.avro.io.DirectBinaryEncoder} > // the buffer is used for writing floats, doubles, and large longs. > private final byte[] buf = new byte[12]; > @Override > public void writeFloat(float f) throws IOException { > int len = BinaryData.encodeFloat(f, buf, 0); > out.write(buf, 0, len); > } > @Override > public void writeDouble(double d) throws IOException { > byte[] buf = new byte[8]; > int len = BinaryData.encodeDouble(d, buf, 0); > out.write(buf, 0, len); > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
[ https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2056: -- Attachment: AVRO-2056.1.patch > DirectBinaryEncoder Creates Buffer For Each Call To writeDouble > --- > > Key: AVRO-2056 > URL: https://issues.apache.org/jira/browse/AVRO-2056 > Project: Avro > Issue Type: Improvement > Components: java >Reporter: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2056.1.patch > > > Each call to {{writeDouble}} creates a new buffer and promptly throws it away > even though the class has a re-usable buffer and is used in other methods > such as {{writeFloat}}. Remove this extra buffer. > {code:title=org.apache.avro.io.DirectBinaryEncoder} > // the buffer is used for writing floats, doubles, and large longs. > private final byte[] buf = new byte[12]; > @Override > public void writeFloat(float f) throws IOException { > int len = BinaryData.encodeFloat(f, buf, 0); > out.write(buf, 0, len); > } > @Override > public void writeDouble(double d) throws IOException { > byte[] buf = new byte[8]; > int len = BinaryData.encodeDouble(d, buf, 0); > out.write(buf, 0, len); > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
[ https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR reassigned AVRO-2056: - Assignee: BELUGA BEHR > DirectBinaryEncoder Creates Buffer For Each Call To writeDouble > --- > > Key: AVRO-2056 > URL: https://issues.apache.org/jira/browse/AVRO-2056 > Project: Avro > Issue Type: Improvement > Components: java >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2056.1.patch > > > Each call to {{writeDouble}} creates a new buffer and promptly throws it away > even though the class has a re-usable buffer and is used in other methods > such as {{writeFloat}}. Remove this extra buffer. > {code:title=org.apache.avro.io.DirectBinaryEncoder} > // the buffer is used for writing floats, doubles, and large longs. > private final byte[] buf = new byte[12]; > @Override > public void writeFloat(float f) throws IOException { > int len = BinaryData.encodeFloat(f, buf, 0); > out.write(buf, 0, len); > } > @Override > public void writeDouble(double d) throws IOException { > byte[] buf = new byte[8]; > int len = BinaryData.encodeDouble(d, buf, 0); > out.write(buf, 0, len); > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
BELUGA BEHR created AVRO-2056: - Summary: DirectBinaryEncoder Creates Buffer For Each Call To writeDouble Key: AVRO-2056 URL: https://issues.apache.org/jira/browse/AVRO-2056 Project: Avro Issue Type: Improvement Components: java Reporter: BELUGA BEHR Priority: Minor Each call to {{writeDouble}} creates a new buffer and promptly throws it away even though the class has a re-usable buffer and is used in other methods such as {{writeFloat}}. Remove this extra buffer. {code:title=org.apache.avro.io.DirectBinaryEncoder} // the buffer is used for writing floats, doubles, and large longs. private final byte[] buf = new byte[12]; @Override public void writeFloat(float f) throws IOException { int len = BinaryData.encodeFloat(f, buf, 0); out.write(buf, 0, len); } @Override public void writeDouble(double d) throws IOException { byte[] buf = new byte[8]; int len = BinaryData.encodeDouble(d, buf, 0); out.write(buf, 0, len); } {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
[ https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2056: -- Status: Patch Available (was: Open) > DirectBinaryEncoder Creates Buffer For Each Call To writeDouble > --- > > Key: AVRO-2056 > URL: https://issues.apache.org/jira/browse/AVRO-2056 > Project: Avro > Issue Type: Improvement > Components: java >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2056.1.patch > > > Each call to {{writeDouble}} creates a new buffer and promptly throws it away > even though the class has a re-usable buffer and is used in other methods > such as {{writeFloat}}. Remove this extra buffer. > {code:title=org.apache.avro.io.DirectBinaryEncoder} > // the buffer is used for writing floats, doubles, and large longs. > private final byte[] buf = new byte[12]; > @Override > public void writeFloat(float f) throws IOException { > int len = BinaryData.encodeFloat(f, buf, 0); > out.write(buf, 0, len); > } > @Override > public void writeDouble(double d) throws IOException { > byte[] buf = new byte[8]; > int len = BinaryData.encodeDouble(d, buf, 0); > out.write(buf, 0, len); > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AVRO-2052) Remove org.apache.avro.file.DataFileWriter Double Buffering
[ https://issues.apache.org/jira/browse/AVRO-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092115#comment-16092115 ] BELUGA BEHR commented on AVRO-2052: --- [~cutting] There is a small amount of buffering in {{DirectBinaryEncoder}}. There exists a 12-byte buffer. However, in one case this buffer is not used and an array is instantiated for each invocation. Seems like an oversight to me. Thoughts? {code:title=org.apache.avro.io.DirectBinaryEncoder.writeDouble(double)} @Override public void writeDouble(double d) throws IOException { byte[] buf = new byte[8]; int len = BinaryData.encodeDouble(d, buf, 0); out.write(buf, 0, len); } {code} > Remove org.apache.avro.file.DataFileWriter Double Buffering > --- > > Key: AVRO-2052 > URL: https://issues.apache.org/jira/browse/AVRO-2052 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2052.1.patch > > > {code:title=org.apache.avro.file.DataFileWriter} > private void init(OutputStream outs) throws IOException { > this.underlyingStream = outs; > this.out = new BufferedFileOutputStream(outs); > EncoderFactory efactory = new EncoderFactory(); > this.vout = efactory.binaryEncoder(out, null); > dout.setSchema(schema); > buffer = new NonCopyingByteArrayOutputStream( > Math.min((int)(syncInterval * 1.25), Integer.MAX_VALUE/2 -1)); > this.bufOut = efactory.binaryEncoder(buffer, null); > if (this.codec == null) { > this.codec = CodecFactory.nullCodec().createInstance(); > } > this.isOpen = true; > } > {code} > It's clear here that both streams are writing to a buffered destination, {{ > BufferedFileOutputStream}} and {{ByteArrayOutputStream}} therefore there is > no reason to need a buffered encoder and instead, write directly to the > buffered streams with {{directBinaryEncoder}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AVRO-2050) Clear Array To Allow GC
[ https://issues.apache.org/jira/browse/AVRO-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092098#comment-16092098 ] BELUGA BEHR commented on AVRO-2050: --- [~cutting] Thanks for the input. I was simply mimicking the OpenJDK implementation. I attached a new patch. Either will work. :) > Clear Array To Allow GC > --- > > Key: AVRO-2050 > URL: https://issues.apache.org/jira/browse/AVRO-2050 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2050.1.patch, AVRO-2050.2.patch, AVRO-2050.3.patch > > > Java's {{ArrayList}} implementation clears all Objects from the internal > buffer when the {{clear()}} method is called. This allows the Objects to be > free for GC. We should do the same in Avro > {{org.apache.avro.generic.GenericData}} > [ArrayList > Source|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/ArrayList.java#ArrayList.clear%28%29] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2050) Clear Array To Allow GC
[ https://issues.apache.org/jira/browse/AVRO-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2050: -- Status: Patch Available (was: Open) > Clear Array To Allow GC > --- > > Key: AVRO-2050 > URL: https://issues.apache.org/jira/browse/AVRO-2050 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.8.2, 1.7.7 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2050.1.patch, AVRO-2050.2.patch, AVRO-2050.3.patch > > > Java's {{ArrayList}} implementation clears all Objects from the internal > buffer when the {{clear()}} method is called. This allows the Objects to be > free for GC. We should do the same in Avro > {{org.apache.avro.generic.GenericData}} > [ArrayList > Source|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/ArrayList.java#ArrayList.clear%28%29] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2050) Clear Array To Allow GC
[ https://issues.apache.org/jira/browse/AVRO-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2050: -- Status: Open (was: Patch Available) > Clear Array To Allow GC > --- > > Key: AVRO-2050 > URL: https://issues.apache.org/jira/browse/AVRO-2050 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.8.2, 1.7.7 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2050.1.patch, AVRO-2050.2.patch, AVRO-2050.3.patch > > > Java's {{ArrayList}} implementation clears all Objects from the internal > buffer when the {{clear()}} method is called. This allows the Objects to be > free for GC. We should do the same in Avro > {{org.apache.avro.generic.GenericData}} > [ArrayList > Source|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/ArrayList.java#ArrayList.clear%28%29] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2050) Clear Array To Allow GC
[ https://issues.apache.org/jira/browse/AVRO-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2050: -- Attachment: AVRO-2050.3.patch > Clear Array To Allow GC > --- > > Key: AVRO-2050 > URL: https://issues.apache.org/jira/browse/AVRO-2050 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2050.1.patch, AVRO-2050.2.patch, AVRO-2050.3.patch > > > Java's {{ArrayList}} implementation clears all Objects from the internal > buffer when the {{clear()}} method is called. This allows the Objects to be > free for GC. We should do the same in Avro > {{org.apache.avro.generic.GenericData}} > [ArrayList > Source|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/ArrayList.java#ArrayList.clear%28%29] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2055) Remove Magic Value From org.apache.avro.hadoop.io.AvroSequenceFile
[ https://issues.apache.org/jira/browse/AVRO-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2055: -- Status: Patch Available (was: Open) > Remove Magic Value From org.apache.avro.hadoop.io.AvroSequenceFile > -- > > Key: AVRO-2055 > URL: https://issues.apache.org/jira/browse/AVRO-2055 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.8.2, 1.7.7 >Reporter: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2055.1.patch > > > Remove magic string _io.file.buffer.size_ and _DEFAULT_BUFFER_SIZE_BYTES_ and > instead rely on the Hadoop libraries to provide this information. Will help > to keep Avro in sync with changes in Hadoop. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2055) Remove Magic Value From org.apache.avro.hadoop.io.AvroSequenceFile
[ https://issues.apache.org/jira/browse/AVRO-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2055: -- Attachment: AVRO-2055.1.patch > Remove Magic Value From org.apache.avro.hadoop.io.AvroSequenceFile > -- > > Key: AVRO-2055 > URL: https://issues.apache.org/jira/browse/AVRO-2055 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2055.1.patch > > > Remove magic string _io.file.buffer.size_ and _DEFAULT_BUFFER_SIZE_BYTES_ and > instead rely on the Hadoop libraries to provide this information. Will help > to keep Avro in sync with changes in Hadoop. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (AVRO-2055) Remove Magic Value From org.apache.avro.hadoop.io.AvroSequenceFile
[ https://issues.apache.org/jira/browse/AVRO-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR reassigned AVRO-2055: - Assignee: BELUGA BEHR > Remove Magic Value From org.apache.avro.hadoop.io.AvroSequenceFile > -- > > Key: AVRO-2055 > URL: https://issues.apache.org/jira/browse/AVRO-2055 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2055.1.patch > > > Remove magic string _io.file.buffer.size_ and _DEFAULT_BUFFER_SIZE_BYTES_ and > instead rely on the Hadoop libraries to provide this information. Will help > to keep Avro in sync with changes in Hadoop. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AVRO-2055) Remove Magic Value From org.apache.avro.hadoop.io.AvroSequenceFile
BELUGA BEHR created AVRO-2055: - Summary: Remove Magic Value From org.apache.avro.hadoop.io.AvroSequenceFile Key: AVRO-2055 URL: https://issues.apache.org/jira/browse/AVRO-2055 Project: Avro Issue Type: Improvement Components: java Affects Versions: 1.8.2, 1.7.7 Reporter: BELUGA BEHR Priority: Trivial Remove magic string _io.file.buffer.size_ and _DEFAULT_BUFFER_SIZE_BYTES_ and instead rely on the Hadoop libraries to provide this information. Will help to keep Avro in sync with changes in Hadoop. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2054) Use StringBuilder instead of StringBuffer
[ https://issues.apache.org/jira/browse/AVRO-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2054: -- Status: Patch Available (was: Open) > Use StringBuilder instead of StringBuffer > - > > Key: AVRO-2054 > URL: https://issues.apache.org/jira/browse/AVRO-2054 > Project: Avro > Issue Type: Improvement >Affects Versions: 1.8.2, 1.7.7 >Reporter: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2054.1.patch > > > Use the un-synchronized StringBuilder instead of StringBuffer. Use _char_ > values instead of Strings. > {code:title=org.apache.trevni.MetaData} > @Override public String toString() { > StringBuffer buffer = new StringBuffer(); > buffer.append("{ "); > for (Map.Entrye : entrySet()) { > buffer.append(e.getKey()); > buffer.append("="); > try { > buffer.append(new String(e.getValue(), "ISO-8859-1")); > } catch (java.io.UnsupportedEncodingException error) { > throw new TrevniRuntimeException(error); > } > buffer.append(" "); > } > buffer.append("}"); > return buffer.toString(); > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2054) Use StringBuilder instead of StringBuffer
[ https://issues.apache.org/jira/browse/AVRO-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2054: -- Attachment: AVRO-2054.1.patch > Use StringBuilder instead of StringBuffer > - > > Key: AVRO-2054 > URL: https://issues.apache.org/jira/browse/AVRO-2054 > Project: Avro > Issue Type: Improvement >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2054.1.patch > > > Use the un-synchronized StringBuilder instead of StringBuffer. Use _char_ > values instead of Strings. > {code:title=org.apache.trevni.MetaData} > @Override public String toString() { > StringBuffer buffer = new StringBuffer(); > buffer.append("{ "); > for (Map.Entrye : entrySet()) { > buffer.append(e.getKey()); > buffer.append("="); > try { > buffer.append(new String(e.getValue(), "ISO-8859-1")); > } catch (java.io.UnsupportedEncodingException error) { > throw new TrevniRuntimeException(error); > } > buffer.append(" "); > } > buffer.append("}"); > return buffer.toString(); > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (AVRO-2054) Use StringBuilder instead of StringBuffer
[ https://issues.apache.org/jira/browse/AVRO-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR reassigned AVRO-2054: - Assignee: BELUGA BEHR > Use StringBuilder instead of StringBuffer > - > > Key: AVRO-2054 > URL: https://issues.apache.org/jira/browse/AVRO-2054 > Project: Avro > Issue Type: Improvement >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2054.1.patch > > > Use the un-synchronized StringBuilder instead of StringBuffer. Use _char_ > values instead of Strings. > {code:title=org.apache.trevni.MetaData} > @Override public String toString() { > StringBuffer buffer = new StringBuffer(); > buffer.append("{ "); > for (Map.Entrye : entrySet()) { > buffer.append(e.getKey()); > buffer.append("="); > try { > buffer.append(new String(e.getValue(), "ISO-8859-1")); > } catch (java.io.UnsupportedEncodingException error) { > throw new TrevniRuntimeException(error); > } > buffer.append(" "); > } > buffer.append("}"); > return buffer.toString(); > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AVRO-2054) Use StringBuilder instead of StringBuffer
BELUGA BEHR created AVRO-2054: - Summary: Use StringBuilder instead of StringBuffer Key: AVRO-2054 URL: https://issues.apache.org/jira/browse/AVRO-2054 Project: Avro Issue Type: Improvement Affects Versions: 1.8.2, 1.7.7 Reporter: BELUGA BEHR Priority: Trivial Use the un-synchronized StringBuilder instead of StringBuffer. Use _char_ values instead of Strings. {code:title=org.apache.trevni.MetaData} @Override public String toString() { StringBuffer buffer = new StringBuffer(); buffer.append("{ "); for (Map.Entrye : entrySet()) { buffer.append(e.getKey()); buffer.append("="); try { buffer.append(new String(e.getValue(), "ISO-8859-1")); } catch (java.io.UnsupportedEncodingException error) { throw new TrevniRuntimeException(error); } buffer.append(" "); } buffer.append("}"); return buffer.toString(); } {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
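The rewrite AVRO-2054 proposes can be sketched as follows. Note this is an illustration, not the attached patch: it is written against a plain `Map<String, byte[]>` rather than the real `org.apache.trevni.MetaData` class, and it additionally swaps the `"ISO-8859-1"` string for `StandardCharsets.ISO_8859_1` (Java 7+), which eliminates the checked `UnsupportedEncodingException` and the `TrevniRuntimeException` wrapper along with it — that last substitution is this sketch's own choice, not something stated in the ticket.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class MetaDataSketch {
  public static String toString(Map<String, byte[]> entries) {
    // Un-synchronized StringBuilder instead of StringBuffer.
    StringBuilder builder = new StringBuilder();
    builder.append("{ ");
    for (Map.Entry<String, byte[]> e : entries.entrySet()) {
      builder.append(e.getKey());
      builder.append('=');  // char append instead of the one-character String "="
      // StandardCharsets.ISO_8859_1 cannot be unsupported, so no
      // try/catch is needed here.
      builder.append(new String(e.getValue(), StandardCharsets.ISO_8859_1));
      builder.append(' ');
    }
    builder.append('}');
    return builder.toString();
  }
}
```

Since the builder never escapes the method, the `StringBuffer` synchronization bought nothing; the `char` appends also skip the length check a one-character `String` append performs.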
[jira] [Updated] (AVRO-2053) Remove Reference To Deprecated Property mapred.output.compression.type
[ https://issues.apache.org/jira/browse/AVRO-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2053: -- Description: Avro utilizes [deprecated|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/DeprecatedProperties.html] property _mapred.output.compression.type_. Update code to use the MRv2 property and don't override default behaviors/settings. Use the appropriate facilities from {{org.apache.hadoop.mapreduce.lib.output.FileOutputFormat}} and {{org.apache.hadoop.io.SequenceFile}}. {code:title=org.apache.avro.mapreduce.AvroSequenceFileOutputFormat} /** Configuration key for storing the type of compression for the target sequence file. */ private static final String CONF_COMPRESSION_TYPE = "mapred.output.compression.type"; /** The default compression type for the target sequence file. */ private static final CompressionType DEFAULT_COMPRESSION_TYPE = CompressionType.RECORD; {code} was: Avro utilizes [deprecated|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/DeprecatedProperties.html] property "mapred.output.compression.type". Update code to use the MRv2 property and don't override default behaviors/settings. Use the appropriate facilities from {{org.apache.hadoop.mapreduce.lib.output.FileOutputFormat}} and {{org.apache.hadoop.io.SequenceFile}}. {code:title=org.apache.avro.mapreduce.AvroSequenceFileOutputFormat} /** Configuration key for storing the type of compression for the target sequence file. */ private static final String CONF_COMPRESSION_TYPE = "mapred.output.compression.type"; /** The default compression type for the target sequence file. 
*/ private static final CompressionType DEFAULT_COMPRESSION_TYPE = CompressionType.RECORD; {code} > Remove Reference To Deprecated Property mapred.output.compression.type > -- > > Key: AVRO-2053 > URL: https://issues.apache.org/jira/browse/AVRO-2053 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2053.1.patch > > > Avro utilizes > [deprecated|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/DeprecatedProperties.html] > property _mapred.output.compression.type_. Update code to use the MRv2 > property and don't override default behaviors/settings. Use the appropriate > facilities from {{org.apache.hadoop.mapreduce.lib.output.FileOutputFormat}} > and {{org.apache.hadoop.io.SequenceFile}}. > {code:title=org.apache.avro.mapreduce.AvroSequenceFileOutputFormat} > /** Configuration key for storing the type of compression for the target > sequence file. */ > private static final String CONF_COMPRESSION_TYPE = > "mapred.output.compression.type"; > /** The default compression type for the target sequence file. */ > private static final CompressionType DEFAULT_COMPRESSION_TYPE = > CompressionType.RECORD; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (AVRO-2053) Remove Reference To Deprecated Property mapred.output.compression.type
[ https://issues.apache.org/jira/browse/AVRO-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR reassigned AVRO-2053: - Assignee: BELUGA BEHR > Remove Reference To Deprecated Property mapred.output.compression.type > -- > > Key: AVRO-2053 > URL: https://issues.apache.org/jira/browse/AVRO-2053 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2053.1.patch > > > Avro utilizes > [deprecated|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/DeprecatedProperties.html] > property "mapred.output.compression.type". Update code to use the MRv2 > property and don't override default behaviors/settings. Use the appropriate > facilities from {{org.apache.hadoop.mapreduce.lib.output.FileOutputFormat}} > and {{org.apache.hadoop.io.SequenceFile}}. > {code:title=org.apache.avro.mapreduce.AvroSequenceFileOutputFormat} > /** Configuration key for storing the type of compression for the target > sequence file. */ > private static final String CONF_COMPRESSION_TYPE = > "mapred.output.compression.type"; > /** The default compression type for the target sequence file. */ > private static final CompressionType DEFAULT_COMPRESSION_TYPE = > CompressionType.RECORD; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2053) Remove Reference To Deprecated Property mapred.output.compression.type
[ https://issues.apache.org/jira/browse/AVRO-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2053: -- Status: Patch Available (was: Open) > Remove Reference To Deprecated Property mapred.output.compression.type > -- > > Key: AVRO-2053 > URL: https://issues.apache.org/jira/browse/AVRO-2053 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.8.2, 1.7.7 >Reporter: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2053.1.patch > > > Avro utilizes > [deprecated|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/DeprecatedProperties.html] > property "mapred.output.compression.type". Update code to use the MRv2 > property and don't override default behaviors/settings. Use the appropriate > facilities from {{org.apache.hadoop.mapreduce.lib.output.FileOutputFormat}} > and {{org.apache.hadoop.io.SequenceFile}}. > {code:title=org.apache.avro.mapreduce.AvroSequenceFileOutputFormat} > /** Configuration key for storing the type of compression for the target > sequence file. */ > private static final String CONF_COMPRESSION_TYPE = > "mapred.output.compression.type"; > /** The default compression type for the target sequence file. */ > private static final CompressionType DEFAULT_COMPRESSION_TYPE = > CompressionType.RECORD; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2053) Remove Reference To Deprecated Property mapred.output.compression.type
[ https://issues.apache.org/jira/browse/AVRO-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2053: -- Attachment: AVRO-2053.1.patch > Remove Reference To Deprecated Property mapred.output.compression.type > -- > > Key: AVRO-2053 > URL: https://issues.apache.org/jira/browse/AVRO-2053 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2053.1.patch > > > Avro utilizes > [deprecated|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/DeprecatedProperties.html] > property "mapred.output.compression.type". Update code to use the MRv2 > property and don't override default behaviors/settings. Use the appropriate > facilities from {{org.apache.hadoop.mapreduce.lib.output.FileOutputFormat}} > and {{org.apache.hadoop.io.SequenceFile}}. > {code:title=org.apache.avro.mapreduce.AvroSequenceFileOutputFormat} > /** Configuration key for storing the type of compression for the target > sequence file. */ > private static final String CONF_COMPRESSION_TYPE = > "mapred.output.compression.type"; > /** The default compression type for the target sequence file. */ > private static final CompressionType DEFAULT_COMPRESSION_TYPE = > CompressionType.RECORD; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AVRO-2053) Remove Reference To Deprecated Property mapred.output.compression.type
BELUGA BEHR created AVRO-2053: - Summary: Remove Reference To Deprecated Property mapred.output.compression.type Key: AVRO-2053 URL: https://issues.apache.org/jira/browse/AVRO-2053 Project: Avro Issue Type: Improvement Components: java Affects Versions: 1.8.2, 1.7.7 Reporter: BELUGA BEHR Priority: Trivial Avro utilizes [deprecated|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/DeprecatedProperties.html] property "mapred.output.compression.type". Update code to use the MRv2 property and don't override default behaviors/settings. Use the appropriate facilities from {{org.apache.hadoop.mapreduce.lib.output.FileOutputFormat}} and {{org.apache.hadoop.io.SequenceFile}}. {code:title=org.apache.avro.mapreduce.AvroSequenceFileOutputFormat} /** Configuration key for storing the type of compression for the target sequence file. */ private static final String CONF_COMPRESSION_TYPE = "mapred.output.compression.type"; /** The default compression type for the target sequence file. */ private static final CompressionType DEFAULT_COMPRESSION_TYPE = CompressionType.RECORD; {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
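For context on AVRO-2053: per Hadoop's deprecated-properties table, the MRv2 spelling of `mapred.output.compression.type` is `mapreduce.output.fileoutputformat.compress.type`. The lookup order the ticket implies is sketched below with a plain `Map` standing in for a Hadoop `Configuration`; in real code one would more likely delegate to `SequenceFileOutputFormat.getOutputCompressionType(...)`, which resolves the deprecation internally, so treat this only as an illustration of the key precedence.

```java
import java.util.Map;

public class CompressionTypeSketch {
  static final String MRV1_KEY = "mapred.output.compression.type";
  static final String MRV2_KEY = "mapreduce.output.fileoutputformat.compress.type";
  static final String DEFAULT_TYPE = "RECORD";  // SequenceFile's own default

  // Prefer the current MRv2 key; honour the deprecated MRv1 key only as a
  // fallback; otherwise defer to the SequenceFile default rather than
  // hard-coding one in Avro.
  public static String getCompressionType(Map<String, String> conf) {
    String type = conf.get(MRV2_KEY);
    if (type == null) {
      type = conf.get(MRV1_KEY);
    }
    return type != null ? type : DEFAULT_TYPE;
  }
}
```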
[jira] [Assigned] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings
[ https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR reassigned AVRO-2048: - Assignee: BELUGA BEHR > Avro Binary Decoding - Gracefully Handle Long Strings > - > > Key: AVRO-2048 > URL: https://issues.apache.org/jira/browse/AVRO-2048 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2048.1.patch > > > According to the > [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]: > bq. a string is encoded as a *long* followed by that many bytes of UTF-8 > encoded character data. > However, that is currently not being adhered to: > {code:title=org.apache.avro.io.BinaryDecoder} > @Override > public Utf8 readString(Utf8 old) throws IOException { > int length = readInt(); > Utf8 result = (old != null ? old : new Utf8()); > result.setByteLength(length); > if (0 != length) { > doReadBytes(result.getBytes(), 0, length); > } > return result; > } > {code} > The first thing the code does here is to load an *int* value, not a *long*. > Because of the variable length nature of the size, this will mostly work. > However, there may be edge-cases where the serializer is putting in large > length values erroneously or nefariously. Let us gracefully detect such > scenarios and more closely adhere to the spec. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
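The guard AVRO-2048 asks for amounts to reading the length as a `long`, as the spec requires, and validating it before narrowing to `int`. The check itself is sketched below in isolation; in the real `BinaryDecoder` it would wrap the value returned by `readLong()`, and the exception type and messages here are placeholders, not what the attached patch necessarily uses.

```java
public class StringLengthCheck {
  // Validate a spec-conformant long length before narrowing to int.
  // readInt() would instead silently truncate an erroneous or nefarious
  // length written by the serializer.
  public static int checkedLength(long length) {
    if (length < 0) {
      throw new RuntimeException("Malformed data: negative length " + length);
    }
    if (length > Integer.MAX_VALUE) {
      throw new RuntimeException(
          "Cannot read strings longer than " + Integer.MAX_VALUE + " bytes");
    }
    return (int) length;
  }
}
```

The common case (lengths that fit in an `int`) behaves exactly as before; only the previously undetected edge cases now fail fast with a diagnosable error instead of corrupting the read.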
[jira] [Assigned] (AVRO-2052) Remove org.apache.avro.file.DataFileWriter Double Buffering
[ https://issues.apache.org/jira/browse/AVRO-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR reassigned AVRO-2052: - Assignee: BELUGA BEHR > Remove org.apache.avro.file.DataFileWriter Double Buffering > --- > > Key: AVRO-2052 > URL: https://issues.apache.org/jira/browse/AVRO-2052 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2052.1.patch > > > {code:title=org.apache.avro.file.DataFileWriter} > private void init(OutputStream outs) throws IOException { > this.underlyingStream = outs; > this.out = new BufferedFileOutputStream(outs); > EncoderFactory efactory = new EncoderFactory(); > this.vout = efactory.binaryEncoder(out, null); > dout.setSchema(schema); > buffer = new NonCopyingByteArrayOutputStream( > Math.min((int)(syncInterval * 1.25), Integer.MAX_VALUE/2 -1)); > this.bufOut = efactory.binaryEncoder(buffer, null); > if (this.codec == null) { > this.codec = CodecFactory.nullCodec().createInstance(); > } > this.isOpen = true; > } > {code} > It's clear here that both streams are writing to a buffered destination, {{ > BufferedFileOutputStream}} and {{ByteArrayOutputStream}} therefore there is > no reason to need a buffered encoder and instead, write directly to the > buffered streams with {{directBinaryEncoder}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2052) Remove org.apache.avro.file.DataFileWriter Double Buffering
[ https://issues.apache.org/jira/browse/AVRO-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2052: -- Attachment: AVRO-2052.1.patch Call {{directBinaryEncoder}} instead of the buffered {{binaryEncoder}} > Remove org.apache.avro.file.DataFileWriter Double Buffering > --- > > Key: AVRO-2052 > URL: https://issues.apache.org/jira/browse/AVRO-2052 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2052.1.patch > > > {code:title=org.apache.avro.file.DataFileWriter} > private void init(OutputStream outs) throws IOException { > this.underlyingStream = outs; > this.out = new BufferedFileOutputStream(outs); > EncoderFactory efactory = new EncoderFactory(); > this.vout = efactory.binaryEncoder(out, null); > dout.setSchema(schema); > buffer = new NonCopyingByteArrayOutputStream( > Math.min((int)(syncInterval * 1.25), Integer.MAX_VALUE/2 -1)); > this.bufOut = efactory.binaryEncoder(buffer, null); > if (this.codec == null) { > this.codec = CodecFactory.nullCodec().createInstance(); > } > this.isOpen = true; > } > {code} > It's clear here that both streams are writing to a buffered destination, {{ > BufferedFileOutputStream}} and {{ByteArrayOutputStream}} therefore there is > no reason to need a buffered encoder and instead, write directly to the > buffered streams with {{directBinaryEncoder}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AVRO-2052) Remove org.apache.avro.file.DataFileWriter Double Buffering
BELUGA BEHR created AVRO-2052: - Summary: Remove org.apache.avro.file.DataFileWriter Double Buffering Key: AVRO-2052 URL: https://issues.apache.org/jira/browse/AVRO-2052 Project: Avro Issue Type: Improvement Components: java Affects Versions: 1.8.2, 1.7.7 Reporter: BELUGA BEHR Priority: Trivial {code:title=org.apache.avro.file.DataFileWriter} private void init(OutputStream outs) throws IOException { this.underlyingStream = outs; this.out = new BufferedFileOutputStream(outs); EncoderFactory efactory = new EncoderFactory(); this.vout = efactory.binaryEncoder(out, null); dout.setSchema(schema); buffer = new NonCopyingByteArrayOutputStream( Math.min((int)(syncInterval * 1.25), Integer.MAX_VALUE/2 -1)); this.bufOut = efactory.binaryEncoder(buffer, null); if (this.codec == null) { this.codec = CodecFactory.nullCodec().createInstance(); } this.isOpen = true; } {code} It's clear here that both streams are writing to a buffered destination, {{ BufferedFileOutputStream}} and {{ByteArrayOutputStream}} therefore there is no reason to need a buffered encoder and instead, write directly to the buffered streams with {{directBinaryEncoder}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AVRO-1786) Strange IndexOutofBoundException in GenericDatumReader.readString
[ https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089823#comment-16089823 ] BELUGA BEHR commented on AVRO-1786: --- May be experiencing this issue as well trying to collect more information... > Strange IndexOutofBoundException in GenericDatumReader.readString > - > > Key: AVRO-1786 > URL: https://issues.apache.org/jira/browse/AVRO-1786 > Project: Avro > Issue Type: Bug > Components: java >Affects Versions: 1.7.4, 1.7.7 > Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64 > Use IBM JVM: > IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References > 20140515_199835 (JIT enabled, AOT enabled) >Reporter: Yong Zhang >Priority: Minor > > Our production cluster is CENTOS 6.5 (2.6.32-358.14.1.el6.x86_64), running > IBM BigInsight V3.0.0.2. In Apache term, it is Hadoop 2.2.0 with MRV1(no > yarn), and comes with AVRO 1.7.4, running with IBM J9 VM (build 2.7, JRE > 1.7.0 Linux amd64-64 Compressed References 20140515_199835 (JIT enabled, AOT > enabled). Not sure if the JDK matters, but it is NOT Oracle JVM. > We have a ETL implemented in a chain of MR jobs. In one MR job, it is going > to merge 2 sets of AVRO data. Dataset1 is in HDFS location A, and Dataset2 is > in HDFS location B, and both contains the AVRO records binding to the same > AVRO schema. The record contains an unique id field, and a timestamp field. > The MR job is to merge the records based on the ID, and use the later > timestamp record to replace previous timestamp record, and omit the final > AVRO record out. Very straightforward. 
> Now we faced a problem that one reducer keeps failing with the following > stacktrace on JobTracker: > {code} > java.lang.IndexOutOfBoundsException > at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:191) > at java.io.DataInputStream.read(DataInputStream.java:160) > at > org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:184) > at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263) > at > org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107) > at > org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:143) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:125) > at > org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:121) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154) > at > org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139) > at > org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:108) > at > org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:48) > at > org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:142) > at > org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:117) > at > org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:297) > at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:165) > at > org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420) > at org.apache.hadoop.mapred.Child$4.run(Child.java:255) > at > 
java.security.AccessController.doPrivileged(AccessController.java:366) > at javax.security.auth.Subject.doAs(Subject.java:572) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1502) > at org.apache.hadoop.mapred.Child.main(Child.java:249) > {code} > Here is the my Mapper and Reducer methods: > Mapper: > public void map(AvroKey key, NullWritable value, Context > context) throws IOException, InterruptedException > Reducer: > protected void reduce(CustomPartitionKeyClass key, > Iterablevalues, Context context) throws > IOException, InterruptedException > What bother me are the following facts: > 1) All the mappers finish without error > 2) Most of the reducers finish without error, but one reducer keeps failing > with the above error. > 3) It looks like caused by the data? But keep in mind that all the avro > records passed the mapper side, but failed in one reducer. > 4) From the stacktrace, it looks like our reducer code was NOT invoked yet, > but failed
[jira] [Updated] (AVRO-2050) Clear Array To Allow GC
[ https://issues.apache.org/jira/browse/AVRO-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2050: -- Attachment: AVRO-2050.2.patch [~nkollar] This implementation is essentially an {{ArrayList}}. The {{ArrayList}} overwrites the {{clear}} method because using the default {{AbstractList}} implementation requires instantiating an Iterator and then deleting each item in the Iterator one at a time. This is bad performance in terms of constant stack manipulation, but also this amounts to draining the array from the head of the list. Draining from the head requires an array copy for each item removed to shift down the existing records. It is much better to override the method as {{ArrayList}} has done. However, I did see some overlap with the {{toString}} and {{add}} methods which can be leveraged. Changed the patch to remove the two overrides. > Clear Array To Allow GC > --- > > Key: AVRO-2050 > URL: https://issues.apache.org/jira/browse/AVRO-2050 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2050.1.patch, AVRO-2050.2.patch > > > Java's {{ArrayList}} implementation clears all Objects from the internal > buffer when the {{clear()}} method is called. This allows the Objects to be > free for GC. We should do the same in Avro > {{org.apache.avro.generic.GenericData}} > [ArrayList > Source|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/ArrayList.java#ArrayList.clear%28%29] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
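The `clear()` behaviour being argued for above can be shown with a minimal array-backed list, mirroring what `java.util.ArrayList.clear()` does: null out the occupied slots so the elements become eligible for GC, but keep the backing array so capacity is reused without reallocation. This toy class is for illustration only — it is not `GenericData.Array` — and the `slot` accessor exists purely so the reference-release is observable.

```java
import java.util.Arrays;

public class GcArraySketch {
  private Object[] elements = new Object[10];
  private int size;

  public void add(Object o) {
    if (size == elements.length) {
      elements = Arrays.copyOf(elements, size * 2);
    }
    elements[size++] = o;
  }

  public int size() { return size; }

  // The naive clear() would just set size = 0, leaving the retained buffer
  // holding strong references to every element and blocking their GC.
  public void clear() {
    Arrays.fill(elements, 0, size, null);  // release references for GC
    size = 0;                              // capacity kept for reuse
  }

  // Illustration-only hook: inspect whether a slot still pins an object.
  Object slot(int i) { return elements[i]; }
}
```

This also reflects the comment's performance point: one `Arrays.fill` over the buffer, rather than `AbstractList`'s iterator-based removal from the head, which would shift the remaining elements on every single removal.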
[jira] [Updated] (AVRO-2050) Clear Array To Allow GC
[ https://issues.apache.org/jira/browse/AVRO-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2050: -- Status: Patch Available (was: Open) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2050) Clear Array To Allow GC
[ https://issues.apache.org/jira/browse/AVRO-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2050: -- Attachment: (was: AVRO-2050.1.patch) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2050) Clear Array To Allow GC
[ https://issues.apache.org/jira/browse/AVRO-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2050: -- Attachment: AVRO-2050.1.patch -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AVRO-2050) Clear Array To Allow GC
BELUGA BEHR created AVRO-2050: - Summary: Clear Array To Allow GC Key: AVRO-2050 URL: https://issues.apache.org/jira/browse/AVRO-2050 Project: Avro Issue Type: Improvement Components: java Affects Versions: 1.8.2, 1.7.7 Reporter: BELUGA BEHR Priority: Minor -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer
[ https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2049: -- Attachment: AVRO-2049.2.patch Updated patch to include a couple of examples of the same superfluous execution > Remove Superfluous Configuration From AvroSerializer > > > Key: AVRO-2049 > URL: https://issues.apache.org/jira/browse/AVRO-2049 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Priority: Trivial > Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch > > > In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the > Avro block size is configured with a hard-coded value and there is a request > to benchmark different buffer sizes. > {code:title=org.apache.avro.hadoop.io.AvroSerializer} > /** >* The block size for the Avro encoder. >* >* This number was copied from the AvroSerialization of > org.apache.avro.mapred in Avro 1.5.1. >* >* TODO(gwu): Do some benchmarking with different numbers here to see if it > is important. >*/ > private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512; > /** An factory for creating Avro datum encoders. */ > private static EncoderFactory mEncoderFactory > = new > EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES); > {code} > However, there is no need to benchmark: this setting is superfluous and is > ignored by the current implementation. > {code:title=org.apache.avro.hadoop.io.AvroSerializer} > @Override > public void open(OutputStream outputStream) throws IOException { > mOutputStream = outputStream; > mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder); > } > {code} > {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting. 
> This setting is only relevant for calls to > {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}} > which considers the configured "Block Size" for doing binary encoding of > blocked Array types as laid out in the > [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex]. > It can simply be removed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
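The blocked array encoding that the configured block size actually governs (and that only {{blockingBinaryEncoder}} performs) can be sketched in plain Java as below. This is a simplified illustration of the spec's format, not Avro's implementation: real Avro buffers by byte size and may also write a negative count followed by a byte length, and the class and method names here are hypothetical.

```java
import java.io.ByteArrayOutputStream;

/**
 * Sketch of the Avro spec's blocked binary encoding for arrays: a
 * series of blocks, each a zig-zag varint item count followed by that
 * many items, terminated by a block with count 0. Simplified; not
 * Avro's EncoderFactory/blockingBinaryEncoder code.
 */
class BlockedArraySketch {
    /** Zig-zag then base-128 varint, per the spec's long encoding. */
    static void writeLong(ByteArrayOutputStream out, long n) {
        long v = (n << 1) ^ (n >> 63);            // zig-zag: small magnitudes stay small
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80)); // continuation bit set
            v >>>= 7;
        }
        out.write((int) v);
    }

    /** Write int items in blocks of at most blockSize entries. */
    static byte[] encode(int[] items, int blockSize) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < items.length; i += blockSize) {
            int count = Math.min(blockSize, items.length - i);
            writeLong(out, count);            // block item count
            for (int j = i; j < i + count; j++) {
                writeLong(out, items[j]);     // the items themselves
            }
        }
        writeLong(out, 0);                    // zero count terminates the array
        return out.toByteArray();
    }
}
```

For example, `encode(new int[]{1, 2, 3}, 2)` produces two blocks (counts 2 and 1) plus the zero terminator. A plain (non-blocking) binary encoder never chunks like this, which is why the block-size configuration is dead weight on that code path.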
[jira] [Comment Edited] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer
[ https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089211#comment-16089211 ] BELUGA BEHR edited comment on AVRO-2049 at 7/17/17 1:21 AM: Updated patch to include a couple of other instances of the same superfluous execution was (Author: belugabehr): Updated patch to include a couple of examples of the same superfluous execution -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer
[ https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2049: -- Status: Patch Available (was: Open) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer
BELUGA BEHR created AVRO-2049: - Summary: Remove Superfluous Configuration From AvroSerializer Key: AVRO-2049 URL: https://issues.apache.org/jira/browse/AVRO-2049 Project: Avro Issue Type: Improvement Components: java Affects Versions: 1.8.2, 1.7.7 Reporter: BELUGA BEHR Priority: Trivial -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings
[ https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2048: -- Description: According to the [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]: bq. a string is encoded as a *long* followed by that many bytes of UTF-8 encoded character data. However, that is currently not being adhered to: {code:title=org.apache.avro.io.BinaryDecoder} @Override public Utf8 readString(Utf8 old) throws IOException { int length = readInt(); Utf8 result = (old != null ? old : new Utf8()); result.setByteLength(length); if (0 != length) { doReadBytes(result.getBytes(), 0, length); } return result; } {code} The first thing the code does here is to load an *int* value, not a *long*. Because of the variable length nature of the size, this will mostly work. However, there may be edge-cases where the serializer is putting in large length values erroneously or nefariously. Let us gracefully detect such scenarios and more closely adhere to the spec. was: According to the [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]: bq. a string is encoded as a *long* followed by that many bytes of UTF-8 encoded character data. However, that is currently not being adhered to: {code:title=org.apache.avro.io.BinaryDecoder} @Override public Utf8 readString(Utf8 old) throws IOException { int length = readInt(); Utf8 result = (old != null ? old : new Utf8()); result.setByteLength(length); if (0 != length) { doReadBytes(result.getBytes(), 0, length); } return result; } {code} The first thing the code does here is to load an *int* value, not a *long*. Because of the variable length nature of the size, this will mostly work. However, there may be edge-cases where this is broken and the serializer is putting in large values erroneously or nefariously. Let us gracefully handle to detect such scenarios and more closely adhere to the spec. 
> Avro Binary Decoding - Gracefully Handle Long Strings > - > > Key: AVRO-2048 > URL: https://issues.apache.org/jira/browse/AVRO-2048 > Project: Avro > Issue Type: Improvement > Components: java >Affects Versions: 1.7.7, 1.8.2 >Reporter: BELUGA BEHR >Priority: Minor > Attachments: AVRO-2048.1.patch > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
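The change the issue proposes, reading the length as the spec's zig-zag varint *long* and validating it before narrowing to an int, can be sketched as below. This is a standalone illustration under assumed names, not Avro's actual {{BinaryDecoder}} code or the attached patch.

```java
/**
 * Sketch of reading a spec-compliant long length and rejecting values
 * that cannot back a Java byte array, instead of silently reading an
 * int. Names are illustrative; not Avro's BinaryDecoder.
 */
class LengthCheckSketch {
    /** Decode one zig-zag base-128 varint long starting at buf[pos]. */
    static long readLong(byte[] buf, int pos) {
        long v = 0;
        int shift = 0;
        int b;
        do {
            b = buf[pos++] & 0xFF;
            v |= (long) (b & 0x7F) << shift; // accumulate 7 bits per byte
            shift += 7;
        } while ((b & 0x80) != 0);           // continuation bit
        return (v >>> 1) ^ -(v & 1);         // undo zig-zag
    }

    /**
     * Fail loudly instead of truncating. Avro's real decoder would
     * likely surface this as an IOException; a runtime exception keeps
     * this sketch self-contained.
     */
    static int checkedLength(long length) {
        if (length < 0 || length > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                "Malformed data: length " + length + " does not fit a Java byte array");
        }
        return (int) length;
    }
}
```

Because small lengths occupy a single varint byte either way, well-formed data is unaffected; only erroneous or hostile length prefixes hit the new check.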
[jira] [Updated] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings
[ https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2048: -- Attachment: AVRO-2048.1.patch -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings
[ https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated AVRO-2048: -- Status: Patch Available (was: Open) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings
BELUGA BEHR created AVRO-2048: - Summary: Avro Binary Decoding - Gracefully Handle Long Strings Key: AVRO-2048 URL: https://issues.apache.org/jira/browse/AVRO-2048 Project: Avro Issue Type: Improvement Components: java Affects Versions: 1.8.2, 1.7.7 Reporter: BELUGA BEHR Priority: Minor -- This message was sent by Atlassian JIRA (v6.4.14#64029)