[jira] [Assigned] (AVRO-2310) GenericData Array Add Use JDK Copy Facilities

2019-01-28 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/AVRO-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR reassigned AVRO-2310:
-

Assignee: BELUGA BEHR

> GenericData Array Add Use JDK Copy Facilities
> -
>
> Key: AVRO-2310
> URL: https://issues.apache.org/jira/browse/AVRO-2310
> Project: Apache Avro
>  Issue Type: Improvement
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
>
> Use {{Arrays.copyOf}} to increase readability
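
For illustration, a minimal sketch of the kind of change this implies (assuming the {{elements}}/{{size}} fields of {{GenericData.Array}} and its existing growth factor; the committed patch may differ):

{code:java}
// Manual grow-and-copy:
//   Object[] newElements = new Object[(size * 3) / 2 + 1];
//   System.arraycopy(elements, 0, newElements, 0, size);
//   elements = newElements;
// With the JDK copy facility, allocation and copy collapse into one call:
if (size == elements.length) {
  elements = java.util.Arrays.copyOf(elements, (size * 3) / 2 + 1);
}
{code}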



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (AVRO-2310) GenericData Array Add Use JDK Copy Facilities

2019-01-28 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/AVRO-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2310:
--
Summary: GenericData Array Add Use JDK Copy Facilities  (was: GenericData 
Array Add Use Latest Facilities)

> GenericData Array Add Use JDK Copy Facilities
> -
>
> Key: AVRO-2310
> URL: https://issues.apache.org/jira/browse/AVRO-2310
> Project: Apache Avro
>  Issue Type: Improvement
>Reporter: BELUGA BEHR
>Priority: Trivial
>
> Use {{Arrays.copyOf}} to increase readability



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AVRO-2310) GenericData Array Add Use Latest Facilities

2019-01-28 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created AVRO-2310:
-

 Summary: GenericData Array Add Use Latest Facilities
 Key: AVRO-2310
 URL: https://issues.apache.org/jira/browse/AVRO-2310
 Project: Apache Avro
  Issue Type: Improvement
Reporter: BELUGA BEHR


Use {{Arrays.copyOf}} to increase readability



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (AVRO-2309) Conversions.java - Use JDK Array Manipulations

2019-01-28 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/AVRO-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2309:
--
Summary: Conversions.java - Use JDK Array Manipulations  (was: 
Conversions.java - Use Fill Method)

> Conversions.java - Use JDK Array Manipulations
> --
>
> Key: AVRO-2309
> URL: https://issues.apache.org/jira/browse/AVRO-2309
> Project: Apache Avro
>  Issue Type: Improvement
>Reporter: BELUGA BEHR
>Priority: Trivial
>
> {code:java|title=Conversions.java}
>   for (int i = 0; i < bytes.length; i += 1) {
> if (i < offset) {
>   bytes[i] = fillByte;
> } else {
>   bytes[i] = unscaled[i - offset];
> }
>   }
> {code}
> Improve readability by using the {{Arrays.fill}} method to fill the first {{offset}} 
> bytes and then copying over the remainder.
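
For illustration, a minimal sketch of that rewrite (assuming the same {{bytes}}, {{offset}}, {{unscaled}}, and {{fillByte}} locals as the snippet above):

{code:java}
// pad the first `offset` bytes with the fill byte...
java.util.Arrays.fill(bytes, 0, offset, fillByte);
// ...then block-copy the unscaled value into the remainder
System.arraycopy(unscaled, 0, bytes, offset, bytes.length - offset);
{code}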



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AVRO-2308) Use JDK1.7 StandardCharsets

2019-01-27 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created AVRO-2308:
-

 Summary: Use JDK1.7 StandardCharsets
 Key: AVRO-2308
 URL: https://issues.apache.org/jira/browse/AVRO-2308
 Project: Apache Avro
  Issue Type: Improvement
Reporter: BELUGA BEHR


Use Java 1.7 
[StandardCharsets|https://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html].
  Every JDK must now include support for several common charsets.
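
For illustration, a minimal before/after sketch of the change (the method and class names here are placeholders, not the patch itself):

{code:java}
import java.nio.charset.StandardCharsets;

class CharsetExample {
  // getBytes("UTF-8") forces callers to handle the checked
  // UnsupportedEncodingException and looks the charset up by name at runtime.
  static byte[] before(String text) throws java.io.UnsupportedEncodingException {
    return text.getBytes("UTF-8");
  }

  // StandardCharsets.UTF_8 is a constant that every JDK (7+) must provide,
  // so there is no checked exception and no name lookup.
  static byte[] after(String text) {
    return text.getBytes(StandardCharsets.UTF_8);
  }
}
{code}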



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AVRO-2061) Improve Invalid File Format Error Message

2017-08-08 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16118953#comment-16118953
 ] 

BELUGA BEHR commented on AVRO-2061:
---

[~nkollar] [~cutting] There ya go!  Patch submitted. :) Thanks.

> Improve Invalid File Format Error Message
> -
>
> Key: AVRO-2061
> URL: https://issues.apache.org/jira/browse/AVRO-2061
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.9.0, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2061.1.patch, AVRO-2061.2.patch
>
>
> {code:title=org.apache.avro.file.DataFileStream}
> try {
>   vin.readFixed(magic); // read magic
> } catch (IOException e) {
>   throw new IOException("Not a data file.", e);
> }
> if (!Arrays.equals(DataFileConstants.MAGIC, magic))
>   throw new IOException("Not a data file.");
> {code}
> Please consider improving the error message here.  I just saw a MapReduce job 
> fail with an IOException with the message "Not a data file."  There were 
> definitely data files in the input directory; however, they were not Avro 
> files. It would have been much more helpful if the error had told me that.
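
For illustration, a sketch of the kind of message the issue asks for (the actual wording is in the attached patches, which are not shown here):

{code:java}
if (!Arrays.equals(DataFileConstants.MAGIC, magic))
  throw new IOException(
      "Not an Avro data file: stream does not start with the Avro magic bytes.");
{code}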



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2061) Improve Invalid File Format Error Message

2017-08-08 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2061:
--
Attachment: AVRO-2061.2.patch

> Improve Invalid File Format Error Message
> -
>
> Key: AVRO-2061
> URL: https://issues.apache.org/jira/browse/AVRO-2061
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.9.0, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2061.1.patch, AVRO-2061.2.patch
>
>
> {code:title=org.apache.avro.file.DataFileStream}
> try {
>   vin.readFixed(magic); // read magic
> } catch (IOException e) {
>   throw new IOException("Not a data file.", e);
> }
> if (!Arrays.equals(DataFileConstants.MAGIC, magic))
>   throw new IOException("Not a data file.");
> {code}
> Please consider improving the error message here.  I just saw a MapReduce job 
> fail with an IOException with the message "Not a data file."  There were 
> definitely data files in the input directory; however, they were not Avro 
> files. It would have been much more helpful if the error had told me that.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (AVRO-1786) Strange IndexOutofBoundException in GenericDatumReader.readString

2017-08-04 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114574#comment-16114574
 ] 

BELUGA BEHR edited comment on AVRO-1786 at 8/4/17 4:17 PM:
---

In regards to the original case, it is interesting to note that the Reducer 
failed while [deserializing the 
key|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/ReduceContextImpl.java#L142].
  However, the Map output key class is set via 
{{mergeJob.setMapOutputKeyClass(Contact2MergePartitionKey.class)}}, and it is not 
an {{AvroKey}} class.  So, I'm not sure why the Avro deserializer is in play 
here at all.


was (Author: belugabehr):
In regards to the original case, it is interesting to note that the Reducer 
failed while [deserializing the 
key|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/ReduceContextImpl.java#L142].
  However, the Map output key is of type 
{{mergeJob.setMapOutputKeyClass(Contact2MergePartitionKey.class)}} which is not 
an {{AvroKey}} class.

> Strange IndexOutofBoundException in GenericDatumReader.readString
> -
>
> Key: AVRO-1786
> URL: https://issues.apache.org/jira/browse/AVRO-1786
> Project: Avro
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.7.4, 1.7.7
> Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64
> Use IBM JVM:
> IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References 
> 20140515_199835 (JIT enabled, AOT enabled)
>Reporter: Yong Zhang
>Priority: Minor
>
> Our production cluster is CentOS 6.5 (2.6.32-358.14.1.el6.x86_64), running 
> IBM BigInsight V3.0.0.2. In Apache terms, it is Hadoop 2.2.0 with MRv1 (no 
> YARN), and it comes with Avro 1.7.4, running on the IBM J9 VM (build 2.7, JRE 
> 1.7.0 Linux amd64-64 Compressed References 20140515_199835, JIT enabled, AOT 
> enabled). Not sure if the JDK matters, but it is NOT the Oracle JVM.
> We have an ETL implemented as a chain of MR jobs. One MR job merges 2 sets of 
> Avro data. Dataset1 is in HDFS location A, Dataset2 is in HDFS location B, 
> and both contain Avro records bound to the same Avro schema. Each record 
> contains a unique id field and a timestamp field. The MR job merges the 
> records by ID, replaces the earlier-timestamp record with the later one, and 
> emits the final Avro record. Very straightforward.
> Now we face a problem where one reducer keeps failing with the following 
> stacktrace on the JobTracker:
> {code}
> java.lang.IndexOutOfBoundsException
>   at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:191)
>   at java.io.DataInputStream.read(DataInputStream.java:160)
>   at 
> org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:184)
>   at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
>   at 
> org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
>   at 
> org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:143)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:125)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:121)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
>   at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
>   at 
> org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:108)
>   at 
> org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:48)
>   at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:142)
>   at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:117)
>   at 
> org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:297)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:165)
>   at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at 

[jira] [Commented] (AVRO-1786) Strange IndexOutofBoundException in GenericDatumReader.readString

2017-08-04 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114574#comment-16114574
 ] 

BELUGA BEHR commented on AVRO-1786:
---

In regards to the original case, it is interesting to note that the Reducer 
failed while [deserializing the 
key|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/ReduceContextImpl.java#L142].
  However, the Map output key is of type 
{{mergeJob.setMapOutputKeyClass(Contact2MergePartitionKey.class}} which is not 
an Avro class.

> Strange IndexOutofBoundException in GenericDatumReader.readString
> -
>
> Key: AVRO-1786
> URL: https://issues.apache.org/jira/browse/AVRO-1786
> Project: Avro
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.7.4, 1.7.7
> Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64
> Use IBM JVM:
> IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References 
> 20140515_199835 (JIT enabled, AOT enabled)
>Reporter: Yong Zhang
>Priority: Minor
>
> Our production cluster is CentOS 6.5 (2.6.32-358.14.1.el6.x86_64), running 
> IBM BigInsight V3.0.0.2. In Apache terms, it is Hadoop 2.2.0 with MRv1 (no 
> YARN), and it comes with Avro 1.7.4, running on the IBM J9 VM (build 2.7, JRE 
> 1.7.0 Linux amd64-64 Compressed References 20140515_199835, JIT enabled, AOT 
> enabled). Not sure if the JDK matters, but it is NOT the Oracle JVM.
> We have an ETL implemented as a chain of MR jobs. One MR job merges 2 sets of 
> Avro data. Dataset1 is in HDFS location A, Dataset2 is in HDFS location B, 
> and both contain Avro records bound to the same Avro schema. Each record 
> contains a unique id field and a timestamp field. The MR job merges the 
> records by ID, replaces the earlier-timestamp record with the later one, and 
> emits the final Avro record. Very straightforward.
> Now we face a problem where one reducer keeps failing with the following 
> stacktrace on the JobTracker:
> {code}
> java.lang.IndexOutOfBoundsException
>   at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:191)
>   at java.io.DataInputStream.read(DataInputStream.java:160)
>   at 
> org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:184)
>   at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
>   at 
> org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
>   at 
> org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:143)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:125)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:121)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
>   at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
>   at 
> org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:108)
>   at 
> org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:48)
>   at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:142)
>   at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:117)
>   at 
> org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:297)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:165)
>   at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at 
> java.security.AccessController.doPrivileged(AccessController.java:366)
>   at javax.security.auth.Subject.doAs(Subject.java:572)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1502)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> {code}
> Here are my Mapper and Reducer methods:
> Mapper:
> public void map(AvroKey key, NullWritable value, Context 
> context) throws IOException, InterruptedException 
> Reducer:
> protected void reduce(CustomPartitionKeyClass key, 
> Iterable values, Context context) throws 
> IOException, InterruptedException 
> What 

[jira] [Comment Edited] (AVRO-1786) Strange IndexOutofBoundException in GenericDatumReader.readString

2017-08-04 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114574#comment-16114574
 ] 

BELUGA BEHR edited comment on AVRO-1786 at 8/4/17 4:16 PM:
---

In regards to the original case, it is interesting to note that the Reducer 
failed while [deserializing the 
key|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/ReduceContextImpl.java#L142].
  However, the Map output key is of type 
{{mergeJob.setMapOutputKeyClass(Contact2MergePartitionKey.class}} which is not 
an {{AvroKey}} class.


was (Author: belugabehr):
In regards to the original case, it is interesting to note that the Reducer 
failed while [deserializing the 
key|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/ReduceContextImpl.java#L142].
  However, the Map output key is of type 
{{mergeJob.setMapOutputKeyClass(Contact2MergePartitionKey.class}} which is not 
an Avro class.

> Strange IndexOutofBoundException in GenericDatumReader.readString
> -
>
> Key: AVRO-1786
> URL: https://issues.apache.org/jira/browse/AVRO-1786
> Project: Avro
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.7.4, 1.7.7
> Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64
> Use IBM JVM:
> IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References 
> 20140515_199835 (JIT enabled, AOT enabled)
>Reporter: Yong Zhang
>Priority: Minor
>
> Our production cluster is CentOS 6.5 (2.6.32-358.14.1.el6.x86_64), running 
> IBM BigInsight V3.0.0.2. In Apache terms, it is Hadoop 2.2.0 with MRv1 (no 
> YARN), and it comes with Avro 1.7.4, running on the IBM J9 VM (build 2.7, JRE 
> 1.7.0 Linux amd64-64 Compressed References 20140515_199835, JIT enabled, AOT 
> enabled). Not sure if the JDK matters, but it is NOT the Oracle JVM.
> We have an ETL implemented as a chain of MR jobs. One MR job merges 2 sets of 
> Avro data. Dataset1 is in HDFS location A, Dataset2 is in HDFS location B, 
> and both contain Avro records bound to the same Avro schema. Each record 
> contains a unique id field and a timestamp field. The MR job merges the 
> records by ID, replaces the earlier-timestamp record with the later one, and 
> emits the final Avro record. Very straightforward.
> Now we face a problem where one reducer keeps failing with the following 
> stacktrace on the JobTracker:
> {code}
> java.lang.IndexOutOfBoundsException
>   at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:191)
>   at java.io.DataInputStream.read(DataInputStream.java:160)
>   at 
> org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:184)
>   at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
>   at 
> org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
>   at 
> org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:143)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:125)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:121)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
>   at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
>   at 
> org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:108)
>   at 
> org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:48)
>   at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:142)
>   at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:117)
>   at 
> org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:297)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:165)
>   at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at 
> java.security.AccessController.doPrivileged(AccessController.java:366)
>   

[jira] [Comment Edited] (AVRO-1786) Strange IndexOutofBoundException in GenericDatumReader.readString

2017-08-04 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114574#comment-16114574
 ] 

BELUGA BEHR edited comment on AVRO-1786 at 8/4/17 4:16 PM:
---

In regards to the original case, it is interesting to note that the Reducer 
failed while [deserializing the 
key|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/ReduceContextImpl.java#L142].
  However, the Map output key is of type 
{{mergeJob.setMapOutputKeyClass(Contact2MergePartitionKey.class)}} which is not 
an {{AvroKey}} class.


was (Author: belugabehr):
In regards to the original case, it is interesting to note that the Reducer 
failed while [deserializing the 
key|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/ReduceContextImpl.java#L142].
  However, the Map output key is of type 
{{mergeJob.setMapOutputKeyClass(Contact2MergePartitionKey.class}} which is not 
an {{AvroKey}} class.

> Strange IndexOutofBoundException in GenericDatumReader.readString
> -
>
> Key: AVRO-1786
> URL: https://issues.apache.org/jira/browse/AVRO-1786
> Project: Avro
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.7.4, 1.7.7
> Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64
> Use IBM JVM:
> IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References 
> 20140515_199835 (JIT enabled, AOT enabled)
>Reporter: Yong Zhang
>Priority: Minor
>
> Our production cluster is CentOS 6.5 (2.6.32-358.14.1.el6.x86_64), running 
> IBM BigInsight V3.0.0.2. In Apache terms, it is Hadoop 2.2.0 with MRv1 (no 
> YARN), and it comes with Avro 1.7.4, running on the IBM J9 VM (build 2.7, JRE 
> 1.7.0 Linux amd64-64 Compressed References 20140515_199835, JIT enabled, AOT 
> enabled). Not sure if the JDK matters, but it is NOT the Oracle JVM.
> We have an ETL implemented as a chain of MR jobs. One MR job merges 2 sets of 
> Avro data. Dataset1 is in HDFS location A, Dataset2 is in HDFS location B, 
> and both contain Avro records bound to the same Avro schema. Each record 
> contains a unique id field and a timestamp field. The MR job merges the 
> records by ID, replaces the earlier-timestamp record with the later one, and 
> emits the final Avro record. Very straightforward.
> Now we face a problem where one reducer keeps failing with the following 
> stacktrace on the JobTracker:
> {code}
> java.lang.IndexOutOfBoundsException
>   at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:191)
>   at java.io.DataInputStream.read(DataInputStream.java:160)
>   at 
> org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:184)
>   at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
>   at 
> org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
>   at 
> org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:143)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:125)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:121)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
>   at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
>   at 
> org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:108)
>   at 
> org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:48)
>   at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:142)
>   at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:117)
>   at 
> org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:297)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:165)
>   at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at 
> 

[jira] [Assigned] (AVRO-2061) Improve Invalid File Format Error Message

2017-08-03 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR reassigned AVRO-2061:
-

Assignee: BELUGA BEHR

> Improve Invalid File Format Error Message
> -
>
> Key: AVRO-2061
> URL: https://issues.apache.org/jira/browse/AVRO-2061
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.9.0, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2061.1.patch
>
>
> {code:title=org.apache.avro.file.DataFileStream}
> try {
>   vin.readFixed(magic); // read magic
> } catch (IOException e) {
>   throw new IOException("Not a data file.", e);
> }
> if (!Arrays.equals(DataFileConstants.MAGIC, magic))
>   throw new IOException("Not a data file.");
> {code}
> Please consider improving the error message here.  I just saw a MapReduce job 
> fail with an IOException with the message "Not a data file."  There were 
> definitely data files in the input directory; however, they were not Avro 
> files. It would have been much more helpful if the error had told me that.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2061) Improve Invalid File Format Error Message

2017-08-03 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2061:
--
Status: Patch Available  (was: Open)

> Improve Invalid File Format Error Message
> -
>
> Key: AVRO-2061
> URL: https://issues.apache.org/jira/browse/AVRO-2061
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.9.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2061.1.patch
>
>
> {code:title=org.apache.avro.file.DataFileStream}
> try {
>   vin.readFixed(magic); // read magic
> } catch (IOException e) {
>   throw new IOException("Not a data file.", e);
> }
> if (!Arrays.equals(DataFileConstants.MAGIC, magic))
>   throw new IOException("Not a data file.");
> {code}
> Please consider improving the error message here.  I just saw a MapReduce job 
> fail with an IOException with the message "Not a data file."  There were 
> definitely data files in the input directory; however, they were not Avro 
> files. It would have been much more helpful if the error had told me that.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2061) Improve Invalid File Format Error Message

2017-08-03 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2061:
--
Attachment: AVRO-2061.1.patch

> Improve Invalid File Format Error Message
> -
>
> Key: AVRO-2061
> URL: https://issues.apache.org/jira/browse/AVRO-2061
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.9.0, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2061.1.patch
>
>
> {code:title=org.apache.avro.file.DataFileStream}
> try {
>   vin.readFixed(magic); // read magic
> } catch (IOException e) {
>   throw new IOException("Not a data file.", e);
> }
> if (!Arrays.equals(DataFileConstants.MAGIC, magic))
>   throw new IOException("Not a data file.");
> {code}
> Please consider improving the error message here.  I just saw a MapReduce job 
> fail with an IOException with the message "Not a data file."  There were 
> definitely data files in the input directory; however, they were not Avro 
> files. It would have been much more helpful if the error had told me that.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AVRO-1786) Strange IndexOutofBoundException in GenericDatumReader.readString

2017-08-02 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112014#comment-16112014
 ] 

BELUGA BEHR commented on AVRO-1786:
---

[~cutting] [~rdblue] I've been tearing through the code of Avro and MapReduce 
to find a chink in the armor, but no luck.  Any thoughts on where I can focus 
my search?

> Strange IndexOutofBoundException in GenericDatumReader.readString
> -
>
> Key: AVRO-1786
> URL: https://issues.apache.org/jira/browse/AVRO-1786
> Project: Avro
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.7.4, 1.7.7
> Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64
> Use IBM JVM:
> IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References 
> 20140515_199835 (JIT enabled, AOT enabled)
>Reporter: Yong Zhang
>Priority: Minor
>
> Our production cluster is CentOS 6.5 (2.6.32-358.14.1.el6.x86_64), running 
> IBM BigInsight V3.0.0.2. In Apache terms, it is Hadoop 2.2.0 with MRv1 (no 
> YARN), and it comes with Avro 1.7.4, running on the IBM J9 VM (build 2.7, JRE 
> 1.7.0 Linux amd64-64 Compressed References 20140515_199835, JIT enabled, AOT 
> enabled). Not sure if the JDK matters, but it is NOT the Oracle JVM.
> We have an ETL implemented as a chain of MR jobs. One MR job merges 2 sets of 
> Avro data. Dataset1 is in HDFS location A, Dataset2 is in HDFS location B, 
> and both contain Avro records bound to the same Avro schema. Each record 
> contains a unique id field and a timestamp field. The MR job merges the 
> records by ID, replaces the earlier-timestamp record with the later one, and 
> emits the final Avro record. Very straightforward.
> Now we face a problem where one reducer keeps failing with the following 
> stacktrace on the JobTracker:
> {code}
> java.lang.IndexOutOfBoundsException
>   at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:191)
>   at java.io.DataInputStream.read(DataInputStream.java:160)
>   at 
> org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:184)
>   at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
>   at 
> org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
>   at 
> org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:143)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:125)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:121)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
>   at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
>   at 
> org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:108)
>   at 
> org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:48)
>   at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:142)
>   at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:117)
>   at 
> org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:297)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:165)
>   at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at 
> java.security.AccessController.doPrivileged(AccessController.java:366)
>   at javax.security.auth.Subject.doAs(Subject.java:572)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1502)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> {code}
> Here are my Mapper and Reducer methods:
> Mapper:
> public void map(AvroKey key, NullWritable value, Context 
> context) throws IOException, InterruptedException 
> Reducer:
> protected void reduce(CustomPartitionKeyClass key, 
> Iterable values, Context context) throws 
> IOException, InterruptedException 
> What bothers me are the following facts:
> 1) All the mappers finish without error
> 2) Most of the reducers finish without error, but one reducer keeps failing 
> with the above error.
> 3) It looks like it is caused by the data? But keep in mind that all the Avro 
> records passed through the mapper side, yet failed in one reducer. 
> 

[jira] [Comment Edited] (AVRO-1786) Strange IndexOutofBoundException in GenericDatumReader.readString

2017-08-01 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16110083#comment-16110083
 ] 

BELUGA BEHR edited comment on AVRO-1786 at 8/2/17 1:10 AM:
---

I created a Fuzzer to test random combinations of values to try to reproduce 
this issue, but no luck.  During my investigation, I hit [HADOOP-11678] because 
I forgot to set the Serializer to use 
{{org.apache.avro.hadoop.io.AvroSerialization}} and I was instead getting the 
default Hadoop Avro Serialization class of 
{{org.apache.hadoop.io.serializer.avro.AvroSerialization}}, which is very broken 
because it erroneously buffers data.  However, during the time I was not 
setting the Serialization class correctly, I was getting some failures with 
stack traces very similar to the one reported here.  So, it makes me think that 
there may be a buffer that isn't being flushed during serialization.  Once I 
corrected the mistake, my Fuzzer ran for several hours and did not encounter 
any issues serializing or deserializing the random data.  I was serializing the 
data directly to an 
[IFile|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/IFile.java]
 Writer and then used the IFile Reader to deserialize the data back into 
objects.
We are observing the same behaviors:

# All Mappers finish without error
# Most of the reducers finish without error, but one reducer keeps failing with 
the above error.

Interestingly, our Map Input and Output are the same classes.  That is to say, 
the Mappers filter and normalize the data; they don't create it.  The 
{{AvroKey}} and {{AvroValue}} parameters passed to the Mapper are the very ones 
returned by it; no {{AvroKey}} or {{AvroValue}} objects are instantiated in the 
Mapper.  They are passed directly through the Map method.  This demonstrates 
that the entire data set can be read by the Avro serialization framework, but 
something goes wrong during the intermediate file stage.  This may be an issue 
with how the Hadoop MapReduce framework interacts with Avro's library.
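
For reference, a minimal sketch of the serialization fix described above (the property name comes from Hadoop's {{SerializationFactory}}; {{job}} here is an assumed {{org.apache.hadoop.mapreduce.Job}}, and the exact value list may differ per cluster):

{code:java}
import org.apache.hadoop.conf.Configuration;

Configuration conf = job.getConfiguration();
// Register the Avro serialization from avro-mapred, not the broken
// org.apache.hadoop.io.serializer.avro.AvroSerialization default.
conf.setStrings("io.serializations",
    "org.apache.hadoop.io.serializer.WritableSerialization",
    "org.apache.avro.hadoop.io.AvroSerialization");
{code}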


was (Author: belugabehr):
I created a Fuzzer to test random combinations of values to try to reproduce 
this issue, but no luck.  During my investigation, I hit [HADOOP-11678] because 
I forgot to set the Serializer to use 
{{org.apache.avro.hadoop.io.AvroSerialization}} and I was instead getting the 
default Hadoop Avro Serialization class of 
{{org.apache.hadoop.io.serializer.avro.AvroSerialization}} which is very broken 
because it's erroneously buffering data.  However, during the time I was not 
setting the Serialization class correctly, I was getting some very similar 
errors that is recorded here.  So, it makes me think that there may be a buffer 
that isn't being flushed during serialization.  Once i corrected the mistake, 
my Fuzzer ran for several hours and did not encounter any issues serializing or 
deserializing the random data.  I was serializing the data directly to an 
[IFile|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/IFile.java]
 Reader and then using the IFile Reader to deserialize the back into objects.

We are observing the same behaviors:

# All Mappers finish without error
# Most of the reducers finish without error, but one reducer keeps failing with 
the above error.

Interestingly, our Map Input and Output are the same classes.  That is to say, 
the Mappers filter and normalize the data, they don't create it.  The 
{{AvroKey}} and {{AvroValue}} parameters passed to the Mapper are then returned 
by the Mapper.  There are no {{AvroKey}} or {{AvroValue}} classes instantiated 
in the Mapper.  They are passed directly through the Map method.  This 
demonstrates that the entire data set can be read by the Avro serialization 
framework, but something goes wrong during the intermediate file stage.  This 
may be an issue with how the Hadoop MapReduce framework interacts with Avro's 
library.

> Strange IndexOutofBoundException in GenericDatumReader.readString
> -
>
> Key: AVRO-1786
> URL: https://issues.apache.org/jira/browse/AVRO-1786
> Project: Avro
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.7.4, 1.7.7
> Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64
> Use IBM JVM:
> IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References 
> 20140515_199835 (JIT enabled, AOT enabled)
>Reporter: Yong Zhang
>Priority: Minor
>
> Our production cluster is CENTOS 6.5 (2.6.32-358.14.1.el6.x86_64), running 
> IBM BigInsight 

[jira] [Comment Edited] (AVRO-1786) Strange IndexOutofBoundException in GenericDatumReader.readString

2017-08-01 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16110083#comment-16110083
 ] 

BELUGA BEHR edited comment on AVRO-1786 at 8/2/17 1:08 AM:
---

I created a Fuzzer to test random combinations of values to try to reproduce 
this issue, but no luck.  During my investigation, I hit [HADOOP-11678] because 
I forgot to set the Serializer to use 
{{org.apache.avro.hadoop.io.AvroSerialization}} and I was instead getting the 
default Hadoop Avro Serialization class of 
{{org.apache.hadoop.io.serializer.avro.AvroSerialization}} which is very broken 
because it's erroneously buffering data.  However, during the time I was not 
setting the Serialization class correctly, I was getting some very similar 
errors that is recorded here.  So, it makes me think that there may be a buffer 
that isn't being flushed during serialization.  Once i corrected the mistake, 
my Fuzzer ran for several hours and did not encounter any issues serializing or 
deserializing the random data.  I was serializing the data directly to an 
[IFile|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/IFile.java]
 Reader and then using the IFile Reader to deserialize the back into objects.

We are observing the same behaviors:

# All Mappers finish without error
# Most of the reducers finish without error, but one reducer keeps failing with 
the above error.

Interestingly, our Map Input and Output are the same classes.  That is to say, 
the Mappers filter and normalize the data, they don't create it.  The 
{{AvroKey}} and {{AvroValue}} parameters passed to the Mapper are then returned 
by the Mapper.  There are no {{AvroKey}} or {{AvroValue}} classes instantiated 
in the Mapper.  They are passed directly through the Map method.  This 
demonstrates that the entire data set can be read by the Avro serialization 
framework, but something goes wrong during the intermediate file stage.  This 
may be an issue with how the Hadoop MapReduce framework interacts with Avro's 
library.


was (Author: belugabehr):
I created a Fuzzer to test random combinations of values to try to reproduce 
this issue, but no luck.  During my investigation, I hit [HADOOP-11678] because 
I forgot to set the Serializer to use 
{{org.apache.avro.hadoop.io;AvroSerialization}} and I was instead getting the 
default Hadoop Avro Serialization class of 
{{org.apache.hadoop.io.serializer.avro.AvroSerialization}} which is very broken 
because it's erroneously buffering data.  However, during the time I was not 
setting the Serialization class correctly, I was getting some very similar 
errors that is recorded here.  So, it makes me think that there may be a buffer 
that isn't being flushed during serialization.  Once i corrected the mistake, 
my Fuzzer ran for several hours and did not encounter any issues serializing or 
deserializing the random data.  I was serializing the data directly to an 
[IFile|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/IFile.java]
 Reader and then using the IFile Reader to deserialize the back into objects.

We are observing the same behaviors:

# All Mappers finish without error
# Most of the reducers finish without error, but one reducer keeps failing with 
the above error.

Interestingly, our Map input and Output are the same classes.  That is to say, 
the Mappers filter and normalize the data, they don't create it.  The 
{{AvroKey}} and {{AvroValue}} parameters passed to the Mapper are then returned 
by the Mapper.  There are no {{AvroKey}} or {{AvroValue}} classes instantiated 
in the Mapper.  They are passed directly through the Map method.  This 
demonstrates that the entire data set can be read by the Avro serialization 
framework, but something goes wrong during the intermediate file stage.  This 
may be an issue with how the Hadoop MapReduce framework interacts with Avro's 
library.

> Strange IndexOutofBoundException in GenericDatumReader.readString
> -
>
> Key: AVRO-1786
> URL: https://issues.apache.org/jira/browse/AVRO-1786
> Project: Avro
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.7.4, 1.7.7
> Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64
> Use IBM JVM:
> IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References 
> 20140515_199835 (JIT enabled, AOT enabled)
>Reporter: Yong Zhang
>Priority: Minor
>
> Our production cluster is CentOS 6.5 (2.6.32-358.14.1.el6.x86_64), running 
> IBM BigInsight V3.0.0.2. In Apache terms, it is 

[jira] [Commented] (AVRO-1786) Strange IndexOutofBoundException in GenericDatumReader.readString

2017-08-01 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16110083#comment-16110083
 ] 

BELUGA BEHR commented on AVRO-1786:
---

I created a Fuzzer to test random combinations of values to try to reproduce 
this issue, but no luck.  During my investigation, I hit [HADOOP-11678] because 
I forgot to set the Serializer to use 
{{org.apache.avro.hadoop.io;AvroSerialization}} and I was instead getting the 
default Hadoop Avro Serialization class of 
{{org.apache.hadoop.io.serializer.avro.AvroSerialization}} which is very broken 
because it's erroneously buffering data.  However, during the time I was not 
setting the Serialization class correctly, I was getting some very similar 
errors that is recorded here.  So, it makes me think that there may be a buffer 
that isn't being flushed during serialization.  Once i corrected the mistake, 
my Fuzzer ran for several hours and did not encounter any issues serializing or 
deserializing the random data.  I was serializing the data directly to an 
[IFile|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/IFile.java]
 Reader and then using the IFile Reader to deserialize the back into objects.

We are observing the same behaviors:

# All Mappers finish without error
# Most of the reducers finish without error, but one reducer keeps failing with 
the above error.

Interestingly, our Map input and Output are the same classes.  That is to say, 
the Mappers filter and normalize the data, they don't create it.  The 
{{AvroKey}} and {{AvroValue}} parameters passed to the Mapper are then returned 
by the Mapper.  There are no {{AvroKey}} or {{AvroValue}} classes instantiated 
in the Mapper.  They are passed directly through the Map method.  This 
demonstrates that the entire data set can be read by the Avro serialization 
framework, but something goes wrong during the intermediate file stage.  This 
may be an issue with how the Hadoop MapReduce framework interacts with Avro's 
library.

> Strange IndexOutofBoundException in GenericDatumReader.readString
> -
>
> Key: AVRO-1786
> URL: https://issues.apache.org/jira/browse/AVRO-1786
> Project: Avro
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.7.4, 1.7.7
> Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64
> Use IBM JVM:
> IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References 
> 20140515_199835 (JIT enabled, AOT enabled)
>Reporter: Yong Zhang
>Priority: Minor
>
> Our production cluster is CentOS 6.5 (2.6.32-358.14.1.el6.x86_64), running 
> IBM BigInsight V3.0.0.2. In Apache terms, it is Hadoop 2.2.0 with MRv1 (no 
> YARN), and it comes with Avro 1.7.4, running on the IBM J9 VM (build 2.7, JRE 
> 1.7.0 Linux amd64-64 Compressed References 20140515_199835, JIT enabled, AOT 
> enabled). Not sure if the JDK matters, but it is NOT the Oracle JVM.
> We have an ETL implemented as a chain of MR jobs. One MR job merges 2 sets of 
> Avro data. Dataset1 is in HDFS location A, Dataset2 is in HDFS location B, 
> and both contain Avro records bound to the same Avro schema. Each record 
> contains a unique id field and a timestamp field. The MR job merges the 
> records by ID, replaces the earlier-timestamp record with the later one, and 
> emits the final Avro record. Very straightforward.
> Now we face a problem where one reducer keeps failing with the following 
> stacktrace on the JobTracker:
> {code}
> java.lang.IndexOutOfBoundsException
>   at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:191)
>   at java.io.DataInputStream.read(DataInputStream.java:160)
>   at 
> org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:184)
>   at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
>   at 
> org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
>   at 
> org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:143)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:125)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:121)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
>   at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
>   at 
> 

[jira] [Updated] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2056:
--
Status: Patch Available  (was: In Progress)

> DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
> ---
>
> Key: AVRO-2056
> URL: https://issues.apache.org/jira/browse/AVRO-2056
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2056.1.patch
>
>
> Each call to {{writeDouble}} creates a new buffer and promptly throws it away, 
> even though the class has a re-usable buffer that is used in other methods 
> such as {{writeFloat}}.  Remove this extra buffer.
> {code:title=org.apache.avro.io.DirectBinaryEncoder}
>   // the buffer is used for writing floats, doubles, and large longs.
>   private final byte[] buf = new byte[12];
>   @Override
>   public void writeFloat(float f) throws IOException {
> int len = BinaryData.encodeFloat(f, buf, 0);
> out.write(buf, 0, len);
>   }
>   @Override
>   public void writeDouble(double d) throws IOException {
> byte[] buf = new byte[8];
> int len = BinaryData.encodeDouble(d, buf, 0);
> out.write(buf, 0, len);
>   }
> {code}
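
A sketch of the proposed change (the attached patch may differ): reuse the shared 12-byte {{buf}}, which is large enough for the 8 bytes a double needs.

{code:java}
@Override
public void writeDouble(double d) throws IOException {
  // encode into the instance-level buf instead of a per-call byte[8]
  int len = BinaryData.encodeDouble(d, buf, 0);
  out.write(buf, 0, len);
}
{code}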



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2054) Use StringBuilder instead of StringBuffer

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2054:
--
Status: Patch Available  (was: In Progress)

> Use StringBuilder instead of StringBuffer
> -
>
> Key: AVRO-2054
> URL: https://issues.apache.org/jira/browse/AVRO-2054
> Project: Avro
>  Issue Type: Improvement
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2054.1.patch, AVRO-2054.2.patch
>
>
> Use the un-synchronized StringBuilder instead of StringBuffer.  Use _char_ 
> values instead of Strings.
> {code:title=org.apache.trevni.MetaData}
>   @Override public String toString() {
> StringBuffer buffer = new StringBuffer();
> buffer.append("{ ");
> for (Map.Entry e : entrySet()) {
>   buffer.append(e.getKey());
>   buffer.append("=");
>   try {
> buffer.append(new String(e.getValue(), "ISO-8859-1"));
>   } catch (java.io.UnsupportedEncodingException error) {
> throw new TrevniRuntimeException(error);
>   }
>   buffer.append(" ");
> }
> buffer.append("}");
> return buffer.toString();
>   }
> {code}
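
For illustration, a sketch of the rewritten method (the {{Map.Entry<String, byte[]>}} type parameter is an assumption, since the archive stripped the generics; using {{StandardCharsets.ISO_8859_1}} is an extra tweak in the spirit of AVRO-2308 that removes the checked exception):

{code:java}
@Override public String toString() {
  StringBuilder buffer = new StringBuilder();
  buffer.append("{ ");
  for (Map.Entry<String, byte[]> e : entrySet()) {
    buffer.append(e.getKey());
    buffer.append('=');  // single char, no one-character String allocation
    // new String(byte[], Charset) never throws, so the try/catch disappears
    buffer.append(new String(e.getValue(), java.nio.charset.StandardCharsets.ISO_8859_1));
    buffer.append(' ');
  }
  buffer.append('}');
  return buffer.toString();
}
{code}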



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2048:
--
Status: Patch Available  (was: In Progress)

> Avro Binary Decoding - Gracefully Handle Long Strings
> -
>
> Key: AVRO-2048
> URL: https://issues.apache.org/jira/browse/AVRO-2048
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2048.1.patch, AVRO-2048.2.patch, AVRO-2048.3.patch
>
>
> According to the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]:
> bq. a string is encoded as a *long* followed by that many bytes of UTF-8 
> encoded character data.
> However, that is currently not being adhered to:
> {code:title=org.apache.avro.io.BinaryDecoder}
>   @Override
>   public Utf8 readString(Utf8 old) throws IOException {
> int length = readInt();
> Utf8 result = (old != null ? old : new Utf8());
> result.setByteLength(length);
> if (0 != length) {
>   doReadBytes(result.getBytes(), 0, length);
> }
> return result;
>   }
> {code}
> The first thing the code does here is to load an *int* value, not a *long*.  
> Because of the variable-length nature of the size encoding, this will mostly 
> work.  However, there may be edge cases where the serializer writes large 
> length values erroneously or nefariously. Let us gracefully detect such 
> scenarios and more closely adhere to the spec.
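
A sketch of a more spec-faithful guard (the attached patches may structure this differently):

{code:java}
@Override
public Utf8 readString(Utf8 old) throws IOException {
  long length = readLong();  // the spec encodes the length as a long
  if (length < 0 || length > Integer.MAX_VALUE) {
    // detect erroneous or nefarious lengths before allocating anything
    throw new IOException("Invalid string length in stream: " + length);
  }
  Utf8 result = (old != null ? old : new Utf8());
  result.setByteLength((int) length);
  if (0 != length) {
    doReadBytes(result.getBytes(), 0, (int) length);
  }
  return result;
}
{code}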



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2049:
--
Status: Patch Available  (was: In Progress)

> Remove Superfluous Configuration From AvroSerializer
> 
>
> Key: AVRO-2049
> URL: https://issues.apache.org/jira/browse/AVRO-2049
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch
>
>
> In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the 
> Avro block size is configured with a hard-coded value and there is a request 
> to benchmark different buffer sizes.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   /**
>* The block size for the Avro encoder.
>*
>* This number was copied from the AvroSerialization of 
> org.apache.avro.mapred in Avro 1.5.1.
>*
>* TODO(gwu): Do some benchmarking with different numbers here to see if it 
> is important.
>*/
>   private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;
>   /** An factory for creating Avro datum encoders. */
>   private static EncoderFactory mEncoderFactory
>   = new 
> EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
> {code}
> However, there is no need to benchmark: this setting is superfluous and is 
> ignored by the current implementation.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   @Override
>   public void open(OutputStream outputStream) throws IOException {
> mOutputStream = outputStream;
> mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
>   }
> {code}
> {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  
> This setting is only relevant for calls to 
> {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}} 
>  which considers the configured "Block Size" for doing binary encoding of 
> blocked Array types as laid out in the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  
> It can simply be removed.
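
For illustration, the simplification this implies (a sketch, assuming nothing else reads the constant):

{code:java}
/** A factory for creating Avro datum encoders; binaryEncoder() never reads
 *  the configured block size, so no configureBlockSize(...) call is needed. */
private static EncoderFactory mEncoderFactory = new EncoderFactory();
{code}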



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2054) Use StringBuilder instead of StringBuffer

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2054:
--
Status: In Progress  (was: Patch Available)

> Use StringBuilder instead of StringBuffer
> -
>
> Key: AVRO-2054
> URL: https://issues.apache.org/jira/browse/AVRO-2054
> Project: Avro
>  Issue Type: Improvement
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2054.1.patch, AVRO-2054.2.patch
>
>
> Use the un-synchronized StringBuilder instead of StringBuffer.  Use _char_ 
> values instead of Strings.
> {code:title=org.apache.trevni.MetaData}
>   @Override public String toString() {
> StringBuffer buffer = new StringBuffer();
> buffer.append("{ ");
> for (Map.Entry e : entrySet()) {
>   buffer.append(e.getKey());
>   buffer.append("=");
>   try {
> buffer.append(new String(e.getValue(), "ISO-8859-1"));
>   } catch (java.io.UnsupportedEncodingException error) {
> throw new TrevniRuntimeException(error);
>   }
>   buffer.append(" ");
> }
> buffer.append("}");
> return buffer.toString();
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2049:
--
Status: In Progress  (was: Patch Available)

> Remove Superfluous Configuration From AvroSerializer
> 
>
> Key: AVRO-2049
> URL: https://issues.apache.org/jira/browse/AVRO-2049
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch
>
>
> In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the 
> Avro block size is configured with a hard-coded value and there is a request 
> to benchmark different buffer sizes.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   /**
>* The block size for the Avro encoder.
>*
>* This number was copied from the AvroSerialization of 
> org.apache.avro.mapred in Avro 1.5.1.
>*
>* TODO(gwu): Do some benchmarking with different numbers here to see if it 
> is important.
>*/
>   private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;
>   /** A factory for creating Avro datum encoders. */
>   private static EncoderFactory mEncoderFactory
>   = new 
> EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
> {code}
> However, there is no need to benchmark: this setting is superfluous and is 
> ignored by the current implementation.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   @Override
>   public void open(OutputStream outputStream) throws IOException {
> mOutputStream = outputStream;
> mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
>   }
> {code}
> {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  
> This setting is only relevant for calls to 
> {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}} 
>  which considers the configured "Block Size" for doing binary encoding of 
> blocked Array types as laid out in the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  
> It can simply be removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2049:
--
Status: Patch Available  (was: In Progress)

> Remove Superfluous Configuration From AvroSerializer
> 
>
> Key: AVRO-2049
> URL: https://issues.apache.org/jira/browse/AVRO-2049
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch
>
>
> In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the 
> Avro block size is configured with a hard-coded value and there is a request 
> to benchmark different buffer sizes.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   /**
>* The block size for the Avro encoder.
>*
>* This number was copied from the AvroSerialization of 
> org.apache.avro.mapred in Avro 1.5.1.
>*
>* TODO(gwu): Do some benchmarking with different numbers here to see if it 
> is important.
>*/
>   private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;
>   /** A factory for creating Avro datum encoders. */
>   private static EncoderFactory mEncoderFactory
>   = new 
> EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
> {code}
> However, there is no need to benchmark: this setting is superfluous and is 
> ignored by the current implementation.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   @Override
>   public void open(OutputStream outputStream) throws IOException {
> mOutputStream = outputStream;
> mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
>   }
> {code}
> {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  
> This setting is only relevant for calls to 
> {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}} 
>  which considers the configured "Block Size" for doing binary encoding of 
> blocked Array types as laid out in the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  
> It can simply be removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-26 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101827#comment-16101827
 ] 

BELUGA BEHR commented on AVRO-2049:
---

[~nkollar] Maybe so, we can reuse the encoder.  I'm not sure how often "open" is 
actually called though.  New ticket?

> Remove Superfluous Configuration From AvroSerializer
> 
>
> Key: AVRO-2049
> URL: https://issues.apache.org/jira/browse/AVRO-2049
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch
>
>
> In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the 
> Avro block size is configured with a hard-coded value and there is a request 
> to benchmark different buffer sizes.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   /**
>* The block size for the Avro encoder.
>*
>* This number was copied from the AvroSerialization of 
> org.apache.avro.mapred in Avro 1.5.1.
>*
>* TODO(gwu): Do some benchmarking with different numbers here to see if it 
> is important.
>*/
>   private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;
>   /** A factory for creating Avro datum encoders. */
>   private static EncoderFactory mEncoderFactory
>   = new 
> EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
> {code}
> However, there is no need to benchmark: this setting is superfluous and is 
> ignored by the current implementation.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   @Override
>   public void open(OutputStream outputStream) throws IOException {
> mOutputStream = outputStream;
> mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
>   }
> {code}
> {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  
> This setting is only relevant for calls to 
> {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}} 
>  which considers the configured "Block Size" for doing binary encoding of 
> blocked Array types as laid out in the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  
> It can simply be removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2056:
--
Status: In Progress  (was: Patch Available)

> DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
> ---
>
> Key: AVRO-2056
> URL: https://issues.apache.org/jira/browse/AVRO-2056
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2056.1.patch
>
>
> Each call to {{writeDouble}} creates a new buffer and promptly throws it away 
> even though the class has a re-usable buffer that is used in other methods 
> such as {{writeFloat}}.  Remove this extra buffer.
> {code:title=org.apache.avro.io.DirectBinaryEncoder}
>   // the buffer is used for writing floats, doubles, and large longs.
>   private final byte[] buf = new byte[12];
>   @Override
>   public void writeFloat(float f) throws IOException {
> int len = BinaryData.encodeFloat(f, buf, 0);
> out.write(buf, 0, len);
>   }
>   @Override
>   public void writeDouble(double d) throws IOException {
> byte[] buf = new byte[8];
> int len = BinaryData.encodeDouble(d, buf, 0);
> out.write(buf, 0, len);
>   }
> {code}
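The fix is a one-liner: reuse the instance buffer exactly as {{writeFloat}} 
already does.  A sketch (the shared 12-byte {{buf}} comfortably holds the 8 
bytes a double needs):

{code:java|title=writeDouble reusing the shared buffer}
  @Override
  public void writeDouble(double d) throws IOException {
    // encodeDouble writes 8 bytes into the shared 12-byte instance buffer
    int len = BinaryData.encodeDouble(d, buf, 0);
    out.write(buf, 0, len);
  }
{code}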



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble

2017-07-26 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101805#comment-16101805
 ] 

BELUGA BEHR commented on AVRO-2056:
---

[~gszadovszky] I'm sorry, but I do not understand your request to "add the Perf 
change."  I did not change the Perf test to accomplish the testing other than 
un-commenting the direct binary encoder:

{code}
  Encoder e = encoder_factory.binaryEncoder(out, null);
//Encoder e = encoder_factory.directBinaryEncoder(out, null);
{code}

> DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
> ---
>
> Key: AVRO-2056
> URL: https://issues.apache.org/jira/browse/AVRO-2056
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2056.1.patch
>
>
> Each call to {{writeDouble}} creates a new buffer and promptly throws it away 
> even though the class has a re-usable buffer that is used in other methods 
> such as {{writeFloat}}.  Remove this extra buffer.
> {code:title=org.apache.avro.io.DirectBinaryEncoder}
>   // the buffer is used for writing floats, doubles, and large longs.
>   private final byte[] buf = new byte[12];
>   @Override
>   public void writeFloat(float f) throws IOException {
> int len = BinaryData.encodeFloat(f, buf, 0);
> out.write(buf, 0, len);
>   }
>   @Override
>   public void writeDouble(double d) throws IOException {
> byte[] buf = new byte[8];
> int len = BinaryData.encodeDouble(d, buf, 0);
> out.write(buf, 0, len);
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-26 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101798#comment-16101798
 ] 

BELUGA BEHR commented on AVRO-2049:
---

[~gszadovszky] Good catch.   Thanks.  Perhaps we should create a new ticket to 
remove the magic number.

> Remove Superfluous Configuration From AvroSerializer
> 
>
> Key: AVRO-2049
> URL: https://issues.apache.org/jira/browse/AVRO-2049
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch
>
>
> In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the 
> Avro block size is configured with a hard-coded value and there is a request 
> to benchmark different buffer sizes.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   /**
>* The block size for the Avro encoder.
>*
>* This number was copied from the AvroSerialization of 
> org.apache.avro.mapred in Avro 1.5.1.
>*
>* TODO(gwu): Do some benchmarking with different numbers here to see if it 
> is important.
>*/
>   private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;
>   /** A factory for creating Avro datum encoders. */
>   private static EncoderFactory mEncoderFactory
>   = new 
> EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
> {code}
> However, there is no need to benchmark: this setting is superfluous and is 
> ignored by the current implementation.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   @Override
>   public void open(OutputStream outputStream) throws IOException {
> mOutputStream = outputStream;
> mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
>   }
> {code}
> {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  
> This setting is only relevant for calls to 
> {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}} 
>  which considers the configured "Block Size" for doing binary encoding of 
> blocked Array types as laid out in the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  
> It can simply be removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2049:
--
Attachment: AVRO-2049.3.patch

> Remove Superfluous Configuration From AvroSerializer
> 
>
> Key: AVRO-2049
> URL: https://issues.apache.org/jira/browse/AVRO-2049
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch
>
>
> In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the 
> Avro block size is configured with a hard-coded value and there is a request 
> to benchmark different buffer sizes.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   /**
>* The block size for the Avro encoder.
>*
>* This number was copied from the AvroSerialization of 
> org.apache.avro.mapred in Avro 1.5.1.
>*
>* TODO(gwu): Do some benchmarking with different numbers here to see if it 
> is important.
>*/
>   private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;
>   /** A factory for creating Avro datum encoders. */
>   private static EncoderFactory mEncoderFactory
>   = new 
> EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
> {code}
> However, there is no need to benchmark: this setting is superfluous and is 
> ignored by the current implementation.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   @Override
>   public void open(OutputStream outputStream) throws IOException {
> mOutputStream = outputStream;
> mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
>   }
> {code}
> {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  
> This setting is only relevant for calls to 
> {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}} 
>  which considers the configured "Block Size" for doing binary encoding of 
> blocked Array types as laid out in the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  
> It can simply be removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2049:
--
Status: In Progress  (was: Patch Available)

> Remove Superfluous Configuration From AvroSerializer
> 
>
> Key: AVRO-2049
> URL: https://issues.apache.org/jira/browse/AVRO-2049
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch
>
>
> In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the 
> Avro block size is configured with a hard-coded value and there is a request 
> to benchmark different buffer sizes.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   /**
>* The block size for the Avro encoder.
>*
>* This number was copied from the AvroSerialization of 
> org.apache.avro.mapred in Avro 1.5.1.
>*
>* TODO(gwu): Do some benchmarking with different numbers here to see if it 
> is important.
>*/
>   private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;
>   /** A factory for creating Avro datum encoders. */
>   private static EncoderFactory mEncoderFactory
>   = new 
> EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
> {code}
> However, there is no need to benchmark: this setting is superfluous and is 
> ignored by the current implementation.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   @Override
>   public void open(OutputStream outputStream) throws IOException {
> mOutputStream = outputStream;
> mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
>   }
> {code}
> {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  
> This setting is only relevant for calls to 
> {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}} 
>  which considers the configured "Block Size" for doing binary encoding of 
> blocked Array types as laid out in the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  
> It can simply be removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings

2017-07-26 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101786#comment-16101786
 ] 

BELUGA BEHR commented on AVRO-2048:
---

[~gszadovszky] I borrowed the same implementation from 
[here|http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/tip/src/share/classes/java/util/ArrayList.java#l190]

> Avro Binary Decoding - Gracefully Handle Long Strings
> -
>
> Key: AVRO-2048
> URL: https://issues.apache.org/jira/browse/AVRO-2048
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2048.1.patch, AVRO-2048.2.patch, AVRO-2048.3.patch
>
>
> According to the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]:
> bq. a string is encoded as a *long* followed by that many bytes of UTF-8 
> encoded character data.
> However, that is currently not being adhered to:
> {code:title=org.apache.avro.io.BinaryDecoder}
>   @Override
>   public Utf8 readString(Utf8 old) throws IOException {
> int length = readInt();
> Utf8 result = (old != null ? old : new Utf8());
> result.setByteLength(length);
> if (0 != length) {
>   doReadBytes(result.getBytes(), 0, length);
> }
> return result;
>   }
> {code}
> The first thing the code does here is to load an *int* value, not a *long*.  
> Because of the variable-length nature of the size, this will mostly work.  
> However, there may be edge cases where the serializer is putting in large 
> length values erroneously or nefariously. Let us gracefully detect such 
> scenarios and more closely adhere to the spec.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2048:
--
Attachment: AVRO-2048.3.patch

> Avro Binary Decoding - Gracefully Handle Long Strings
> -
>
> Key: AVRO-2048
> URL: https://issues.apache.org/jira/browse/AVRO-2048
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2048.1.patch, AVRO-2048.2.patch, AVRO-2048.3.patch
>
>
> According to the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]:
> bq. a string is encoded as a *long* followed by that many bytes of UTF-8 
> encoded character data.
> However, that is currently not being adhered to:
> {code:title=org.apache.avro.io.BinaryDecoder}
>   @Override
>   public Utf8 readString(Utf8 old) throws IOException {
> int length = readInt();
> Utf8 result = (old != null ? old : new Utf8());
> result.setByteLength(length);
> if (0 != length) {
>   doReadBytes(result.getBytes(), 0, length);
> }
> return result;
>   }
> {code}
> The first thing the code does here is to load an *int* value, not a *long*.  
> Because of the variable-length nature of the size, this will mostly work.  
> However, there may be edge cases where the serializer is putting in large 
> length values erroneously or nefariously. Let us gracefully detect such 
> scenarios and more closely adhere to the spec.
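A sketch of the kind of guard being proposed, using the {{readLong()}} that 
{{BinaryDecoder}} already provides; the exact bound and exception type are 
illustrative, not necessarily what the attached patch uses:

{code:java|title=Reading the length as a long and failing gracefully}
  @Override
  public Utf8 readString(Utf8 old) throws IOException {
    long length = readLong();
    if (length < 0L || length > Integer.MAX_VALUE) {
      throw new IOException("Malformed data: string length " + length
          + " is outside the supported range");
    }
    Utf8 result = (old != null ? old : new Utf8());
    result.setByteLength((int) length);
    if (0L != length) {
      doReadBytes(result.getBytes(), 0, (int) length);
    }
    return result;
  }
{code}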



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble

2017-07-22 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16096959#comment-16096959
 ] 

BELUGA BEHR edited comment on AVRO-2056 at 7/22/17 3:20 PM:


With the patch applied, writing Doubles is between 1% and 2% faster and has the 
added benefit of cutting down on garbage collection.

{code}
# Avro Master + Patch

DoubleWrite:   2762 ms  72.407   579.258   200
DoubleWrite:   2786 ms  71.787   574.293   200
DoubleWrite:   2755 ms  72.570   580.561   200

# Avro Master

DoubleWrite:   2822 ms  70.871   566.965   200
DoubleWrite:   2830 ms  70.667   565.336   200
DoubleWrite:   2807 ms  71.230   569.842   200
{code}


was (Author: belugabehr):
With the patch applied, writing Doubles is between 1% and 2% faster and has the 
added benefit of cutting down on garbage collection.

{code}
# Buffer Removed

DoubleWrite:   2762 ms  72.407   579.258   200
DoubleWrite:   2786 ms  71.787   574.293   200
DoubleWrite:   2755 ms  72.570   580.561   200

# Buffer Present

DoubleWrite:   2822 ms  70.871   566.965   200
DoubleWrite:   2830 ms  70.667   565.336   200
DoubleWrite:   2807 ms  71.230   569.842   200
{code}

> DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
> ---
>
> Key: AVRO-2056
> URL: https://issues.apache.org/jira/browse/AVRO-2056
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2056.1.patch
>
>
> Each call to {{writeDouble}} creates a new buffer and promptly throws it away 
> even though the class has a re-usable buffer that is used in other methods 
> such as {{writeFloat}}.  Remove this extra buffer.
> {code:title=org.apache.avro.io.DirectBinaryEncoder}
>   // the buffer is used for writing floats, doubles, and large longs.
>   private final byte[] buf = new byte[12];
>   @Override
>   public void writeFloat(float f) throws IOException {
> int len = BinaryData.encodeFloat(f, buf, 0);
> out.write(buf, 0, len);
>   }
>   @Override
>   public void writeDouble(double d) throws IOException {
> byte[] buf = new byte[8];
> int len = BinaryData.encodeDouble(d, buf, 0);
> out.write(buf, 0, len);
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings

2017-07-22 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097362#comment-16097362
 ] 

BELUGA BEHR commented on AVRO-2048:
---

I can't explain why this is, but it seems to be a tad faster with this patch:

{code}
# Avro Master
StringRead:   3291 ms  12.152   432.835   1780910
StringRead:   3290 ms  12.155   432.949   1780910
StringRead:   3287 ms  12.166   433.320   1780910

# Avro Master + Patch
StringRead:   3270 ms  12.229   435.574   1780910
StringRead:   3288 ms  12.163   433.241   1780910
StringRead:   3271 ms  12.227   435.521   1780910
{code}

> Avro Binary Decoding - Gracefully Handle Long Strings
> -
>
> Key: AVRO-2048
> URL: https://issues.apache.org/jira/browse/AVRO-2048
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2048.1.patch, AVRO-2048.2.patch
>
>
> According to the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]:
> bq. a string is encoded as a *long* followed by that many bytes of UTF-8 
> encoded character data.
> However, that is currently not being adhered to:
> {code:title=org.apache.avro.io.BinaryDecoder}
>   @Override
>   public Utf8 readString(Utf8 old) throws IOException {
> int length = readInt();
> Utf8 result = (old != null ? old : new Utf8());
> result.setByteLength(length);
> if (0 != length) {
>   doReadBytes(result.getBytes(), 0, length);
> }
> return result;
>   }
> {code}
> The first thing the code does here is to load an *int* value, not a *long*.  
> Because of the variable-length nature of the size, this will mostly work.  
> However, there may be edge cases where the serializer is putting in large 
> length values erroneously or nefariously. Let us gracefully detect such 
> scenarios and more closely adhere to the spec.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (AVRO-2052) Remove org.apache.avro.file.DataFileWriter Double Buffering

2017-07-22 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR resolved AVRO-2052.
---
Resolution: Not A Problem

[~cutting] I just ran the Perf test.  Indeed, the {{directBinaryEncoder}} is 
slower than the {{binaryEncoder}} even when writing into a buffered 
OutputStream.  Closing the ticket.
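For anyone repeating the measurement, the two choices compared are simply the 
following (the demo class name is invented for illustration):

{code:java|title=EncoderChoiceDemo.java}
import java.io.ByteArrayOutputStream;

import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class EncoderChoiceDemo {
  public static void main(String[] args) throws Exception {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    EncoderFactory efactory = new EncoderFactory();

    // What DataFileWriter.init() uses today: an encoder with its own buffer.
    BinaryEncoder buffered = efactory.binaryEncoder(out, null);
    buffered.writeDouble(1.0d);
    buffered.flush();  // buffered encoders must be flushed

    // The alternative considered here: no encoder-side buffer, writing
    // straight through to the (already buffered) destination stream.
    BinaryEncoder direct = efactory.directBinaryEncoder(out, null);
    direct.writeDouble(2.0d);
  }
}
{code}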

> Remove org.apache.avro.file.DataFileWriter Double Buffering
> ---
>
> Key: AVRO-2052
> URL: https://issues.apache.org/jira/browse/AVRO-2052
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2052.1.patch
>
>
> {code:title=org.apache.avro.file.DataFileWriter}
>   private void init(OutputStream outs) throws IOException {
> this.underlyingStream = outs;
> this.out = new BufferedFileOutputStream(outs);
> EncoderFactory efactory = new EncoderFactory();
> this.vout = efactory.binaryEncoder(out, null);
> dout.setSchema(schema);
> buffer = new NonCopyingByteArrayOutputStream(
> Math.min((int)(syncInterval * 1.25), Integer.MAX_VALUE/2 -1));
> this.bufOut = efactory.binaryEncoder(buffer, null);
> if (this.codec == null) {
>   this.codec = CodecFactory.nullCodec().createInstance();
> }
> this.isOpen = true;
>   }
> {code}
> It's clear here that both streams are writing to a buffered destination 
> ({{BufferedFileOutputStream}} and {{ByteArrayOutputStream}}), so there is 
> no reason for a buffered encoder; instead, write directly to the 
> buffered streams with {{directBinaryEncoder}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble

2017-07-21 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16096959#comment-16096959
 ] 

BELUGA BEHR commented on AVRO-2056:
---

With the patch applied, writing Doubles is between 1% and 2% faster and has the 
added benefit of cutting down on garbage collection.

{code}
# Buffer Removed

DoubleWrite:   2762 ms  72.407   579.258   200
DoubleWrite:   2786 ms  71.787   574.293   200
DoubleWrite:   2755 ms  72.570   580.561   200

# Buffer Present

DoubleWrite:   2822 ms  70.871   566.965   200
DoubleWrite:   2830 ms  70.667   565.336   200
DoubleWrite:   2807 ms  71.230   569.842   200
{code}

> DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
> ---
>
> Key: AVRO-2056
> URL: https://issues.apache.org/jira/browse/AVRO-2056
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2056.1.patch
>
>
> Each call to {{writeDouble}} creates a new buffer and promptly throws it away 
> even though the class has a re-usable buffer that is used in other methods 
> such as {{writeFloat}}.  Remove this extra buffer.
> {code:title=org.apache.avro.io.DirectBinaryEncoder}
>   // the buffer is used for writing floats, doubles, and large longs.
>   private final byte[] buf = new byte[12];
>   @Override
>   public void writeFloat(float f) throws IOException {
> int len = BinaryData.encodeFloat(f, buf, 0);
> out.write(buf, 0, len);
>   }
>   @Override
>   public void writeDouble(double d) throws IOException {
> byte[] buf = new byte[8];
> int len = BinaryData.encodeDouble(d, buf, 0);
> out.write(buf, 0, len);
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings

2017-07-21 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2048:
--
Attachment: AVRO-2048.2.patch

Fixed checkstyle failures

> Avro Binary Decoding - Gracefully Handle Long Strings
> -
>
> Key: AVRO-2048
> URL: https://issues.apache.org/jira/browse/AVRO-2048
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2048.1.patch, AVRO-2048.2.patch
>
>
> According to the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]:
> bq. a string is encoded as a *long* followed by that many bytes of UTF-8 
> encoded character data.
> However, that is currently not being adhered to:
> {code:title=org.apache.avro.io.BinaryDecoder}
>   @Override
>   public Utf8 readString(Utf8 old) throws IOException {
> int length = readInt();
> Utf8 result = (old != null ? old : new Utf8());
> result.setByteLength(length);
> if (0 != length) {
>   doReadBytes(result.getBytes(), 0, length);
> }
> return result;
>   }
> {code}
> The first thing the code does here is to load an *int* value, not a *long*.  
> Because of the variable-length nature of the size, this will mostly work.  
> However, there may be edge cases where the serializer is putting in large 
> length values erroneously or nefariously. Let us gracefully detect such 
> scenarios and more closely adhere to the spec.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2054) Use StringBuilder instead of StringBuffer

2017-07-21 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2054:
--
Attachment: AVRO-2054.2.patch

Renamed the automatic variable from _buffer_ to _builder_.

> Use StringBuilder instead of StringBuffer
> -
>
> Key: AVRO-2054
> URL: https://issues.apache.org/jira/browse/AVRO-2054
> Project: Avro
>  Issue Type: Improvement
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2054.1.patch, AVRO-2054.2.patch
>
>
> Use the unsynchronized StringBuilder instead of StringBuffer.  Use _char_ 
> values instead of Strings.
> {code:title=org.apache.trevni.MetaData}
>   @Override public String toString() {
> StringBuffer buffer = new StringBuffer();
> buffer.append("{ ");
> for (Map.Entry<String,byte[]> e : entrySet()) {
>   buffer.append(e.getKey());
>   buffer.append("=");
>   try {
> buffer.append(new String(e.getValue(), "ISO-8859-1"));
>   } catch (java.io.UnsupportedEncodingException error) {
> throw new TrevniRuntimeException(error);
>   }
>   buffer.append(" ");
> }
> buffer.append("}");
> return buffer.toString();
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Issue Comment Deleted] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble

2017-07-21 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2056:
--
Comment: was deleted

(was: Renamed the automatic variable from _buffer_ to _builder_.)

> DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
> ---
>
> Key: AVRO-2056
> URL: https://issues.apache.org/jira/browse/AVRO-2056
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2056.1.patch
>
>
> Each call to {{writeDouble}} creates a new buffer and promptly throws it away 
> even though the class has a re-usable buffer that is used in other methods 
> such as {{writeFloat}}.  Remove this extra buffer.
> {code:title=org.apache.avro.io.DirectBinaryEncoder}
>   // the buffer is used for writing floats, doubles, and large longs.
>   private final byte[] buf = new byte[12];
>   @Override
>   public void writeFloat(float f) throws IOException {
> int len = BinaryData.encodeFloat(f, buf, 0);
> out.write(buf, 0, len);
>   }
>   @Override
>   public void writeDouble(double d) throws IOException {
> byte[] buf = new byte[8];
> int len = BinaryData.encodeDouble(d, buf, 0);
> out.write(buf, 0, len);
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble

2017-07-21 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2056:
--
Attachment: AVRO-2054.2.patch

Renamed the automatic variable from _buffer_ to _builder_.

> DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
> ---
>
> Key: AVRO-2056
> URL: https://issues.apache.org/jira/browse/AVRO-2056
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2056.1.patch
>
>
> Each call to {{writeDouble}} creates a new buffer and promptly throws it away 
> even though the class has a re-usable buffer that is used in other methods 
> such as {{writeFloat}}.  Remove this extra buffer.
> {code:title=org.apache.avro.io.DirectBinaryEncoder}
>   // the buffer is used for writing floats, doubles, and large longs.
>   private final byte[] buf = new byte[12];
>   @Override
>   public void writeFloat(float f) throws IOException {
> int len = BinaryData.encodeFloat(f, buf, 0);
> out.write(buf, 0, len);
>   }
>   @Override
>   public void writeDouble(double d) throws IOException {
> byte[] buf = new byte[8];
> int len = BinaryData.encodeDouble(d, buf, 0);
> out.write(buf, 0, len);
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble

2017-07-21 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2056:
--
Attachment: (was: AVRO-2054.2.patch)

> DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
> ---
>
> Key: AVRO-2056
> URL: https://issues.apache.org/jira/browse/AVRO-2056
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2056.1.patch
>
>
> Each call to {{writeDouble}} creates a new buffer and promptly throws it away 
> even though the class has a re-usable buffer that is used in other methods 
> such as {{writeFloat}}.  Remove this extra buffer.
> {code:title=org.apache.avro.io.DirectBinaryEncoder}
>   // the buffer is used for writing floats, doubles, and large longs.
>   private final byte[] buf = new byte[12];
>   @Override
>   public void writeFloat(float f) throws IOException {
> int len = BinaryData.encodeFloat(f, buf, 0);
> out.write(buf, 0, len);
>   }
>   @Override
>   public void writeDouble(double d) throws IOException {
> byte[] buf = new byte[8];
> int len = BinaryData.encodeDouble(d, buf, 0);
> out.write(buf, 0, len);
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble

2017-07-18 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2056:
--
Affects Version/s: 1.7.7
   1.8.2

> DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
> ---
>
> Key: AVRO-2056
> URL: https://issues.apache.org/jira/browse/AVRO-2056
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2056.1.patch
>
>
> Each call to {{writeDouble}} creates a new buffer and promptly throws it away 
> even though the class has a re-usable buffer that is used in other methods 
> such as {{writeFloat}}.  Remove this extra buffer.
> {code:title=org.apache.avro.io.DirectBinaryEncoder}
>   // the buffer is used for writing floats, doubles, and large longs.
>   private final byte[] buf = new byte[12];
>   @Override
>   public void writeFloat(float f) throws IOException {
> int len = BinaryData.encodeFloat(f, buf, 0);
> out.write(buf, 0, len);
>   }
>   @Override
>   public void writeDouble(double d) throws IOException {
> byte[] buf = new byte[8];
> int len = BinaryData.encodeDouble(d, buf, 0);
> out.write(buf, 0, len);
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble

2017-07-18 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2056:
--
Attachment: AVRO-2056.1.patch

> DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
> ---
>
> Key: AVRO-2056
> URL: https://issues.apache.org/jira/browse/AVRO-2056
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Reporter: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2056.1.patch
>
>
> Each call to {{writeDouble}} creates a new buffer and promptly throws it away 
> even though the class has a re-usable buffer that is used in other methods 
> such as {{writeFloat}}.  Remove this extra buffer.
> {code:title=org.apache.avro.io.DirectBinaryEncoder}
>   // the buffer is used for writing floats, doubles, and large longs.
>   private final byte[] buf = new byte[12];
>   @Override
>   public void writeFloat(float f) throws IOException {
> int len = BinaryData.encodeFloat(f, buf, 0);
> out.write(buf, 0, len);
>   }
>   @Override
>   public void writeDouble(double d) throws IOException {
> byte[] buf = new byte[8];
> int len = BinaryData.encodeDouble(d, buf, 0);
> out.write(buf, 0, len);
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble

2017-07-18 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR reassigned AVRO-2056:
-

Assignee: BELUGA BEHR

> DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
> ---
>
> Key: AVRO-2056
> URL: https://issues.apache.org/jira/browse/AVRO-2056
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2056.1.patch
>
>
> Each call to {{writeDouble}} creates a new buffer and promptly throws it away 
> even though the class has a re-usable buffer that is used in other methods 
> such as {{writeFloat}}.  Remove this extra buffer.
> {code:title=org.apache.avro.io.DirectBinaryEncoder}
>   // the buffer is used for writing floats, doubles, and large longs.
>   private final byte[] buf = new byte[12];
>   @Override
>   public void writeFloat(float f) throws IOException {
> int len = BinaryData.encodeFloat(f, buf, 0);
> out.write(buf, 0, len);
>   }
>   @Override
>   public void writeDouble(double d) throws IOException {
> byte[] buf = new byte[8];
> int len = BinaryData.encodeDouble(d, buf, 0);
> out.write(buf, 0, len);
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble

2017-07-18 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created AVRO-2056:
-

 Summary: DirectBinaryEncoder Creates Buffer For Each Call To 
writeDouble
 Key: AVRO-2056
 URL: https://issues.apache.org/jira/browse/AVRO-2056
 Project: Avro
  Issue Type: Improvement
  Components: java
Reporter: BELUGA BEHR
Priority: Minor


Each call to {{writeDouble}} creates a new buffer and promptly throws it away 
even though the class has a re-usable buffer that is used in other methods such 
as {{writeFloat}}.  Remove this extra buffer.

{code:title=org.apache.avro.io.DirectBinaryEncoder}
  // the buffer is used for writing floats, doubles, and large longs.
  private final byte[] buf = new byte[12];

  @Override
  public void writeFloat(float f) throws IOException {
int len = BinaryData.encodeFloat(f, buf, 0);
out.write(buf, 0, len);
  }

  @Override
  public void writeDouble(double d) throws IOException {
byte[] buf = new byte[8];
int len = BinaryData.encodeDouble(d, buf, 0);
out.write(buf, 0, len);
  }
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble

2017-07-18 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2056:
--
Status: Patch Available  (was: Open)

> DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
> ---
>
> Key: AVRO-2056
> URL: https://issues.apache.org/jira/browse/AVRO-2056
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2056.1.patch
>
>
> Each call to {{writeDouble}} creates a new buffer and promptly throws it away 
> even though the class has a re-usable buffer that is used in other methods 
> such as {{writeFloat}}.  Remove this extra buffer.
> {code:title=org.apache.avro.io.DirectBinaryEncoder}
>   // the buffer is used for writing floats, doubles, and large longs.
>   private final byte[] buf = new byte[12];
>   @Override
>   public void writeFloat(float f) throws IOException {
> int len = BinaryData.encodeFloat(f, buf, 0);
> out.write(buf, 0, len);
>   }
>   @Override
>   public void writeDouble(double d) throws IOException {
> byte[] buf = new byte[8];
> int len = BinaryData.encodeDouble(d, buf, 0);
> out.write(buf, 0, len);
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AVRO-2052) Remove org.apache.avro.file.DataFileWriter Double Buffering

2017-07-18 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092115#comment-16092115
 ] 

BELUGA BEHR commented on AVRO-2052:
---

[~cutting] There is a small amount of buffering in {{DirectBinaryEncoder}}: it 
has a 12-byte instance buffer.  However, in one case this buffer is not used 
and a new array is instantiated on each invocation.  Seems like an oversight to 
me.  Thoughts?

{code:title=org.apache.avro.io.DirectBinaryEncoder.writeDouble(double)}
  @Override
  public void writeDouble(double d) throws IOException {
byte[] buf = new byte[8];
int len = BinaryData.encodeDouble(d, buf, 0);
out.write(buf, 0, len);
  }
{code}

> Remove org.apache.avro.file.DataFileWriter Double Buffering
> ---
>
> Key: AVRO-2052
> URL: https://issues.apache.org/jira/browse/AVRO-2052
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2052.1.patch
>
>
> {code:title=org.apache.avro.file.DataFileWriter}
>   private void init(OutputStream outs) throws IOException {
> this.underlyingStream = outs;
> this.out = new BufferedFileOutputStream(outs);
> EncoderFactory efactory = new EncoderFactory();
> this.vout = efactory.binaryEncoder(out, null);
> dout.setSchema(schema);
> buffer = new NonCopyingByteArrayOutputStream(
> Math.min((int)(syncInterval * 1.25), Integer.MAX_VALUE/2 -1));
> this.bufOut = efactory.binaryEncoder(buffer, null);
> if (this.codec == null) {
>   this.codec = CodecFactory.nullCodec().createInstance();
> }
> this.isOpen = true;
>   }
> {code}
> It's clear here that both streams are writing to a buffered destination 
> ({{BufferedFileOutputStream}} and {{ByteArrayOutputStream}}), so there is 
> no reason for a buffered encoder; instead, write directly to the 
> buffered streams with {{directBinaryEncoder}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AVRO-2050) Clear Array To Allow GC

2017-07-18 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092098#comment-16092098
 ] 

BELUGA BEHR commented on AVRO-2050:
---

[~cutting] Thanks for the input.  I was simply mimicking the OpenJDK 
implementation.  I attached a new patch.  Either will work. :)

> Clear Array To Allow GC
> ---
>
> Key: AVRO-2050
> URL: https://issues.apache.org/jira/browse/AVRO-2050
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2050.1.patch, AVRO-2050.2.patch, AVRO-2050.3.patch
>
>
> Java's {{ArrayList}} implementation clears all Objects from the internal 
> buffer when the {{clear()}} method is called.  This allows the Objects to be 
> freed for GC.  We should do the same in Avro 
> {{org.apache.avro.generic.GenericData}}. 
> [ArrayList 
> Source|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/ArrayList.java#ArrayList.clear%28%29]
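A sketch of the change, assuming the Array keeps its contents in an 
{{Object[] elements}} field with a {{size}} counter (the field names are an 
assumption here); whether to null the slots with a loop, as {{ArrayList}} 
does, or with {{java.util.Arrays.fill}} is the "either will work" choice 
mentioned above:

{code:java|title=Nulling out references in clear()}
  @Override
  public void clear() {
    // Drop the references so the removed elements become eligible for GC
    Arrays.fill(elements, 0, size, null);
    size = 0;
  }
{code}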



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2050) Clear Array To Allow GC

2017-07-18 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2050:
--
Status: Patch Available  (was: Open)

> Clear Array To Allow GC
> ---
>
> Key: AVRO-2050
> URL: https://issues.apache.org/jira/browse/AVRO-2050
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2050.1.patch, AVRO-2050.2.patch, AVRO-2050.3.patch
>
>
> Java's {{ArrayList}} implementation clears all Objects from the internal 
> buffer when the {{clear()}} method is called.  This allows the Objects to be 
> freed for GC.  We should do the same in Avro 
> {{org.apache.avro.generic.GenericData}}. 
> [ArrayList 
> Source|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/ArrayList.java#ArrayList.clear%28%29]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2050) Clear Array To Allow GC

2017-07-18 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2050:
--
Status: Open  (was: Patch Available)

> Clear Array To Allow GC
> ---
>
> Key: AVRO-2050
> URL: https://issues.apache.org/jira/browse/AVRO-2050
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2050.1.patch, AVRO-2050.2.patch, AVRO-2050.3.patch
>
>
> Java's {{ArrayList}} implementation clears all Objects from the internal 
> buffer when the {{clear()}} method is called.  This allows the Objects to be 
> freed for GC.  We should do the same in Avro 
> {{org.apache.avro.generic.GenericData}}. 
> [ArrayList 
> Source|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/ArrayList.java#ArrayList.clear%28%29]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2050) Clear Array To Allow GC

2017-07-18 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2050:
--
Attachment: AVRO-2050.3.patch

> Clear Array To Allow GC
> ---
>
> Key: AVRO-2050
> URL: https://issues.apache.org/jira/browse/AVRO-2050
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2050.1.patch, AVRO-2050.2.patch, AVRO-2050.3.patch
>
>
> Java's {{ArrayList}} implementation clears all Objects from the internal 
> buffer when the {{clear()}} method is called.  This allows the Objects to be 
> freed for GC.  We should do the same in Avro 
> {{org.apache.avro.generic.GenericData}}. 
> [ArrayList 
> Source|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/ArrayList.java#ArrayList.clear%28%29]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2055) Remove Magic Value From org.apache.avro.hadoop.io.AvroSequenceFile

2017-07-18 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2055:
--
Status: Patch Available  (was: Open)

> Remove Magic Value From org.apache.avro.hadoop.io.AvroSequenceFile
> --
>
> Key: AVRO-2055
> URL: https://issues.apache.org/jira/browse/AVRO-2055
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2055.1.patch
>
>
> Remove the magic string _io.file.buffer.size_ and the constant 
> _DEFAULT_BUFFER_SIZE_BYTES_, and instead rely on the Hadoop libraries to 
> provide this information.  This will help keep Avro in sync with changes in 
> Hadoop.
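Hadoop already publishes both the key and its default, so a plausible sketch 
follows; treating {{CommonConfigurationKeysPublic}} as what the patch actually 
uses is an assumption (the demo class name is invented):

{code:java|title=BufferSizeDemo.java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CommonConfigurationKeysPublic;

public class BufferSizeDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // No local "io.file.buffer.size" literal and no private default constant.
    int bufferSize = conf.getInt(
        CommonConfigurationKeysPublic.IO_FILE_BUFFER_SIZE_KEY,
        CommonConfigurationKeysPublic.IO_FILE_BUFFER_SIZE_DEFAULT);
    System.out.println("buffer size = " + bufferSize);
  }
}
{code}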



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2055) Remove Magic Value From org.apache.avro.hadoop.io.AvroSequenceFile

2017-07-18 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2055:
--
Attachment: AVRO-2055.1.patch

> Remove Magic Value From org.apache.avro.hadoop.io.AvroSequenceFile
> --
>
> Key: AVRO-2055
> URL: https://issues.apache.org/jira/browse/AVRO-2055
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2055.1.patch
>
>
> Remove the magic string _io.file.buffer.size_ and the constant 
> _DEFAULT_BUFFER_SIZE_BYTES_, and instead rely on the Hadoop libraries to 
> provide this information.  This will help keep Avro in sync with changes in 
> Hadoop.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (AVRO-2055) Remove Magic Value From org.apache.avro.hadoop.io.AvroSequenceFile

2017-07-18 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR reassigned AVRO-2055:
-

Assignee: BELUGA BEHR

> Remove Magic Value From org.apache.avro.hadoop.io.AvroSequenceFile
> --
>
> Key: AVRO-2055
> URL: https://issues.apache.org/jira/browse/AVRO-2055
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2055.1.patch
>
>
> Remove the magic string _io.file.buffer.size_ and the constant 
> _DEFAULT_BUFFER_SIZE_BYTES_, and instead rely on the Hadoop libraries to 
> provide this information.  This will help keep Avro in sync with changes in 
> Hadoop.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AVRO-2055) Remove Magic Value From org.apache.avro.hadoop.io.AvroSequenceFile

2017-07-18 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created AVRO-2055:
-

 Summary: Remove Magic Value From 
org.apache.avro.hadoop.io.AvroSequenceFile
 Key: AVRO-2055
 URL: https://issues.apache.org/jira/browse/AVRO-2055
 Project: Avro
  Issue Type: Improvement
  Components: java
Affects Versions: 1.8.2, 1.7.7
Reporter: BELUGA BEHR
Priority: Trivial


Remove the magic string _io.file.buffer.size_ and the constant 
_DEFAULT_BUFFER_SIZE_BYTES_, and instead rely on the Hadoop libraries to provide 
this information.  This will help keep Avro in sync with changes in Hadoop.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2054) Use StringBuilder instead of StringBuffer

2017-07-18 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2054:
--
Status: Patch Available  (was: Open)

> Use StringBuilder instead of StringBuffer
> -
>
> Key: AVRO-2054
> URL: https://issues.apache.org/jira/browse/AVRO-2054
> Project: Avro
>  Issue Type: Improvement
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2054.1.patch
>
>
> Use the unsynchronized StringBuilder instead of StringBuffer.  Use _char_ 
> values instead of Strings.
> {code:title=org.apache.trevni.MetaData}
>   @Override public String toString() {
> StringBuffer buffer = new StringBuffer();
> buffer.append("{ ");
> for (Map.Entry<String,byte[]> e : entrySet()) {
>   buffer.append(e.getKey());
>   buffer.append("=");
>   try {
> buffer.append(new String(e.getValue(), "ISO-8859-1"));
>   } catch (java.io.UnsupportedEncodingException error) {
> throw new TrevniRuntimeException(error);
>   }
>   buffer.append(" ");
> }
> buffer.append("}");
> return buffer.toString();
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2054) Use StringBuilder instead of StringBuffer

2017-07-18 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2054:
--
Attachment: AVRO-2054.1.patch

> Use StringBuilder instead of StringBuffer
> -
>
> Key: AVRO-2054
> URL: https://issues.apache.org/jira/browse/AVRO-2054
> Project: Avro
>  Issue Type: Improvement
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2054.1.patch
>
>
> Use the unsynchronized StringBuilder instead of StringBuffer.  Use _char_ 
> values instead of Strings.
> {code:title=org.apache.trevni.MetaData}
>   @Override public String toString() {
> StringBuffer buffer = new StringBuffer();
> buffer.append("{ ");
> for (Map.Entry<String,byte[]> e : entrySet()) {
>   buffer.append(e.getKey());
>   buffer.append("=");
>   try {
> buffer.append(new String(e.getValue(), "ISO-8859-1"));
>   } catch (java.io.UnsupportedEncodingException error) {
> throw new TrevniRuntimeException(error);
>   }
>   buffer.append(" ");
> }
> buffer.append("}");
> return buffer.toString();
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (AVRO-2054) Use StringBuilder instead of StringBuffer

2017-07-18 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR reassigned AVRO-2054:
-

Assignee: BELUGA BEHR

> Use StringBuilder instead of StringBuffer
> -
>
> Key: AVRO-2054
> URL: https://issues.apache.org/jira/browse/AVRO-2054
> Project: Avro
>  Issue Type: Improvement
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2054.1.patch
>
>
> Use the unsynchronized StringBuilder instead of StringBuffer.  Use _char_ 
> values instead of Strings.
> {code:title=org.apache.trevni.MetaData}
>   @Override public String toString() {
> StringBuffer buffer = new StringBuffer();
> buffer.append("{ ");
> for (Map.Entry<String,byte[]> e : entrySet()) {
>   buffer.append(e.getKey());
>   buffer.append("=");
>   try {
> buffer.append(new String(e.getValue(), "ISO-8859-1"));
>   } catch (java.io.UnsupportedEncodingException error) {
> throw new TrevniRuntimeException(error);
>   }
>   buffer.append(" ");
> }
> buffer.append("}");
> return buffer.toString();
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AVRO-2054) Use StringBuilder instead of StringBuffer

2017-07-18 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created AVRO-2054:
-

 Summary: Use StringBuilder instead of StringBuffer
 Key: AVRO-2054
 URL: https://issues.apache.org/jira/browse/AVRO-2054
 Project: Avro
  Issue Type: Improvement
Affects Versions: 1.8.2, 1.7.7
Reporter: BELUGA BEHR
Priority: Trivial


Use the unsynchronized StringBuilder instead of StringBuffer.  Use _char_ 
values instead of single-character Strings.

{code:title=org.apache.trevni.MetaData}
  @Override public String toString() {
StringBuffer buffer = new StringBuffer();
buffer.append("{ ");
    for (Map.Entry<String, byte[]> e : entrySet()) {
  buffer.append(e.getKey());
  buffer.append("=");
  try {
buffer.append(new String(e.getValue(), "ISO-8859-1"));
  } catch (java.io.UnsupportedEncodingException error) {
throw new TrevniRuntimeException(error);
  }
  buffer.append(" ");
}
buffer.append("}");
return buffer.toString();
  }
{code}
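
For illustration, a minimal sketch of the proposed rewrite (a sketch, not necessarily the attached patch):

{code:title=Sketch of the proposed toString()}
  @Override public String toString() {
    // StringBuilder avoids StringBuffer's per-call synchronization.
    StringBuilder builder = new StringBuilder();
    builder.append("{ ");
    for (Map.Entry<String, byte[]> e : entrySet()) {
      builder.append(e.getKey());
      builder.append('=');  // char append instead of a one-character String
      try {
        builder.append(new String(e.getValue(), "ISO-8859-1"));
      } catch (java.io.UnsupportedEncodingException error) {
        throw new TrevniRuntimeException(error);
      }
      builder.append(' ');
    }
    builder.append('}');
    return builder.toString();
  }
{code}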



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2053) Remove Reference To Deprecated Property mapred.output.compression.type

2017-07-18 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2053:
--
Description: 
Avro utilizes 
[deprecated|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/DeprecatedProperties.html]
 property _mapred.output.compression.type_.  Update the code to use the MRv2 
property and avoid overriding default behaviors/settings.  Use the appropriate 
facilities from {{org.apache.hadoop.mapreduce.lib.output.FileOutputFormat}} and 
{{org.apache.hadoop.io.SequenceFile}}.

{code:title=org.apache.avro.mapreduce.AvroSequenceFileOutputFormat}
  /** Configuration key for storing the type of compression for the target 
sequence file. */
  private static final String CONF_COMPRESSION_TYPE = 
"mapred.output.compression.type";

  /** The default compression type for the target sequence file. */
  private static final CompressionType DEFAULT_COMPRESSION_TYPE = 
CompressionType.RECORD;
{code}

  was:
Avro utilizes 
[deprecated|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/DeprecatedProperties.html]
 property "mapred.output.compression.type".  Update code to use the MRv2 
property and don't override default behaviors/settings.  Use the appropriate 
facilities from {{org.apache.hadoop.mapreduce.lib.output.FileOutputFormat}} and 
{{org.apache.hadoop.io.SequenceFile}}.

{code:title=org.apache.avro.mapreduce.AvroSequenceFileOutputFormat}
  /** Configuration key for storing the type of compression for the target 
sequence file. */
  private static final String CONF_COMPRESSION_TYPE = 
"mapred.output.compression.type";

  /** The default compression type for the target sequence file. */
  private static final CompressionType DEFAULT_COMPRESSION_TYPE = 
CompressionType.RECORD;
{code}


> Remove Reference To Deprecated Property mapred.output.compression.type
> --
>
> Key: AVRO-2053
> URL: https://issues.apache.org/jira/browse/AVRO-2053
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2053.1.patch
>
>
> Avro utilizes 
> [deprecated|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/DeprecatedProperties.html]
>  property _mapred.output.compression.type_.  Update the code to use the MRv2 
> property and avoid overriding default behaviors/settings.  Use the appropriate 
> facilities from {{org.apache.hadoop.mapreduce.lib.output.FileOutputFormat}} 
> and {{org.apache.hadoop.io.SequenceFile}}.
> {code:title=org.apache.avro.mapreduce.AvroSequenceFileOutputFormat}
>   /** Configuration key for storing the type of compression for the target 
> sequence file. */
>   private static final String CONF_COMPRESSION_TYPE = 
> "mapred.output.compression.type";
>   /** The default compression type for the target sequence file. */
>   private static final CompressionType DEFAULT_COMPRESSION_TYPE = 
> CompressionType.RECORD;
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (AVRO-2053) Remove Reference To Deprecated Property mapred.output.compression.type

2017-07-18 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR reassigned AVRO-2053:
-

Assignee: BELUGA BEHR

> Remove Reference To Deprecated Property mapred.output.compression.type
> --
>
> Key: AVRO-2053
> URL: https://issues.apache.org/jira/browse/AVRO-2053
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2053.1.patch
>
>
> Avro utilizes 
> [deprecated|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/DeprecatedProperties.html]
>  property "mapred.output.compression.type".  Update code to use the MRv2 
> property and don't override default behaviors/settings.  Use the appropriate 
> facilities from {{org.apache.hadoop.mapreduce.lib.output.FileOutputFormat}} 
> and {{org.apache.hadoop.io.SequenceFile}}.
> {code:title=org.apache.avro.mapreduce.AvroSequenceFileOutputFormat}
>   /** Configuration key for storing the type of compression for the target 
> sequence file. */
>   private static final String CONF_COMPRESSION_TYPE = 
> "mapred.output.compression.type";
>   /** The default compression type for the target sequence file. */
>   private static final CompressionType DEFAULT_COMPRESSION_TYPE = 
> CompressionType.RECORD;
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2053) Remove Reference To Deprecated Property mapred.output.compression.type

2017-07-18 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2053:
--
Status: Patch Available  (was: Open)

> Remove Reference To Deprecated Property mapred.output.compression.type
> --
>
> Key: AVRO-2053
> URL: https://issues.apache.org/jira/browse/AVRO-2053
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2053.1.patch
>
>
> Avro utilizes 
> [deprecated|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/DeprecatedProperties.html]
>  property "mapred.output.compression.type".  Update code to use the MRv2 
> property and don't override default behaviors/settings.  Use the appropriate 
> facilities from {{org.apache.hadoop.mapreduce.lib.output.FileOutputFormat}} 
> and {{org.apache.hadoop.io.SequenceFile}}.
> {code:title=org.apache.avro.mapreduce.AvroSequenceFileOutputFormat}
>   /** Configuration key for storing the type of compression for the target 
> sequence file. */
>   private static final String CONF_COMPRESSION_TYPE = 
> "mapred.output.compression.type";
>   /** The default compression type for the target sequence file. */
>   private static final CompressionType DEFAULT_COMPRESSION_TYPE = 
> CompressionType.RECORD;
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2053) Remove Reference To Deprecated Property mapred.output.compression.type

2017-07-18 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2053:
--
Attachment: AVRO-2053.1.patch

> Remove Reference To Deprecated Property mapred.output.compression.type
> --
>
> Key: AVRO-2053
> URL: https://issues.apache.org/jira/browse/AVRO-2053
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2053.1.patch
>
>
> Avro utilizes 
> [deprecated|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/DeprecatedProperties.html]
>  property "mapred.output.compression.type".  Update code to use the MRv2 
> property and don't override default behaviors/settings.  Use the appropriate 
> facilities from {{org.apache.hadoop.mapreduce.lib.output.FileOutputFormat}} 
> and {{org.apache.hadoop.io.SequenceFile}}.
> {code:title=org.apache.avro.mapreduce.AvroSequenceFileOutputFormat}
>   /** Configuration key for storing the type of compression for the target 
> sequence file. */
>   private static final String CONF_COMPRESSION_TYPE = 
> "mapred.output.compression.type";
>   /** The default compression type for the target sequence file. */
>   private static final CompressionType DEFAULT_COMPRESSION_TYPE = 
> CompressionType.RECORD;
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AVRO-2053) Remove Reference To Deprecated Property mapred.output.compression.type

2017-07-18 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created AVRO-2053:
-

 Summary: Remove Reference To Deprecated Property 
mapred.output.compression.type
 Key: AVRO-2053
 URL: https://issues.apache.org/jira/browse/AVRO-2053
 Project: Avro
  Issue Type: Improvement
  Components: java
Affects Versions: 1.8.2, 1.7.7
Reporter: BELUGA BEHR
Priority: Trivial


Avro utilizes 
[deprecated|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/DeprecatedProperties.html]
 property "mapred.output.compression.type".  Update code to use the MRv2 
property and don't override default behaviors/settings.  Use the appropriate 
facilities from {{org.apache.hadoop.mapreduce.lib.output.FileOutputFormat}} and 
{{org.apache.hadoop.io.SequenceFile}}.

{code:title=org.apache.avro.mapreduce.AvroSequenceFileOutputFormat}
  /** Configuration key for storing the type of compression for the target 
sequence file. */
  private static final String CONF_COMPRESSION_TYPE = 
"mapred.output.compression.type";

  /** The default compression type for the target sequence file. */
  private static final CompressionType DEFAULT_COMPRESSION_TYPE = 
CompressionType.RECORD;
{code}
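
As a rough sketch (assuming the MRv2 helpers on 
{{org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat}}, which read 
the non-deprecated {{mapreduce.output.fileoutputformat.compress.type}} key and 
already default to {{CompressionType.RECORD}}), the hard-coded constants could 
give way to something like:

{code:title=Sketch}
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

  // Resolve the compression type from the MRv2 property instead of the
  // deprecated mapred.output.compression.type key; the framework already
  // supplies the RECORD default, so no local constant is needed.
  private static CompressionType getCompressionType(JobContext context) {
    return SequenceFileOutputFormat.getOutputCompressionType(context);
  }
{code}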



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings

2017-07-18 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR reassigned AVRO-2048:
-

Assignee: BELUGA BEHR

> Avro Binary Decoding - Gracefully Handle Long Strings
> -
>
> Key: AVRO-2048
> URL: https://issues.apache.org/jira/browse/AVRO-2048
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2048.1.patch
>
>
> According to the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]:
> bq. a string is encoded as a *long* followed by that many bytes of UTF-8 
> encoded character data.
> However, that is currently not being adhered to:
> {code:title=org.apache.avro.io.BinaryDecoder}
>   @Override
>   public Utf8 readString(Utf8 old) throws IOException {
> int length = readInt();
> Utf8 result = (old != null ? old : new Utf8());
> result.setByteLength(length);
> if (0 != length) {
>   doReadBytes(result.getBytes(), 0, length);
> }
> return result;
>   }
> {code}
> The first thing the code does here is to load an *int* value, not a *long*.  
> Because of the variable length nature of the size, this will mostly work.  
> However, there may be edge-cases where the serializer is putting in large 
> length values erroneously or nefariously. Let us gracefully detect such 
> scenarios and more closely adhere to the spec.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (AVRO-2052) Remove org.apache.avro.file.DataFileWriter Double Buffering

2017-07-18 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR reassigned AVRO-2052:
-

Assignee: BELUGA BEHR

> Remove org.apache.avro.file.DataFileWriter Double Buffering
> ---
>
> Key: AVRO-2052
> URL: https://issues.apache.org/jira/browse/AVRO-2052
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2052.1.patch
>
>
> {code:title=org.apache.avro.file.DataFileWriter}
>   private void init(OutputStream outs) throws IOException {
> this.underlyingStream = outs;
> this.out = new BufferedFileOutputStream(outs);
> EncoderFactory efactory = new EncoderFactory();
> this.vout = efactory.binaryEncoder(out, null);
> dout.setSchema(schema);
> buffer = new NonCopyingByteArrayOutputStream(
> Math.min((int)(syncInterval * 1.25), Integer.MAX_VALUE/2 -1));
> this.bufOut = efactory.binaryEncoder(buffer, null);
> if (this.codec == null) {
>   this.codec = CodecFactory.nullCodec().createInstance();
> }
> this.isOpen = true;
>   }
> {code}
> Both encoders here write to a buffered destination 
> ({{BufferedFileOutputStream}} and {{ByteArrayOutputStream}}), so there is 
> no need for a buffered encoder; instead, write directly to the 
> buffered streams with {{directBinaryEncoder}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2052) Remove org.apache.avro.file.DataFileWriter Double Buffering

2017-07-17 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2052:
--
Attachment: AVRO-2052.1.patch

Call {{directBinaryEncoder}} instead of the buffered {{binaryEncoder}}

> Remove org.apache.avro.file.DataFileWriter Double Buffering
> ---
>
> Key: AVRO-2052
> URL: https://issues.apache.org/jira/browse/AVRO-2052
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2052.1.patch
>
>
> {code:title=org.apache.avro.file.DataFileWriter}
>   private void init(OutputStream outs) throws IOException {
> this.underlyingStream = outs;
> this.out = new BufferedFileOutputStream(outs);
> EncoderFactory efactory = new EncoderFactory();
> this.vout = efactory.binaryEncoder(out, null);
> dout.setSchema(schema);
> buffer = new NonCopyingByteArrayOutputStream(
> Math.min((int)(syncInterval * 1.25), Integer.MAX_VALUE/2 -1));
> this.bufOut = efactory.binaryEncoder(buffer, null);
> if (this.codec == null) {
>   this.codec = CodecFactory.nullCodec().createInstance();
> }
> this.isOpen = true;
>   }
> {code}
> Both encoders here write to a buffered destination 
> ({{BufferedFileOutputStream}} and {{ByteArrayOutputStream}}), so there is 
> no need for a buffered encoder; instead, write directly to the 
> buffered streams with {{directBinaryEncoder}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AVRO-2052) Remove org.apache.avro.file.DataFileWriter Double Buffering

2017-07-17 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created AVRO-2052:
-

 Summary: Remove org.apache.avro.file.DataFileWriter Double 
Buffering
 Key: AVRO-2052
 URL: https://issues.apache.org/jira/browse/AVRO-2052
 Project: Avro
  Issue Type: Improvement
  Components: java
Affects Versions: 1.8.2, 1.7.7
Reporter: BELUGA BEHR
Priority: Trivial


{code:title=org.apache.avro.file.DataFileWriter}
  private void init(OutputStream outs) throws IOException {
this.underlyingStream = outs;
this.out = new BufferedFileOutputStream(outs);
EncoderFactory efactory = new EncoderFactory();
this.vout = efactory.binaryEncoder(out, null);
dout.setSchema(schema);
buffer = new NonCopyingByteArrayOutputStream(
Math.min((int)(syncInterval * 1.25), Integer.MAX_VALUE/2 -1));
this.bufOut = efactory.binaryEncoder(buffer, null);
if (this.codec == null) {
  this.codec = CodecFactory.nullCodec().createInstance();
}
this.isOpen = true;
  }
{code}

Both encoders here write to a buffered destination 
({{BufferedFileOutputStream}} and {{ByteArrayOutputStream}}), so there is no 
need for a buffered encoder; instead, write directly to the buffered 
streams with {{directBinaryEncoder}}.
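
A sketch of {{init}} after the change (assuming 
{{EncoderFactory.directBinaryEncoder}}, which adds no internal buffer of its 
own):

{code:title=Sketch}
  private void init(OutputStream outs) throws IOException {
    this.underlyingStream = outs;
    this.out = new BufferedFileOutputStream(outs);
    EncoderFactory efactory = new EncoderFactory();
    // Both destinations are already buffered, so skip the encoder-level
    // buffer and write straight through.
    this.vout = efactory.directBinaryEncoder(out, null);
    dout.setSchema(schema);
    buffer = new NonCopyingByteArrayOutputStream(
        Math.min((int)(syncInterval * 1.25), Integer.MAX_VALUE/2 -1));
    this.bufOut = efactory.directBinaryEncoder(buffer, null);
    if (this.codec == null) {
      this.codec = CodecFactory.nullCodec().createInstance();
    }
    this.isOpen = true;
  }
{code}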



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AVRO-1786) Strange IndexOutofBoundException in GenericDatumReader.readString

2017-07-17 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089823#comment-16089823
 ] 

BELUGA BEHR commented on AVRO-1786:
---

May be experiencing this issue as well; trying to collect more information...

> Strange IndexOutofBoundException in GenericDatumReader.readString
> -
>
> Key: AVRO-1786
> URL: https://issues.apache.org/jira/browse/AVRO-1786
> Project: Avro
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.7.4, 1.7.7
> Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64
> Use IBM JVM:
> IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References 
> 20140515_199835 (JIT enabled, AOT enabled)
>Reporter: Yong Zhang
>Priority: Minor
>
> Our production cluster is CENTOS 6.5 (2.6.32-358.14.1.el6.x86_64), running 
> IBM BigInsight V3.0.0.2. In Apache terms, it is Hadoop 2.2.0 with MRv1 (no 
> YARN), and it comes with AVRO 1.7.4, running on IBM J9 VM (build 2.7, JRE 
> 1.7.0 Linux amd64-64 Compressed References 20140515_199835 (JIT enabled, AOT 
> enabled). Not sure if the JDK matters, but it is NOT Oracle JVM.
> We have an ETL pipeline implemented as a chain of MR jobs. One MR job 
> merges 2 sets of AVRO data. Dataset1 is in HDFS location A, Dataset2 is 
> in HDFS location B, and both contain AVRO records bound to the same 
> AVRO schema. Each record contains a unique id field and a timestamp field. 
> The MR job merges the records based on the ID, uses the later 
> timestamp record to replace the earlier one, and emits the final 
> AVRO record. Very straightforward.
> Now we face a problem where one reducer keeps failing with the following 
> stacktrace on JobTracker:
> {code}
> java.lang.IndexOutOfBoundsException
>   at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:191)
>   at java.io.DataInputStream.read(DataInputStream.java:160)
>   at 
> org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:184)
>   at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
>   at 
> org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
>   at 
> org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:143)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:125)
>   at 
> org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:121)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
>   at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
>   at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
>   at 
> org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:108)
>   at 
> org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:48)
>   at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:142)
>   at 
> org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:117)
>   at 
> org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:297)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:165)
>   at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at 
> java.security.AccessController.doPrivileged(AccessController.java:366)
>   at javax.security.auth.Subject.doAs(Subject.java:572)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1502)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> {code}
> Here are my Mapper and Reducer methods:
> Mapper:
> public void map(AvroKey key, NullWritable value, Context 
> context) throws IOException, InterruptedException 
> Reducer:
> protected void reduce(CustomPartitionKeyClass key, 
> Iterable values, Context context) throws 
> IOException, InterruptedException 
> What bothers me are the following facts:
> 1) All the mappers finish without error
> 2) Most of the reducers finish without error, but one reducer keeps failing 
> with the above error.
> 3) It looks like it is caused by the data? But keep in mind that all the avro 
> records passed the mapper side, yet failed in one reducer. 
> 4) From the stacktrace, it looks like our reducer code was NOT invoked yet, 
> but failed 

[jira] [Updated] (AVRO-2050) Clear Array To Allow GC

2017-07-17 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2050:
--
Attachment: AVRO-2050.2.patch

[~nkollar]

This implementation is essentially an {{ArrayList}}.  {{ArrayList}} 
overrides the {{clear}} method because the default {{AbstractList}} 
implementation instantiates an Iterator and then deletes each item one at a 
time.  That performs poorly because of the constant stack manipulation, and 
it also amounts to draining the array from the head of the list: each 
removal requires an array copy to shift the remaining records down.  It is 
much better to override the method as {{ArrayList}} has done.  However, I 
did see some overlap with the {{toString}} and {{add}} methods which can be 
leveraged.  Changed the patch to remove those two overrides.

> Clear Array To Allow GC
> ---
>
> Key: AVRO-2050
> URL: https://issues.apache.org/jira/browse/AVRO-2050
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2050.1.patch, AVRO-2050.2.patch
>
>
> Java's {{ArrayList}} implementation clears all Objects from the internal 
> buffer when the {{clear()}} method is called.  This makes the Objects 
> eligible for GC.  We should do the same in Avro's 
> {{org.apache.avro.generic.GenericData}}. 
> [ArrayList 
> Source|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/ArrayList.java#ArrayList.clear%28%29]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2050) Clear Array To Allow GC

2017-07-16 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2050:
--
Status: Patch Available  (was: Open)

> Clear Array To Allow GC
> ---
>
> Key: AVRO-2050
> URL: https://issues.apache.org/jira/browse/AVRO-2050
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2050.1.patch
>
>
> Java's {{ArrayList}} implementation clears all Objects from the internal 
> buffer when the {{clear()}} method is called.  This makes the Objects 
> eligible for GC.  We should do the same in Avro's 
> {{org.apache.avro.generic.GenericData}}. 
> [ArrayList 
> Source|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/ArrayList.java#ArrayList.clear%28%29]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2050) Clear Array To Allow GC

2017-07-16 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2050:
--
Attachment: (was: AVRO-2050.1.patch)

> Clear Array To Allow GC
> ---
>
> Key: AVRO-2050
> URL: https://issues.apache.org/jira/browse/AVRO-2050
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2050.1.patch
>
>
> Java's {{ArrayList}} implementation clears all Objects from the internal 
> buffer when the {{clear()}} method is called.  This makes the Objects 
> eligible for GC.  We should do the same in Avro's 
> {{org.apache.avro.generic.GenericData}}. 
> [ArrayList 
> Source|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/ArrayList.java#ArrayList.clear%28%29]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2050) Clear Array To Allow GC

2017-07-16 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2050:
--
Attachment: AVRO-2050.1.patch

> Clear Array To Allow GC
> ---
>
> Key: AVRO-2050
> URL: https://issues.apache.org/jira/browse/AVRO-2050
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2050.1.patch
>
>
> Java's {{ArrayList}} implementation clears all Objects from the internal 
> buffer when the {{clear()}} method is called.  This makes the Objects 
> eligible for GC.  We should do the same in Avro's 
> {{org.apache.avro.generic.GenericData}}. 
> [ArrayList 
> Source|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/ArrayList.java#ArrayList.clear%28%29]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2050) Clear Array To Allow GC

2017-07-16 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2050:
--
Attachment: AVRO-2050.1.patch

> Clear Array To Allow GC
> ---
>
> Key: AVRO-2050
> URL: https://issues.apache.org/jira/browse/AVRO-2050
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2050.1.patch
>
>
> Java's {{ArrayList}} implementation clears all Objects from the internal 
> buffer when the {{clear()}} method is called.  This makes the Objects 
> eligible for GC.  We should do the same in Avro's 
> {{org.apache.avro.generic.GenericData}}. 
> [ArrayList 
> Source|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/ArrayList.java#ArrayList.clear%28%29]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AVRO-2050) Clear Array To Allow GC

2017-07-16 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created AVRO-2050:
-

 Summary: Clear Array To Allow GC
 Key: AVRO-2050
 URL: https://issues.apache.org/jira/browse/AVRO-2050
 Project: Avro
  Issue Type: Improvement
  Components: java
Affects Versions: 1.8.2, 1.7.7
Reporter: BELUGA BEHR
Priority: Minor


Java's {{ArrayList}} implementation clears all Objects from the internal buffer 
when the {{clear()}} method is called.  This makes the Objects eligible for 
GC.  We should do the same in Avro's {{org.apache.avro.generic.GenericData}}. 

[ArrayList 
Source|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/ArrayList.java#ArrayList.clear%28%29]
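
A minimal sketch of the proposed override, mirroring {{ArrayList.clear}} 
(assuming the array's internal {{elements}} buffer and {{size}} fields):

{code:title=Sketch}
  @Override
  public void clear() {
    // Null out retained references so they become eligible for GC,
    // instead of relying on AbstractList's iterator-based removal.
    java.util.Arrays.fill(elements, 0, size, null);
    size = 0;
  }
{code}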



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-16 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2049:
--
Attachment: AVRO-2049.2.patch

Updated patch to include a couple of examples of the same superfluous execution

> Remove Superfluous Configuration From AvroSerializer
> 
>
> Key: AVRO-2049
> URL: https://issues.apache.org/jira/browse/AVRO-2049
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch
>
>
> In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the 
> Avro block size is configured with a hard-coded value and there is a request 
> to benchmark different buffer sizes.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   /**
>* The block size for the Avro encoder.
>*
>* This number was copied from the AvroSerialization of 
> org.apache.avro.mapred in Avro 1.5.1.
>*
>* TODO(gwu): Do some benchmarking with different numbers here to see if it 
> is important.
>*/
>   private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;
>   /** A factory for creating Avro datum encoders. */
>   private static EncoderFactory mEncoderFactory
>   = new 
> EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
> {code}
> However, there is no need to benchmark: this setting is superfluous and is 
> ignored by the current implementation.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   @Override
>   public void open(OutputStream outputStream) throws IOException {
> mOutputStream = outputStream;
> mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
>   }
> {code}
> {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  
> This setting is only relevant for calls to 
> {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}}, 
> which honors the configured block size when binary-encoding blocked Array 
> types, as laid out in the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  
> The setting can simply be removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-16 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089211#comment-16089211
 ] 

BELUGA BEHR edited comment on AVRO-2049 at 7/17/17 1:21 AM:


Updated patch to include a couple of other instances of the same superfluous 
execution


was (Author: belugabehr):
Updated patch to include a couple of examples of the same superfluous execution

> Remove Superfluous Configuration From AvroSerializer
> 
>
> Key: AVRO-2049
> URL: https://issues.apache.org/jira/browse/AVRO-2049
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch
>
>
> In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the 
> Avro block size is configured with a hard-coded value and there is a request 
> to benchmark different buffer sizes.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   /**
>* The block size for the Avro encoder.
>*
>* This number was copied from the AvroSerialization of 
> org.apache.avro.mapred in Avro 1.5.1.
>*
>* TODO(gwu): Do some benchmarking with different numbers here to see if it 
> is important.
>*/
>   private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;
>   /** A factory for creating Avro datum encoders. */
>   private static EncoderFactory mEncoderFactory
>   = new 
> EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
> {code}
> However, there is no need to benchmark: this setting is superfluous and is 
> ignored by the current implementation.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   @Override
>   public void open(OutputStream outputStream) throws IOException {
> mOutputStream = outputStream;
> mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
>   }
> {code}
> {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  
> This setting is only relevant for calls to 
> {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}}, 
> which honors the configured block size when binary-encoding blocked Array 
> types, as laid out in the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  
> The setting can simply be removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-16 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2049:
--
Status: Patch Available  (was: Open)

> Remove Superfluous Configuration From AvroSerializer
> 
>
> Key: AVRO-2049
> URL: https://issues.apache.org/jira/browse/AVRO-2049
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2049.1.patch
>
>
> In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the 
> Avro block size is configured with a hard-coded value and there is a request 
> to benchmark different buffer sizes.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   /**
>* The block size for the Avro encoder.
>*
>* This number was copied from the AvroSerialization of 
> org.apache.avro.mapred in Avro 1.5.1.
>*
>* TODO(gwu): Do some benchmarking with different numbers here to see if it 
> is important.
>*/
>   private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;
>   /** A factory for creating Avro datum encoders. */
>   private static EncoderFactory mEncoderFactory
>   = new 
> EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
> {code}
> However, there is no need to benchmark: this setting is superfluous and is 
> ignored by the current implementation.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   @Override
>   public void open(OutputStream outputStream) throws IOException {
> mOutputStream = outputStream;
> mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
>   }
> {code}
> {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  
> This setting is only relevant for calls to 
> {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}}, 
> which honors the configured block size when binary-encoding blocked Array 
> types, as laid out in the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  
> The setting can simply be removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-16 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created AVRO-2049:
-

 Summary: Remove Superfluous Configuration From AvroSerializer
 Key: AVRO-2049
 URL: https://issues.apache.org/jira/browse/AVRO-2049
 Project: Avro
  Issue Type: Improvement
  Components: java
Affects Versions: 1.8.2, 1.7.7
Reporter: BELUGA BEHR
Priority: Trivial


In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the Avro 
block size is configured with a hard-coded value and there is a request to 
benchmark different buffer sizes.

{code:title=org.apache.avro.hadoop.io.AvroSerializer}
  /**
   * The block size for the Avro encoder.
   *
   * This number was copied from the AvroSerialization of 
org.apache.avro.mapred in Avro 1.5.1.
   *
   * TODO(gwu): Do some benchmarking with different numbers here to see if it 
is important.
   */
  private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;

  /** A factory for creating Avro datum encoders. */
  private static EncoderFactory mEncoderFactory
  = new EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
{code}

However, there is no need to benchmark: this setting is superfluous and is 
ignored by the current implementation.

{code:title=org.apache.avro.hadoop.io.AvroSerializer}
  @Override
  public void open(OutputStream outputStream) throws IOException {
mOutputStream = outputStream;
mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
  }
{code}

{{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  This 
setting is only relevant for calls to 
{{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}}, 
which honors the configured block size when binary-encoding blocked Array 
types, as laid out in the 
[specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  The 
setting can simply be removed.
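
A sketch of the simplification (assuming the shared default factory returned by 
{{EncoderFactory.get()}}):

{code:title=Sketch}
  /** A factory for creating Avro datum encoders; configureBlockSize() is
      dropped because binaryEncoder() never consults it. */
  private static final EncoderFactory mEncoderFactory = EncoderFactory.get();
{code}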



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings

2017-07-14 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2048:
--
Description: 
According to the 
[specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]:

bq. a string is encoded as a *long* followed by that many bytes of UTF-8 
encoded character data.

However, that is currently not being adhered to:

{code:title=org.apache.avro.io.BinaryDecoder}
  @Override
  public Utf8 readString(Utf8 old) throws IOException {
int length = readInt();
Utf8 result = (old != null ? old : new Utf8());
result.setByteLength(length);
if (0 != length) {
  doReadBytes(result.getBytes(), 0, length);
}
return result;
  }
{code}

The first thing the code does here is to load an *int* value, not a *long*.  
Because of the variable length nature of the size, this will mostly work.  
However, there may be edge-cases where the serializer is putting in large 
length values erroneously or nefariously. Let us gracefully detect such 
scenarios and more closely adhere to the spec.

  was:
According to the 
[specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]:

bq. a string is encoded as a *long* followed by that many bytes of UTF-8 
encoded character data.

However, that is currently not being adhered to:

{code:title=org.apache.avro.io.BinaryDecoder}
  @Override
  public Utf8 readString(Utf8 old) throws IOException {
int length = readInt();
Utf8 result = (old != null ? old : new Utf8());
result.setByteLength(length);
if (0 != length) {
  doReadBytes(result.getBytes(), 0, length);
}
return result;
  }
{code}

The first thing the code does here is to load an *int* value, not a *long*.  
Because of the variable length nature of the size, this will mostly work.  
However, there may be edge-cases where this is broken and the serializer is 
putting in large values erroneously or nefariously. Let us gracefully handle to 
detect such scenarios and more closely adhere to the spec.


> Avro Binary Decoding - Gracefully Handle Long Strings
> -
>
> Key: AVRO-2048
> URL: https://issues.apache.org/jira/browse/AVRO-2048
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2048.1.patch
>
>
> According to the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]:
> bq. a string is encoded as a *long* followed by that many bytes of UTF-8 
> encoded character data.
> However, that is currently not being adhered to:
> {code:title=org.apache.avro.io.BinaryDecoder}
>   @Override
>   public Utf8 readString(Utf8 old) throws IOException {
> int length = readInt();
> Utf8 result = (old != null ? old : new Utf8());
> result.setByteLength(length);
> if (0 != length) {
>   doReadBytes(result.getBytes(), 0, length);
> }
> return result;
>   }
> {code}
> The first thing the code does here is to load an *int* value, not a *long*.  
> Because of the variable length nature of the size, this will mostly work.  
> However, there may be edge-cases where the serializer is putting in large 
> length values erroneously or nefariously. Let us gracefully detect such 
> scenarios and more closely adhere to the spec.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings

2017-07-14 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2048:
--
Attachment: AVRO-2048.1.patch

> Avro Binary Decoding - Gracefully Handle Long Strings
> -
>
> Key: AVRO-2048
> URL: https://issues.apache.org/jira/browse/AVRO-2048
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2048.1.patch
>
>
> According to the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]:
> bq. a string is encoded as a *long* followed by that many bytes of UTF-8 
> encoded character data.
> However, that is currently not being adhered to:
> {code:title=org.apache.avro.io.BinaryDecoder}
>   @Override
>   public Utf8 readString(Utf8 old) throws IOException {
> int length = readInt();
> Utf8 result = (old != null ? old : new Utf8());
> result.setByteLength(length);
> if (0 != length) {
>   doReadBytes(result.getBytes(), 0, length);
> }
> return result;
>   }
> {code}
> The first thing the code does here is to load an *int* value, not a *long*.  
> Because of the variable length nature of the size, this will mostly work.  
> However, there may be edge-cases where the serializer is putting in large 
> length values erroneously or nefariously. Let us gracefully detect such 
> scenarios and more closely adhere to the spec.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings

2017-07-14 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2048:
--
Status: Patch Available  (was: Open)

> Avro Binary Decoding - Gracefully Handle Long Strings
> -
>
> Key: AVRO-2048
> URL: https://issues.apache.org/jira/browse/AVRO-2048
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2048.1.patch
>
>
> According to the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]:
> bq. a string is encoded as a *long* followed by that many bytes of UTF-8 
> encoded character data.
> However, that is currently not being adhered to:
> {code:title=org.apache.avro.io.BinaryDecoder}
>   @Override
>   public Utf8 readString(Utf8 old) throws IOException {
> int length = readInt();
> Utf8 result = (old != null ? old : new Utf8());
> result.setByteLength(length);
> if (0 != length) {
>   doReadBytes(result.getBytes(), 0, length);
> }
> return result;
>   }
> {code}
> The first thing the code does here is to load an *int* value, not a *long*.  
> Because of the variable length nature of the size, this will mostly work.  
> However, there may be edge-cases where the serializer is putting in large 
> length values erroneously or nefariously. Let us gracefully detect such 
> scenarios and more closely adhere to the spec.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings

2017-07-14 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created AVRO-2048:
-

 Summary: Avro Binary Decoding - Gracefully Handle Long Strings
 Key: AVRO-2048
 URL: https://issues.apache.org/jira/browse/AVRO-2048
 Project: Avro
  Issue Type: Improvement
  Components: java
Affects Versions: 1.8.2, 1.7.7
Reporter: BELUGA BEHR
Priority: Minor


According to the 
[specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]:

bq. a string is encoded as a *long* followed by that many bytes of UTF-8 
encoded character data.

However, that is currently not being adhered to:

{code:title=org.apache.avro.io.BinaryDecoder}
  @Override
  public Utf8 readString(Utf8 old) throws IOException {
int length = readInt();
Utf8 result = (old != null ? old : new Utf8());
result.setByteLength(length);
if (0 != length) {
  doReadBytes(result.getBytes(), 0, length);
}
return result;
  }
{code}

The first thing the code does here is to load an *int* value, not a *long*.  
Because of the variable length nature of the size, this will mostly work.  
However, there may be edge-cases where the serializer is putting in large 
length values erroneously or nefariously. Let us gracefully detect such 
scenarios and more closely adhere to the spec.
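
One plausible shape for the fix, as a sketch: read the length as a *long* and 
fail fast when it cannot fit a Java array, which is int-indexed.

{code:title=Sketch}
  @Override
  public Utf8 readString(Utf8 old) throws IOException {
    long length = readLong();
    // Reject sizes that cannot be represented in an int-indexed buffer.
    if (length < 0L || length > Integer.MAX_VALUE) {
      throw new UnsupportedOperationException(
          "Cannot read strings longer than " + Integer.MAX_VALUE + " bytes");
    }
    Utf8 result = (old != null ? old : new Utf8());
    result.setByteLength((int) length);
    if (0 != length) {
      doReadBytes(result.getBytes(), 0, (int) length);
    }
    return result;
  }
{code}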



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)