[jira] [Updated] (AVRO-2058) ReflectData#isNonStringMap returns true for Utf8 keys

2017-07-26 Thread Sam Schlegel (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Schlegel updated AVRO-2058:
---
Status: Patch Available  (was: Open)

> ReflectData#isNonStringMap returns true for Utf8 keys
> -
>
> Key: AVRO-2058
> URL: https://issues.apache.org/jira/browse/AVRO-2058
> Project: Avro
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.8.2
>Reporter: Sam Schlegel
>Priority: Critical
> Attachments: AVRO-2058.patch
>
>
> Since {{Utf8}} does not have a {{Stringable}} annotation, and is not in 
> {{SpecificData#stringableClasses}}, {{ReflectData#isNonStringMap}} returns 
> true. This also causes {{ReflectData#isArray}} to return true for maps with 
> Utf8 keys, and thus {{GenericData#resolveUnion}} fails as well. This 
> ultimately causes {{ReflectData#write}} to fail for schemas that contain a 
> union that contains a map, where the data uses Utf8 for strings.
> The following test case reproduces the issue:
> {code:java}
>   @Test public void testUnionWithMapWithUtf8Keys() {
> Schema s = new Schema.Parser().parse
>   ("[\"null\", {\"type\":\"map\",\"values\":\"float\"}]");
> GenericData data = ReflectData.get();
> HashMap map = new HashMap();
> map.put(new Utf8("foo"), 1.0f);
> assertEquals(1, data.resolveUnion(s, map));
>   }
> {code}
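The failing check described above can be sketched outside of Avro as a minimal, self-contained model (all names below are hypothetical stand-ins, not the real Avro classes): a map counts as string-keyed only when its keys are of a recognized string type, so a {{Utf8}}-style wrapper falls through to the non-string-map path.

```java
import java.util.HashMap;
import java.util.Map;

public class NonStringMapCheck {
    // Minimal stand-in for org.apache.avro.util.Utf8: a string wrapper
    // that is not a java.lang.String
    static final class Utf8 {
        final String value;
        Utf8(String value) { this.value = value; }
    }

    // Only java.lang.String keys are recognized; the Utf8 stand-in is not,
    // which models the behavior reported in this issue
    static boolean isStringKey(Object key) {
        return key instanceof String;
    }

    // A map with any non-String key is classified as a "non-string map"
    static boolean isNonStringMap(Object datum) {
        if (!(datum instanceof Map)) return false;
        for (Object k : ((Map<?, ?>) datum).keySet()) {
            if (!isStringKey(k)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Object, Float> map = new HashMap<>();
        map.put(new Utf8("foo"), 1.0f);
        System.out.println(isNonStringMap(map)); // prints "true"
    }
}
```

In the real code path this classification feeds {{isArray}} and union resolution, which is why the map then fails to match the map branch of the union.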



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2058) ReflectData#isNonStringMap returns true for Utf8 keys

2017-07-26 Thread Sam Schlegel (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Schlegel updated AVRO-2058:
---
Description: 
Since {{Utf8}} does not have a {{Stringable}} annotation, and is not in 
{{SpecificData#stringableClasses}}, {{ReflectData#isNonStringMap}} returns 
true. This also causes {{ReflectData#isArray}} to return true for maps with 
Utf8 keys, and thus {{GenericData#resolveUnion}} fails as well. This ultimately 
causes {{ReflectData#write}} to fail for schemas that contain a union that 
contains a map, where the data uses Utf8 for strings.

The following test case reproduces the issue:

{code:java}
  @Test public void testUnionWithMapWithUtf8Keys() {
    Schema s = new Schema.Parser().parse
      ("[\"null\", {\"type\":\"map\",\"values\":\"float\"}]");
    GenericData data = ReflectData.get();
    HashMap map = new HashMap();
    map.put(new Utf8("foo"), 1.0f);
    assertEquals(1, data.resolveUnion(s, map));
  }
{code}

  was:Since {{Utf8}} does not have an {{Stringable}} notation, and is not in 
{{SpecificData#stringableClasses}}, {{ReflectData#isNonStringMap}} returns 
true. This also causes {{ReflectData#isArray}} to return true for maps with 
Utf8 keys, and thus {{GenericData#resolveUnion}} fails as well. This ultimately 
causes {{ReflectData#write}} to fail for schemas that contain a union that 
contains a map, where the data uses Utf8 for strings.


> ReflectData#isNonStringMap returns true for Utf8 keys
> -
>
> Key: AVRO-2058
> URL: https://issues.apache.org/jira/browse/AVRO-2058
> Project: Avro
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.8.2
>Reporter: Sam Schlegel
>Priority: Critical
>
> Since {{Utf8}} does not have a {{Stringable}} annotation, and is not in 
> {{SpecificData#stringableClasses}}, {{ReflectData#isNonStringMap}} returns 
> true. This also causes {{ReflectData#isArray}} to return true for maps with 
> Utf8 keys, and thus {{GenericData#resolveUnion}} fails as well. This 
> ultimately causes {{ReflectData#write}} to fail for schemas that contain a 
> union that contains a map, where the data uses Utf8 for strings.
> The following test case reproduces the issue:
> {code:java}
>   @Test public void testUnionWithMapWithUtf8Keys() {
> Schema s = new Schema.Parser().parse
>   ("[\"null\", {\"type\":\"map\",\"values\":\"float\"}]");
> GenericData data = ReflectData.get();
> HashMap map = new HashMap();
> map.put(new Utf8("foo"), 1.0f);
> assertEquals(1, data.resolveUnion(s, map));
>   }
> {code}





[jira] [Commented] (AVRO-1855) Avro-mapred not evaluating map schema correctly when values are expected to be strings

2017-07-26 Thread Sam Schlegel (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102608#comment-16102608
 ] 

Sam Schlegel commented on AVRO-1855:


Upon further inspection I believe this might be due to AVRO-2058

> Avro-mapred not evaluating map schema correctly when values are expected to 
> be strings
> --
>
> Key: AVRO-1855
> URL: https://issues.apache.org/jira/browse/AVRO-1855
> Project: Avro
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.8.0
>Reporter: Mikko Kupsu
>Priority: Critical
> Attachments: 20160530_AVRO-1855.patch
>
>
> When reading a bunch of Avro files and concatenating them using avro-mapred, 
> there is an issue with the following schema definition line:
> {code}
> {"name": "headers", "type": ["null", {"type": "map", "values": "string"}]},
> {code}
> The following exceptions are thrown:
> {code}
> Caused by: org.apache.avro.UnresolvedUnionException: Not in union 
> ["null",{"type":"map","values":"string"}]: {range=bytes=91553252-91557347, 
> accept=*/*, response_status_code=206, host=108.175.39.172}
>   at 
> org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:709)
>   at 
> org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:192)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:110)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:73)
>   at 
> org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:150)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:153)
>   at 
> org.apache.avro.reflect.ReflectDatumWriter.writeField(ReflectDatumWriter.java:182)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:143)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:105)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:73)
>   at 
> org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:150)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:60)
>   at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:302)
> {code}
> I've fixed this in my own [GitHub 
> fork|https://github.com/mikkokupsu/avro/tree/hotfix/20160530/avro-schema-map-string-problem]
>  and I've attached the patch too.





[jira] [Comment Edited] (AVRO-1855) Avro-mapred not evaluating map schema correctly when values are expected to be strings

2017-07-26 Thread Sam Schlegel (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102608#comment-16102608
 ] 

Sam Schlegel edited comment on AVRO-1855 at 7/27/17 2:11 AM:
-

Upon further inspection I believe this might be due to AVRO-2058.


was (Author: samschlegel):
Upon further inspection I believe might be due to AVRO-2058

> Avro-mapred not evaluating map schema correctly when values are expected to 
> be strings
> --
>
> Key: AVRO-1855
> URL: https://issues.apache.org/jira/browse/AVRO-1855
> Project: Avro
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.8.0
>Reporter: Mikko Kupsu
>Priority: Critical
> Attachments: 20160530_AVRO-1855.patch
>
>
> When reading a bunch of Avro files and concatenating them using avro-mapred, 
> there is an issue with the following schema definition line:
> {code}
> {"name": "headers", "type": ["null", {"type": "map", "values": "string"}]},
> {code}
> The following exceptions are thrown:
> {code}
> Caused by: org.apache.avro.UnresolvedUnionException: Not in union 
> ["null",{"type":"map","values":"string"}]: {range=bytes=91553252-91557347, 
> accept=*/*, response_status_code=206, host=108.175.39.172}
>   at 
> org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:709)
>   at 
> org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:192)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:110)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:73)
>   at 
> org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:150)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:153)
>   at 
> org.apache.avro.reflect.ReflectDatumWriter.writeField(ReflectDatumWriter.java:182)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:143)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:105)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:73)
>   at 
> org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:150)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:60)
>   at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:302)
> {code}
> I've fixed this in my own [GitHub 
> fork|https://github.com/mikkokupsu/avro/tree/hotfix/20160530/avro-schema-map-string-problem]
>  and I've attached the patch too.





[jira] [Updated] (AVRO-2058) ReflectData#isNonStringMap returns true for Utf8 keys

2017-07-26 Thread Sam Schlegel (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Schlegel updated AVRO-2058:
---
Description: Since {{Utf8}} does not have a {{Stringable}} annotation, and 
is not in {{SpecificData#stringableClasses}}, {{ReflectData#isNonStringMap}} 
returns true. This also causes {{ReflectData#isArray}} to return true for maps 
with Utf8 keys, and thus {{GenericData#resolveUnion}} fails as well. This 
ultimately causes {{ReflectData#write}} to fail for schemas that contain a 
union that contains a map, where the data uses Utf8 for strings.  (was: Since 
{{org.apache.avro.util.Utf8}} does not have an 
{{org.apache.reflect.Stringable}} notation, and is not in 
{{org.apache.avro.specific.SpecificData#stringableClasses}}, 
{{ReflectData#isNonStringMap}} returns true. This also causes 
{{ReflectData#isArray}} to return true for maps with Utf8 keys, and thus 
{{GenericData#resolveUnion}} fails as well. This ultimately causes 
{{ReflectData#write}} to fail for schemas that contain a union that contains a 
map, where the data uses Utf8 for strings.)
Summary: ReflectData#isNonStringMap returns true for Utf8 keys  (was: 
ReflectData#isNonStringMap returns true for org.apache.avro.util.Utf8 keys)

> ReflectData#isNonStringMap returns true for Utf8 keys
> -
>
> Key: AVRO-2058
> URL: https://issues.apache.org/jira/browse/AVRO-2058
> Project: Avro
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.8.2
>Reporter: Sam Schlegel
>Priority: Critical
>
> Since {{Utf8}} does not have a {{Stringable}} annotation, and is not in 
> {{SpecificData#stringableClasses}}, {{ReflectData#isNonStringMap}} returns 
> true. This also causes {{ReflectData#isArray}} to return true for maps with 
> Utf8 keys, and thus {{GenericData#resolveUnion}} fails as well. This 
> ultimately causes {{ReflectData#write}} to fail for schemas that contain a 
> union that contains a map, where the data uses Utf8 for strings.





[jira] [Created] (AVRO-2058) ReflectData#isNonStringMap returns true for org.apache.avro.util.Utf8 keys

2017-07-26 Thread Sam Schlegel (JIRA)
Sam Schlegel created AVRO-2058:
--

 Summary: ReflectData#isNonStringMap returns true for 
org.apache.avro.util.Utf8 keys
 Key: AVRO-2058
 URL: https://issues.apache.org/jira/browse/AVRO-2058
 Project: Avro
  Issue Type: Bug
  Components: java
Affects Versions: 1.8.2
Reporter: Sam Schlegel
Priority: Critical


Since {{org.apache.avro.util.Utf8}} does not have an 
{{org.apache.avro.reflect.Stringable}} annotation, and is not in 
{{org.apache.avro.specific.SpecificData#stringableClasses}}, 
{{ReflectData#isNonStringMap}} returns true. This also causes 
{{ReflectData#isArray}} to return true for maps with Utf8 keys, and thus 
{{GenericData#resolveUnion}} fails as well. This ultimately causes 
{{ReflectData#write}} to fail for schemas that contain a union that contains a 
map, where the data uses Utf8 for strings.





[jira] [Commented] (AVRO-1855) Avro-mapred not evaluating map schema correctly when values are expected to be strings

2017-07-26 Thread Sam Schlegel (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102544#comment-16102544
 ] 

Sam Schlegel commented on AVRO-1855:


I believe this is related to AVRO-966 and is caused by the {{isArray}} check 
coming before {{isMap}} in 
{{org.apache.avro.generic.GenericData#getSchemaName}}. I'm not sure why, but 
{{isArray(datum)}} is returning true when {{datum}} is a {{java.util.HashMap}}, 
even though {{java.util.HashMap}} is not an instance of 
{{java.util.Collection}} and {{datum instanceof Collection}} returns false when 
evaluated in the debugger.
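A minimal sketch (hypothetical names, not the Avro source) of why the check order matters: if an overly broad {{isArray}} is consulted before {{isMap}}, a {{HashMap}} is classified as an array before the map branch is ever reached.

```java
import java.util.Collection;
import java.util.Map;

public class DispatchOrder {
    // Overly broad isArray, standing in for the behavior described above:
    // it matches maps as well as collections
    static boolean isArray(Object datum) {
        return datum instanceof Collection || datum instanceof Map;
    }

    static boolean isMap(Object datum) {
        return datum instanceof Map;
    }

    // Mirrors getSchemaName-style dispatch: the first matching branch wins
    static String schemaName(Object datum) {
        if (isArray(datum)) return "array"; // consulted first, so a HashMap lands here
        if (isMap(datum)) return "map";     // never reached for maps
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(schemaName(new java.util.HashMap<String, Float>())); // prints "array"
    }
}
```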

> Avro-mapred not evaluating map schema correctly when values are expected to 
> be strings
> --
>
> Key: AVRO-1855
> URL: https://issues.apache.org/jira/browse/AVRO-1855
> Project: Avro
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.8.0
>Reporter: Mikko Kupsu
>Priority: Critical
> Attachments: 20160530_AVRO-1855.patch
>
>
> When reading a bunch of Avro files and concatenating them using avro-mapred, 
> there is an issue with the following schema definition line:
> {code}
> {"name": "headers", "type": ["null", {"type": "map", "values": "string"}]},
> {code}
> The following exceptions are thrown:
> {code}
> Caused by: org.apache.avro.UnresolvedUnionException: Not in union 
> ["null",{"type":"map","values":"string"}]: {range=bytes=91553252-91557347, 
> accept=*/*, response_status_code=206, host=108.175.39.172}
>   at 
> org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:709)
>   at 
> org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:192)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:110)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:73)
>   at 
> org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:150)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:153)
>   at 
> org.apache.avro.reflect.ReflectDatumWriter.writeField(ReflectDatumWriter.java:182)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:143)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:105)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:73)
>   at 
> org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:150)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:60)
>   at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:302)
> {code}
> I've fixed this in my own [GitHub 
> fork|https://github.com/mikkokupsu/avro/tree/hotfix/20160530/avro-schema-map-string-problem]
>  and I've attached the patch too.





[jira] [Updated] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2056:
--
Status: Patch Available  (was: In Progress)

> DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
> ---
>
> Key: AVRO-2056
> URL: https://issues.apache.org/jira/browse/AVRO-2056
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2056.1.patch
>
>
> Each call to {{writeDouble}} creates a new buffer and promptly throws it away, 
> even though the class has a re-usable buffer that is used in other methods 
> such as {{writeFloat}}.  Remove this extra buffer.
> {code:title=org.apache.avro.io.DirectBinaryEncoder}
>   // the buffer is used for writing floats, doubles, and large longs.
>   private final byte[] buf = new byte[12];
>   @Override
>   public void writeFloat(float f) throws IOException {
>     int len = BinaryData.encodeFloat(f, buf, 0);
>     out.write(buf, 0, len);
>   }
>   @Override
>   public void writeDouble(double d) throws IOException {
>     byte[] buf = new byte[8];
>     int len = BinaryData.encodeDouble(d, buf, 0);
>     out.write(buf, 0, len);
>   }
> {code}
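A self-contained sketch of the proposed fix (hypothetical class, not the Avro source): encode the double into the shared scratch buffer, as {{writeFloat}} already does. Per the Avro spec, a double is written as the 8 little-endian bytes of its IEEE 754 bit pattern.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class ReusableBufferEncoder {
    // Shared scratch buffer, sized for floats, doubles, and large longs
    private final byte[] buf = new byte[12];
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();

    public void writeDouble(double d) throws IOException {
        long bits = Double.doubleToRawLongBits(d);
        for (int i = 0; i < 8; i++) {
            buf[i] = (byte) (bits >>> (8 * i)); // little-endian byte order
        }
        out.write(buf, 0, 8); // reuses buf: no per-call allocation
    }

    public byte[] toByteArray() {
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        ReusableBufferEncoder enc = new ReusableBufferEncoder();
        enc.writeDouble(1.0);
        System.out.println(enc.toByteArray().length); // prints "8"
    }
}
```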





[jira] [Updated] (AVRO-2054) Use StringBuilder instead of StringBuffer

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2054:
--
Status: Patch Available  (was: In Progress)

> Use StringBuilder instead of StringBuffer
> -
>
> Key: AVRO-2054
> URL: https://issues.apache.org/jira/browse/AVRO-2054
> Project: Avro
>  Issue Type: Improvement
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2054.1.patch, AVRO-2054.2.patch
>
>
> Use the unsynchronized StringBuilder instead of StringBuffer.  Use _char_ 
> values instead of Strings.
> {code:title=org.apache.trevni.MetaData}
>   @Override public String toString() {
>     StringBuffer buffer = new StringBuffer();
>     buffer.append("{ ");
>     for (Map.Entry<String,byte[]> e : entrySet()) {
>       buffer.append(e.getKey());
>       buffer.append("=");
>       try {
>         buffer.append(new String(e.getValue(), "ISO-8859-1"));
>       } catch (java.io.UnsupportedEncodingException error) {
>         throw new TrevniRuntimeException(error);
>       }
>       buffer.append(" ");
>     }
>     buffer.append("}");
>     return buffer.toString();
>   }
> {code}
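A self-contained sketch of the rewrite (hypothetical class name; the real method lives on the MetaData map itself): {{StringBuilder}} instead of {{StringBuffer}}, {{char}} appends instead of one-character strings, and {{StandardCharsets}} to avoid the checked {{UnsupportedEncodingException}} entirely.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class MetaDataToString {
    public static String render(Map<String, byte[]> entries) {
        StringBuilder buffer = new StringBuilder(); // unsynchronized, unlike StringBuffer
        buffer.append("{ ");
        for (Map.Entry<String, byte[]> e : entries.entrySet()) {
            buffer.append(e.getKey());
            buffer.append('=');  // char append, not a one-character String
            // StandardCharsets.ISO_8859_1 cannot throw UnsupportedEncodingException
            buffer.append(new String(e.getValue(), StandardCharsets.ISO_8859_1));
            buffer.append(' ');
        }
        buffer.append('}');
        return buffer.toString();
    }

    public static void main(String[] args) {
        Map<String, byte[]> m = new java.util.LinkedHashMap<>();
        m.put("codec", "null".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(render(m)); // prints "{ codec=null }"
    }
}
```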





[jira] [Updated] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2048:
--
Status: Patch Available  (was: In Progress)

> Avro Binary Decoding - Gracefully Handle Long Strings
> -
>
> Key: AVRO-2048
> URL: https://issues.apache.org/jira/browse/AVRO-2048
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2048.1.patch, AVRO-2048.2.patch, AVRO-2048.3.patch
>
>
> According to the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]:
> bq. a string is encoded as a *long* followed by that many bytes of UTF-8 
> encoded character data.
> However, that is currently not being adhered to:
> {code:title=org.apache.avro.io.BinaryDecoder}
>   @Override
>   public Utf8 readString(Utf8 old) throws IOException {
>     int length = readInt();
>     Utf8 result = (old != null ? old : new Utf8());
>     result.setByteLength(length);
>     if (0 != length) {
>       doReadBytes(result.getBytes(), 0, length);
>     }
>     return result;
>   }
> {code}
> The first thing the code does here is to load an *int* value, not a *long*.  
> Because of the variable-length nature of the size, this will mostly work.  
> However, there may be edge cases where the serializer is putting in large 
> length values erroneously or nefariously. Let us gracefully detect such 
> scenarios and more closely adhere to the spec.
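A self-contained sketch of the graceful handling being proposed (hypothetical names, not the actual patch): decode the length as a zig-zag varint *long*, as the spec requires, then validate it before using it as an array size.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SafeLengthReader {
    // Decode one zig-zag varint long, the Avro binary encoding for long
    static long readLong(InputStream in) throws IOException {
        long n = 0;
        int shift = 0;
        int b;
        do {
            b = in.read();
            if (b < 0) throw new IOException("unexpected end of input");
            n |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (n >>> 1) ^ -(n & 1); // undo zig-zag encoding
    }

    // Validate the decoded length before allocating anything with it
    static int readLength(InputStream in) throws IOException {
        long len = readLong(in);
        if (len < 0 || len > Integer.MAX_VALUE) {
            throw new IOException("invalid length: " + len);
        }
        return (int) len;
    }

    public static void main(String[] args) throws IOException {
        // zig-zag(3) == 6, so the single byte 0x06 decodes to length 3
        System.out.println(readLength(new ByteArrayInputStream(new byte[] { 6 }))); // prints "3"
    }
}
```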





[jira] [Updated] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2049:
--
Status: Patch Available  (was: In Progress)

> Remove Superfluous Configuration From AvroSerializer
> 
>
> Key: AVRO-2049
> URL: https://issues.apache.org/jira/browse/AVRO-2049
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch
>
>
> In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the 
> Avro block size is configured with a hard-coded value and there is a request 
> to benchmark different buffer sizes.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   /**
>* The block size for the Avro encoder.
>*
>* This number was copied from the AvroSerialization of 
> org.apache.avro.mapred in Avro 1.5.1.
>*
>* TODO(gwu): Do some benchmarking with different numbers here to see if it 
> is important.
>*/
>   private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;
>   /** An factory for creating Avro datum encoders. */
>   private static EncoderFactory mEncoderFactory
>   = new 
> EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
> {code}
> However, there is no need to benchmark: this setting is superfluous and is 
> ignored by the current implementation.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   @Override
>   public void open(OutputStream outputStream) throws IOException {
> mOutputStream = outputStream;
> mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
>   }
> {code}
> {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  
> This setting is only relevant for calls to 
> {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}} 
>  which considers the configured "Block Size" for doing binary encoding of 
> blocked Array types as laid out in the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  
> It can simply be removed.





[jira] [Updated] (AVRO-2054) Use StringBuilder instead of StringBuffer

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2054:
--
Status: In Progress  (was: Patch Available)

> Use StringBuilder instead of StringBuffer
> -
>
> Key: AVRO-2054
> URL: https://issues.apache.org/jira/browse/AVRO-2054
> Project: Avro
>  Issue Type: Improvement
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2054.1.patch, AVRO-2054.2.patch
>
>
> Use the unsynchronized StringBuilder instead of StringBuffer.  Use _char_ 
> values instead of Strings.
> {code:title=org.apache.trevni.MetaData}
>   @Override public String toString() {
> StringBuffer buffer = new StringBuffer();
> buffer.append("{ ");
> for (Map.Entry e : entrySet()) {
>   buffer.append(e.getKey());
>   buffer.append("=");
>   try {
> buffer.append(new String(e.getValue(), "ISO-8859-1"));
>   } catch (java.io.UnsupportedEncodingException error) {
> throw new TrevniRuntimeException(error);
>   }
>   buffer.append(" ");
> }
> buffer.append("}");
> return buffer.toString();
>   }
> {code}





[jira] [Updated] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2049:
--
Status: In Progress  (was: Patch Available)

> Remove Superfluous Configuration From AvroSerializer
> 
>
> Key: AVRO-2049
> URL: https://issues.apache.org/jira/browse/AVRO-2049
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch
>
>
> In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the 
> Avro block size is configured with a hard-coded value and there is a request 
> to benchmark different buffer sizes.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   /**
>* The block size for the Avro encoder.
>*
>* This number was copied from the AvroSerialization of 
> org.apache.avro.mapred in Avro 1.5.1.
>*
>* TODO(gwu): Do some benchmarking with different numbers here to see if it 
> is important.
>*/
>   private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;
>   /** An factory for creating Avro datum encoders. */
>   private static EncoderFactory mEncoderFactory
>   = new 
> EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
> {code}
> However, there is no need to benchmark: this setting is superfluous and is 
> ignored by the current implementation.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   @Override
>   public void open(OutputStream outputStream) throws IOException {
> mOutputStream = outputStream;
> mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
>   }
> {code}
> {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  
> This setting is only relevant for calls to 
> {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}} 
>  which considers the configured "Block Size" for doing binary encoding of 
> blocked Array types as laid out in the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  
> It can simply be removed.





[jira] [Updated] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2049:
--
Status: Patch Available  (was: In Progress)

> Remove Superfluous Configuration From AvroSerializer
> 
>
> Key: AVRO-2049
> URL: https://issues.apache.org/jira/browse/AVRO-2049
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch
>
>
> In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the 
> Avro block size is configured with a hard-coded value and there is a request 
> to benchmark different buffer sizes.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   /**
>* The block size for the Avro encoder.
>*
>* This number was copied from the AvroSerialization of 
> org.apache.avro.mapred in Avro 1.5.1.
>*
>* TODO(gwu): Do some benchmarking with different numbers here to see if it 
> is important.
>*/
>   private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;
>   /** An factory for creating Avro datum encoders. */
>   private static EncoderFactory mEncoderFactory
>   = new 
> EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
> {code}
> However, there is no need to benchmark: this setting is superfluous and is 
> ignored by the current implementation.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   @Override
>   public void open(OutputStream outputStream) throws IOException {
> mOutputStream = outputStream;
> mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
>   }
> {code}
> {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  
> This setting is only relevant for calls to 
> {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}} 
>  which considers the configured "Block Size" for doing binary encoding of 
> blocked Array types as laid out in the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  
> It can simply be removed.





[jira] [Commented] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-26 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101827#comment-16101827
 ] 

BELUGA BEHR commented on AVRO-2049:
---

[~nkollar] Maybe so, we could reuse the encoder.  I'm not sure how often {{open}} is 
actually called, though.  New ticket?

> Remove Superfluous Configuration From AvroSerializer
> 
>
> Key: AVRO-2049
> URL: https://issues.apache.org/jira/browse/AVRO-2049
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch
>
>
> In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the 
> Avro block size is configured with a hard-coded value and there is a request 
> to benchmark different buffer sizes.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   /**
>* The block size for the Avro encoder.
>*
>* This number was copied from the AvroSerialization of 
> org.apache.avro.mapred in Avro 1.5.1.
>*
>* TODO(gwu): Do some benchmarking with different numbers here to see if it 
> is important.
>*/
>   private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;
>   /** An factory for creating Avro datum encoders. */
>   private static EncoderFactory mEncoderFactory
>   = new 
> EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
> {code}
> However, there is no need to benchmark: this setting is superfluous and is 
> ignored by the current implementation.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   @Override
>   public void open(OutputStream outputStream) throws IOException {
> mOutputStream = outputStream;
> mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
>   }
> {code}
> {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  
> This setting is only relevant for calls to 
> {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}} 
>  which considers the configured "Block Size" for doing binary encoding of 
> blocked Array types as laid out in the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  
> It can simply be removed.





[jira] [Updated] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2056:
--
Status: In Progress  (was: Patch Available)

> DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
> ---
>
> Key: AVRO-2056
> URL: https://issues.apache.org/jira/browse/AVRO-2056
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2056.1.patch
>
>
> Each call to {{writeDouble}} creates a new buffer and promptly throws it away, 
> even though the class has a re-usable buffer that is used in other methods 
> such as {{writeFloat}}.  Remove this extra buffer.
> {code:title=org.apache.avro.io.DirectBinaryEncoder}
>   // the buffer is used for writing floats, doubles, and large longs.
>   private final byte[] buf = new byte[12];
>   @Override
>   public void writeFloat(float f) throws IOException {
> int len = BinaryData.encodeFloat(f, buf, 0);
> out.write(buf, 0, len);
>   }
>   @Override
>   public void writeDouble(double d) throws IOException {
> byte[] buf = new byte[8];
> int len = BinaryData.encodeDouble(d, buf, 0);
> out.write(buf, 0, len);
>   }
> {code}
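A minimal sketch of the fix the issue asks for: {{writeDouble}} reuses the instance-level scratch buffer, mirroring what {{writeFloat}} already does. The class name is illustrative, and {{BinaryData.encodeDouble}} is inlined here (little-endian IEEE-754 bits, as the Avro spec requires) so the example is self-contained; the real encoder writes to an arbitrary {{OutputStream}} and declares {{IOException}}.

```java
import java.io.ByteArrayOutputStream;

// Illustrative stand-in for DirectBinaryEncoder's double-writing path.
class ReusableBufferEncoder {
  // Shared scratch buffer for floats, doubles, and large longs; no per-call allocation.
  private final byte[] buf = new byte[12];
  private final ByteArrayOutputStream out = new ByteArrayOutputStream();

  void writeDouble(double d) {
    long bits = Double.doubleToLongBits(d);
    for (int i = 0; i < 8; i++) {
      // Little-endian byte order, per the Avro binary encoding spec.
      buf[i] = (byte) (bits >>> (8 * i));
    }
    out.write(buf, 0, 8);   // reuse buf instead of allocating byte[8] each call
  }

  byte[] toByteArray() {
    return out.toByteArray();
  }
}
```

The only behavioral difference from the quoted code is the removed allocation; the encoded bytes are identical.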



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AVRO-2056) DirectBinaryEncoder Creates Buffer For Each Call To writeDouble

2017-07-26 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101805#comment-16101805
 ] 

BELUGA BEHR commented on AVRO-2056:
---

[~gszadovszky] I'm sorry, but I do not understand your request to "add the Perf 
change."  The only change I made to the Perf test was to un-comment the direct 
binary encoder:

{code}
  Encoder e = encoder_factory.binaryEncoder(out, null);
//Encoder e = encoder_factory.directBinaryEncoder(out, null);
{code}

> DirectBinaryEncoder Creates Buffer For Each Call To writeDouble
> ---
>
> Key: AVRO-2056
> URL: https://issues.apache.org/jira/browse/AVRO-2056
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2056.1.patch
>
>
> Each call to {{writeDouble}} creates a new buffer and promptly throws it away, 
> even though the class has a re-usable buffer that is already used in other 
> methods such as {{writeFloat}}.  Remove this extra buffer.
> {code:title=org.apache.avro.io.DirectBinaryEncoder}
>   // the buffer is used for writing floats, doubles, and large longs.
>   private final byte[] buf = new byte[12];
>   @Override
>   public void writeFloat(float f) throws IOException {
> int len = BinaryData.encodeFloat(f, buf, 0);
> out.write(buf, 0, len);
>   }
>   @Override
>   public void writeDouble(double d) throws IOException {
> byte[] buf = new byte[8];
> int len = BinaryData.encodeDouble(d, buf, 0);
> out.write(buf, 0, len);
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-26 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101798#comment-16101798
 ] 

BELUGA BEHR commented on AVRO-2049:
---

[~gszadovszky] Good catch.   Thanks.  Perhaps we should create a new ticket to 
remove the magic number.

> Remove Superfluous Configuration From AvroSerializer
> 
>
> Key: AVRO-2049
> URL: https://issues.apache.org/jira/browse/AVRO-2049
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch
>
>
> In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the 
> Avro block size is configured with a hard-coded value and there is a request 
> to benchmark different buffer sizes.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   /**
>* The block size for the Avro encoder.
>*
>* This number was copied from the AvroSerialization of 
> org.apache.avro.mapred in Avro 1.5.1.
>*
>* TODO(gwu): Do some benchmarking with different numbers here to see if it 
> is important.
>*/
>   private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;
>   /** An factory for creating Avro datum encoders. */
>   private static EncoderFactory mEncoderFactory
>   = new 
> EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
> {code}
> However, there is no need to benchmark: this setting is superfluous and is 
> ignored by the current implementation.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   @Override
>   public void open(OutputStream outputStream) throws IOException {
> mOutputStream = outputStream;
> mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
>   }
> {code}
> {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  
> It is only relevant for calls to 
> {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}}, 
> which honors the configured block size when binary-encoding blocked array 
> types as laid out in the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  
> The configuration can simply be removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2049:
--
Attachment: AVRO-2049.3.patch

> Remove Superfluous Configuration From AvroSerializer
> 
>
> Key: AVRO-2049
> URL: https://issues.apache.org/jira/browse/AVRO-2049
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch
>
>
> In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the 
> Avro block size is configured with a hard-coded value and there is a request 
> to benchmark different buffer sizes.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   /**
>* The block size for the Avro encoder.
>*
>* This number was copied from the AvroSerialization of 
> org.apache.avro.mapred in Avro 1.5.1.
>*
>* TODO(gwu): Do some benchmarking with different numbers here to see if it 
> is important.
>*/
>   private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;
>   /** An factory for creating Avro datum encoders. */
>   private static EncoderFactory mEncoderFactory
>   = new 
> EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
> {code}
> However, there is no need to benchmark: this setting is superfluous and is 
> ignored by the current implementation.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   @Override
>   public void open(OutputStream outputStream) throws IOException {
> mOutputStream = outputStream;
> mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
>   }
> {code}
> {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  
> It is only relevant for calls to 
> {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}}, 
> which honors the configured block size when binary-encoding blocked array 
> types as laid out in the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  
> The configuration can simply be removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2049:
--
Status: In Progress  (was: Patch Available)

> Remove Superfluous Configuration From AvroSerializer
> 
>
> Key: AVRO-2049
> URL: https://issues.apache.org/jira/browse/AVRO-2049
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.8.2, 1.7.7
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch, AVRO-2049.3.patch
>
>
> In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the 
> Avro block size is configured with a hard-coded value and there is a request 
> to benchmark different buffer sizes.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   /**
>* The block size for the Avro encoder.
>*
>* This number was copied from the AvroSerialization of 
> org.apache.avro.mapred in Avro 1.5.1.
>*
>* TODO(gwu): Do some benchmarking with different numbers here to see if it 
> is important.
>*/
>   private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;
>   /** An factory for creating Avro datum encoders. */
>   private static EncoderFactory mEncoderFactory
>   = new 
> EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
> {code}
> However, there is no need to benchmark: this setting is superfluous and is 
> ignored by the current implementation.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   @Override
>   public void open(OutputStream outputStream) throws IOException {
> mOutputStream = outputStream;
> mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
>   }
> {code}
> {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  
> It is only relevant for calls to 
> {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}}, 
> which honors the configured block size when binary-encoding blocked array 
> types as laid out in the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  
> The configuration can simply be removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings

2017-07-26 Thread BELUGA BEHR (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101786#comment-16101786
 ] 

BELUGA BEHR commented on AVRO-2048:
---

[~gszadovszky] I borrowed the same implementation from 
[here|http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/tip/src/share/classes/java/util/ArrayList.java#l190]

> Avro Binary Decoding - Gracefully Handle Long Strings
> -
>
> Key: AVRO-2048
> URL: https://issues.apache.org/jira/browse/AVRO-2048
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2048.1.patch, AVRO-2048.2.patch, AVRO-2048.3.patch
>
>
> According to the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]:
> bq. a string is encoded as a *long* followed by that many bytes of UTF-8 
> encoded character data.
> However, that is currently not being adhered to:
> {code:title=org.apache.avro.io.BinaryDecoder}
>   @Override
>   public Utf8 readString(Utf8 old) throws IOException {
> int length = readInt();
> Utf8 result = (old != null ? old : new Utf8());
> result.setByteLength(length);
> if (0 != length) {
>   doReadBytes(result.getBytes(), 0, length);
> }
> return result;
>   }
> {code}
> The first thing the code does here is read an *int* value, not a *long*.  
> Because of the variable-length nature of the size, this will mostly work.  
> However, there may be edge cases where the serializer writes large length 
> values, erroneously or nefariously. Let us gracefully detect such scenarios 
> and more closely adhere to the spec.
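A minimal sketch of the graceful check the issue describes: read the length as a long (per the spec) and reject values a Java {{byte[]}} cannot hold before any allocation happens. The class and method names are illustrative, not Avro's actual API; {{MAX_ARRAY_SIZE}} mirrors the headroom {{java.util.ArrayList}} leaves for VM array-header words.

```java
import java.io.IOException;

// Illustrative length validation for a binary decoder reading a spec-mandated long.
class StringLengthCheck {
  // Some VMs reserve header words in an array, so the practical maximum
  // is slightly below Integer.MAX_VALUE (cf. java.util.ArrayList.MAX_ARRAY_SIZE).
  static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

  // Narrows a decoded long length to int, failing fast on malformed input.
  static int checkedLength(long length) throws IOException {
    if (length < 0 || length > MAX_ARRAY_SIZE) {
      throw new IOException("Malformed data: string length " + length
          + " is outside the supported range [0, " + MAX_ARRAY_SIZE + "]");
    }
    return (int) length;
  }
}
```

A {{readString}} implementation would call {{checkedLength(readLong())}} before sizing its buffer, instead of trusting a raw {{readInt()}}.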



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings

2017-07-26 Thread BELUGA BEHR (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated AVRO-2048:
--
Attachment: AVRO-2048.3.patch

> Avro Binary Decoding - Gracefully Handle Long Strings
> -
>
> Key: AVRO-2048
> URL: https://issues.apache.org/jira/browse/AVRO-2048
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2048.1.patch, AVRO-2048.2.patch, AVRO-2048.3.patch
>
>
> According to the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]:
> bq. a string is encoded as a *long* followed by that many bytes of UTF-8 
> encoded character data.
> However, that is currently not being adhered to:
> {code:title=org.apache.avro.io.BinaryDecoder}
>   @Override
>   public Utf8 readString(Utf8 old) throws IOException {
> int length = readInt();
> Utf8 result = (old != null ? old : new Utf8());
> result.setByteLength(length);
> if (0 != length) {
>   doReadBytes(result.getBytes(), 0, length);
> }
> return result;
>   }
> {code}
> The first thing the code does here is read an *int* value, not a *long*.  
> Because of the variable-length nature of the size, this will mostly work.  
> However, there may be edge cases where the serializer writes large length 
> values, erroneously or nefariously. Let us gracefully detect such scenarios 
> and more closely adhere to the spec.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AVRO-2048) Avro Binary Decoding - Gracefully Handle Long Strings

2017-07-26 Thread Gabor Szadovszky (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101657#comment-16101657
 ] 

Gabor Szadovszky commented on AVRO-2048:


[~belugabehr], the actual string size (or any array size) usually cannot be 
{{Integer.MAX_VALUE}}, as "_Some VMs reserve some header words in an array._" 
See e.g. {{java.util.ArrayList.MAX_ARRAY_SIZE}}.

> Avro Binary Decoding - Gracefully Handle Long Strings
> -
>
> Key: AVRO-2048
> URL: https://issues.apache.org/jira/browse/AVRO-2048
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: AVRO-2048.1.patch, AVRO-2048.2.patch
>
>
> According to the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_primitive]:
> bq. a string is encoded as a *long* followed by that many bytes of UTF-8 
> encoded character data.
> However, that is currently not being adhered to:
> {code:title=org.apache.avro.io.BinaryDecoder}
>   @Override
>   public Utf8 readString(Utf8 old) throws IOException {
> int length = readInt();
> Utf8 result = (old != null ? old : new Utf8());
> result.setByteLength(length);
> if (0 != length) {
>   doReadBytes(result.getBytes(), 0, length);
> }
> return result;
>   }
> {code}
> The first thing the code does here is read an *int* value, not a *long*.  
> Because of the variable-length nature of the size, this will mostly work.  
> However, there may be edge cases where the serializer writes large length 
> values, erroneously or nefariously. Let us gracefully detect such scenarios 
> and more closely adhere to the spec.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-26 Thread Gabor Szadovszky (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101604#comment-16101604
 ] 

Gabor Szadovszky commented on AVRO-2049:


[~belugabehr], as far as I can see {{ResolvingGrammarGenerator}} sets the 
_bufferSize_ (not the _blockSize_) which is used when a new _binaryEncoder_ is 
created. With your change the _bufferSize_ of the _binaryEncoder_ will be 
{{2048}} instead of {{32}} in {{ResolvingGrammarGenerator.getBinary}}.

> Remove Superfluous Configuration From AvroSerializer
> 
>
> Key: AVRO-2049
> URL: https://issues.apache.org/jira/browse/AVRO-2049
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch
>
>
> In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the 
> Avro block size is configured with a hard-coded value and there is a request 
> to benchmark different buffer sizes.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   /**
>* The block size for the Avro encoder.
>*
>* This number was copied from the AvroSerialization of 
> org.apache.avro.mapred in Avro 1.5.1.
>*
>* TODO(gwu): Do some benchmarking with different numbers here to see if it 
> is important.
>*/
>   private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;
>   /** An factory for creating Avro datum encoders. */
>   private static EncoderFactory mEncoderFactory
>   = new 
> EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
> {code}
> However, there is no need to benchmark: this setting is superfluous and is 
> ignored by the current implementation.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   @Override
>   public void open(OutputStream outputStream) throws IOException {
> mOutputStream = outputStream;
> mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
>   }
> {code}
> {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  
> It is only relevant for calls to 
> {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}}, 
> which honors the configured block size when binary-encoding blocked array 
> types as laid out in the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  
> The configuration can simply be removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AVRO-2049) Remove Superfluous Configuration From AvroSerializer

2017-07-26 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101553#comment-16101553
 ] 

Nandor Kollar commented on AVRO-2049:
-

Looks good to me; just one minor comment: in {{AvroSerialization}} I think we 
can reuse the {{encoder}} by passing it to the factory method instead of null: 
{{this.encoder = new EncoderFactory().binaryEncoder(out, encoder);}}

> Remove Superfluous Configuration From AvroSerializer
> 
>
> Key: AVRO-2049
> URL: https://issues.apache.org/jira/browse/AVRO-2049
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2049.1.patch, AVRO-2049.2.patch
>
>
> In the class {{org.apache.avro.hadoop.io.AvroSerializer}}, we see that the 
> Avro block size is configured with a hard-coded value and there is a request 
> to benchmark different buffer sizes.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   /**
>* The block size for the Avro encoder.
>*
>* This number was copied from the AvroSerialization of 
> org.apache.avro.mapred in Avro 1.5.1.
>*
>* TODO(gwu): Do some benchmarking with different numbers here to see if it 
> is important.
>*/
>   private static final int AVRO_ENCODER_BLOCK_SIZE_BYTES = 512;
>   /** An factory for creating Avro datum encoders. */
>   private static EncoderFactory mEncoderFactory
>   = new 
> EncoderFactory().configureBlockSize(AVRO_ENCODER_BLOCK_SIZE_BYTES);
> {code}
> However, there is no need to benchmark: this setting is superfluous and is 
> ignored by the current implementation.
> {code:title=org.apache.avro.hadoop.io.AvroSerializer}
>   @Override
>   public void open(OutputStream outputStream) throws IOException {
> mOutputStream = outputStream;
> mAvroEncoder = mEncoderFactory.binaryEncoder(outputStream, mAvroEncoder);
>   }
> {code}
> {{org.apache.avro.io.EncoderFactory.binaryEncoder}} ignores this setting.  
> It is only relevant for calls to 
> {{org.apache.avro.io.EncoderFactory.blockingBinaryEncoder}}, 
> which honors the configured block size when binary-encoding blocked array 
> types as laid out in the 
> [specs|https://avro.apache.org/docs/1.8.2/spec.html#binary_encode_complex].  
> The configuration can simply be removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AVRO-2053) Remove Reference To Deprecated Property mapred.output.compression.type

2017-07-26 Thread Gabor Szadovszky (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101457#comment-16101457
 ] 

Gabor Szadovszky commented on AVRO-2053:


The property {{mapred.output.compression.type}} was deprecated in Hadoop 
{{2.7.3}}. Based on the current {{pom.xml}} we still support hadoop1, which 
uses this property. So the question is when we want to officially drop hadoop1 
support.
Maybe we should only add this change to the next major release ({{1.9.0}}), 
after the hadoop1 dependency is removed.

> Remove Reference To Deprecated Property mapred.output.compression.type
> --
>
> Key: AVRO-2053
> URL: https://issues.apache.org/jira/browse/AVRO-2053
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2053.1.patch
>
>
> Avro utilizes 
> [deprecated|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/DeprecatedProperties.html]
>  property _mapred.output.compression.type_.  Update the code to use the MRv2 
> property and do not override default behaviors/settings.  Use the appropriate 
> facilities from {{org.apache.hadoop.mapreduce.lib.output.FileOutputFormat}} 
> and {{org.apache.hadoop.io.SequenceFile}}.
> {code:title=org.apache.avro.mapreduce.AvroSequenceFileOutputFormat}
>   /** Configuration key for storing the type of compression for the target 
> sequence file. */
>   private static final String CONF_COMPRESSION_TYPE = 
> "mapred.output.compression.type";
>   /** The default compression type for the target sequence file. */
>   private static final CompressionType DEFAULT_COMPRESSION_TYPE = 
> CompressionType.RECORD;
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AVRO-2054) Use StringBuilder instead of StringBuffer

2017-07-26 Thread Gabor Szadovszky (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101405#comment-16101405
 ] 

Gabor Szadovszky commented on AVRO-2054:


+1

> Use StringBuilder instead of StringBuffer
> -
>
> Key: AVRO-2054
> URL: https://issues.apache.org/jira/browse/AVRO-2054
> Project: Avro
>  Issue Type: Improvement
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2054.1.patch, AVRO-2054.2.patch
>
>
> Use the un-synchronized StringBuilder instead of StringBuffer.  Use _char_ 
> values instead of Strings.
> {code:title=org.apache.trevni.MetaData}
>   @Override public String toString() {
> StringBuffer buffer = new StringBuffer();
> buffer.append("{ ");
> for (Map.Entry e : entrySet()) {
>   buffer.append(e.getKey());
>   buffer.append("=");
>   try {
> buffer.append(new String(e.getValue(), "ISO-8859-1"));
>   } catch (java.io.UnsupportedEncodingException error) {
> throw new TrevniRuntimeException(error);
>   }
>   buffer.append(" ");
> }
> buffer.append("}");
> return buffer.toString();
>   }
> {code}
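A minimal sketch of the rewrite the issue proposes: an un-synchronized {{StringBuilder}}, single-{{char}} appends, and {{StandardCharsets}} instead of a charset lookup that forces a checked-exception handler. The class and method names are illustrative; the map stands in for the Trevni {{MetaData}} entries.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Illustrative stand-in for org.apache.trevni.MetaData#toString.
class MetaDataToString {
  static String render(Map<String, byte[]> entries) {
    StringBuilder buffer = new StringBuilder();   // un-synchronized, unlike StringBuffer
    buffer.append("{ ");
    for (Map.Entry<String, byte[]> e : entries.entrySet()) {
      buffer.append(e.getKey());
      buffer.append('=');                         // char append, not a String literal
      // StandardCharsets.ISO_8859_1 never throws UnsupportedEncodingException,
      // so the try/catch from the original snippet disappears.
      buffer.append(new String(e.getValue(), StandardCharsets.ISO_8859_1));
      buffer.append(' ');
    }
    buffer.append('}');
    return buffer.toString();
  }
}
```

The output format is byte-for-byte identical to the quoted {{StringBuffer}} version; only the synchronization overhead and the checked-exception plumbing are removed.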



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AVRO-2055) Remove Magic Value From org.apache.avro.hadoop.io.AvroSequenceFile

2017-07-26 Thread Gabor Szadovszky (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101345#comment-16101345
 ] 

Gabor Szadovszky commented on AVRO-2055:


+1

> Remove Magic Value From org.apache.avro.hadoop.io.AvroSequenceFile
> --
>
> Key: AVRO-2055
> URL: https://issues.apache.org/jira/browse/AVRO-2055
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.7, 1.8.2
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: AVRO-2055.1.patch
>
>
> Remove the magic string _io.file.buffer.size_ and _DEFAULT_BUFFER_SIZE_BYTES_, and 
> instead rely on the Hadoop libraries to provide this information.  This will 
> help keep Avro in sync with changes in Hadoop.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)