[jira] [Updated] (PARQUET-642) Improve performance of ByteBuffer based read / write paths

2018-04-21 Thread Gabor Szadovszky (JIRA)

 [ https://issues.apache.org/jira/browse/PARQUET-642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky updated PARQUET-642:
-------------------------------------
Fix Version/s: 1.8.2

> Improve performance of ByteBuffer based read / write paths
> -----------------------------------------------------------
>
> Key: PARQUET-642
> URL: https://issues.apache.org/jira/browse/PARQUET-642
> Project: Parquet
>  Issue Type: Bug
>Reporter: Piyush Narang
>Assignee: Piyush Narang
>Priority: Major
> Fix For: 1.9.0, 1.8.2
>
>
> While trying out the newest Parquet version, we noticed that the changes to
> start using ByteBuffers:
> https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8
> and
> https://github.com/apache/parquet-mr/commit/6b24a1d1b5e2792a7821ad172a45e38d2b04f9b8
> (mostly Avro, but a couple of ByteBuffer changes) slowed our jobs down a bit:
> Read overhead: 4-6% (MB_Millis)
> Write overhead: 6-10% (MB_Millis)
> This seems to be due to the encoding / decoding of Strings in the Binary class
> (https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java):
> toStringUsingUTF8() - for reads
> encodeUTF8() - for writes
> In those methods we use the nio Charsets for encoding / decoding:
> {code}
> private static ByteBuffer encodeUTF8(CharSequence value) {
>   try {
>     return ENCODER.get().encode(CharBuffer.wrap(value));
>   } catch (CharacterCodingException e) {
>     throw new ParquetEncodingException("UTF-8 not supported.", e);
>   }
> }
> ...
> @Override
> public String toStringUsingUTF8() {
>   int limit = value.limit();
>   value.limit(offset + length);
>   int position = value.position();
>   value.position(offset);
>   // no corresponding interface to read a subset of a buffer; would have to slice it,
>   // which creates another ByteBuffer object, or do what is done here to adjust the
>   // limit/offset and set them back after
>   String ret = UTF8.decode(value).toString();
>   value.limit(limit);
>   value.position(position);
>   return ret;
> }
> {code}
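> For comparison, the mutate-and-restore dance above could also be avoided by decoding a duplicate() view of the buffer. This is only a sketch under the same value / offset / length fields (not code from any patch), and it pays for the simplicity with one extra ByteBuffer object per call:
> {code}
> // Sketch only: duplicate() shares the bytes but has independent
> // position/limit, so the shared buffer's cursors never need to be
> // saved and restored. Costs one extra ByteBuffer allocation per call.
> public String toStringUsingUTF8Duplicate() {
>   ByteBuffer view = value.duplicate();
>   view.position(offset);
>   view.limit(offset + length);
>   return UTF8.decode(view).toString();
> }
> {code}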
> We tried some micro and macro benchmarks, and switching to the String class
> for encoding / decoding improves performance:
> {code}
> @Override
> public String toStringUsingUTF8() {
>   String ret;
>   if (value.hasArray()) {
>     try {
>       ret = new String(value.array(), value.arrayOffset() + offset, length, "UTF-8");
>     } catch (UnsupportedEncodingException e) {
>       throw new ParquetDecodingException("UTF-8 not supported", e);
>     }
>   } else {
>     int limit = value.limit();
>     value.limit(offset + length);
>     int position = value.position();
>     value.position(offset);
>     // no corresponding interface to read a subset of a buffer; would have to slice it,
>     // which creates another ByteBuffer object, or do what is done here to adjust the
>     // limit/offset and set them back after
>     ret = UTF8.decode(value).toString();
>     value.limit(limit);
>     value.position(position);
>   }
>   return ret;
> }
> ...
> private static ByteBuffer encodeUTF8(String value) {
>   try {
>     return ByteBuffer.wrap(value.getBytes("UTF-8"));
>   } catch (UnsupportedEncodingException e) {
>     throw new ParquetEncodingException("UTF-8 not supported.", e);
>   }
> }
> {code}
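> For reference, a minimal JMH sketch of the kind of micro benchmark that can be used to compare the two decode paths might look like the following; the class name and sample data are assumptions for illustration, not the benchmark we actually ran:
> {code}
> import java.nio.ByteBuffer;
> import java.nio.charset.StandardCharsets;
> import org.openjdk.jmh.annotations.Benchmark;
> import org.openjdk.jmh.annotations.Scope;
> import org.openjdk.jmh.annotations.Setup;
> import org.openjdk.jmh.annotations.State;
>
> // Hypothetical benchmark comparing Charset-based and String-based UTF-8
> // decoding on a heap-backed buffer; run with the standard JMH runner.
> @State(Scope.Thread)
> public class Utf8DecodeBench {
>   private ByteBuffer value;
>
>   @Setup
>   public void setup() {
>     value = ByteBuffer.wrap(
>         "a reasonably long mixed ascii / utf-8 sample \u00e9\u00e8\u20ac"
>             .getBytes(StandardCharsets.UTF_8));
>   }
>
>   @Benchmark
>   public String decodeViaCharset() {
>     // decode a duplicate so the shared buffer's cursors stay untouched
>     return StandardCharsets.UTF_8.decode(value.duplicate()).toString();
>   }
>
>   @Benchmark
>   public String decodeViaString() {
>     // the array-backed fast path used by the proposed change
>     return new String(value.array(), value.arrayOffset(), value.remaining(),
>         StandardCharsets.UTF_8);
>   }
> }
> {code}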



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-642) Improve performance of ByteBuffer based read / write paths

2016-06-30 Thread Julien Le Dem (JIRA)

 [ https://issues.apache.org/jira/browse/PARQUET-642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem updated PARQUET-642:
----------------------------------
Assignee: Piyush Narang

> Improve performance of ByteBuffer based read / write paths
> -----------------------------------------------------------
>
> Key: PARQUET-642
> URL: https://issues.apache.org/jira/browse/PARQUET-642
> Project: Parquet
>  Issue Type: Bug
>Reporter: Piyush Narang
>Assignee: Piyush Narang
> Fix For: 1.9.0
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)