[jira] [Updated] (PARQUET-642) Improve performance of ByteBuffer based read / write paths
[ https://issues.apache.org/jira/browse/PARQUET-642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky updated PARQUET-642:
-------------------------------------
    Fix Version/s: 1.8.2

> Improve performance of ByteBuffer based read / write paths
> ----------------------------------------------------------
>
>          Key: PARQUET-642
>          URL: https://issues.apache.org/jira/browse/PARQUET-642
>      Project: Parquet
>   Issue Type: Bug
>     Reporter: Piyush Narang
>     Assignee: Piyush Narang
>     Priority: Major
>      Fix For: 1.9.0, 1.8.2
>
>
> While trying out the newest Parquet version, we noticed that the changes to
> start using ByteBuffers:
> https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8
> and
> https://github.com/apache/parquet-mr/commit/6b24a1d1b5e2792a7821ad172a45e38d2b04f9b8
> (mostly Avro, but also a couple of ByteBuffer changes) caused our jobs to
> slow down a bit:
> Read overhead: 4-6% (in MB_Millis)
> Write overhead: 6-10% (MB_Millis)
> This seems to be due to the encoding / decoding of Strings in the Binary
> class
> (https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java):
> - toStringUsingUTF8() - for reads
> - encodeUTF8() - for writes
> In those methods we're using the NIO charsets for encode / decode:
> {code}
> private static ByteBuffer encodeUTF8(CharSequence value) {
>   try {
>     return ENCODER.get().encode(CharBuffer.wrap(value));
>   } catch (CharacterCodingException e) {
>     throw new ParquetEncodingException("UTF-8 not supported.", e);
>   }
> }
> ...
> @Override
> public String toStringUsingUTF8() {
>   int limit = value.limit();
>   value.limit(offset + length);
>   int position = value.position();
>   value.position(offset);
>   // no corresponding interface to read a subset of a buffer; we would have
>   // to slice it, which creates another ByteBuffer object, or do what is
>   // done here: adjust the limit/position and set them back after
>   String ret = UTF8.decode(value).toString();
>   value.limit(limit);
>   value.position(position);
>   return ret;
> }
> {code}
> We tried out some micro / macro benchmarks, and it seems that switching
> those over to using the String class for the encoding / decoding improves
> performance:
> {code}
> @Override
> public String toStringUsingUTF8() {
>   String ret;
>   if (value.hasArray()) {
>     try {
>       ret = new String(value.array(), value.arrayOffset() + offset, length, "UTF-8");
>     } catch (UnsupportedEncodingException e) {
>       throw new ParquetDecodingException("UTF-8 not supported");
>     }
>   } else {
>     int limit = value.limit();
>     value.limit(offset + length);
>     int position = value.position();
>     value.position(offset);
>     // no corresponding interface to read a subset of a buffer; we would
>     // have to slice it, which creates another ByteBuffer object, or do
>     // what is done here: adjust the limit/position and set them back after
>     ret = UTF8.decode(value).toString();
>     value.limit(limit);
>     value.position(position);
>   }
>   return ret;
> }
> ...
> private static ByteBuffer encodeUTF8(String value) {
>   try {
>     return ByteBuffer.wrap(value.getBytes("UTF-8"));
>   } catch (UnsupportedEncodingException e) {
>     throw new ParquetEncodingException("UTF-8 not supported.", e);
>   }
> }
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
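The approach described above can be sketched as a standalone helper. This is not the actual parquet-mr `Binary` code: `Utf8Decode` and its method parameters are hypothetical names, and it uses `java.nio.charset.StandardCharsets` in place of the issue's `UTF8` / `ENCODER` fields, avoiding the checked `UnsupportedEncodingException` entirely. It shows the heap-backed fast path (decode straight from the backing array via the `String` constructor) and the fallback path that temporarily adjusts the buffer's limit and position:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical standalone sketch of the fast/slow UTF-8 decode paths
// discussed in the issue; not the parquet-mr Binary implementation.
public class Utf8Decode {

    static String toStringUsingUTF8(ByteBuffer value, int offset, int length) {
        if (value.hasArray()) {
            // Fast path: heap-backed buffer, so decode directly from the
            // backing array with no intermediate slice or CharBuffer.
            return new String(value.array(), value.arrayOffset() + offset,
                    length, StandardCharsets.UTF_8);
        }
        // Slow path (e.g. direct buffers): there is no API to decode a
        // sub-range, so narrow the window via limit/position, decode,
        // then restore the original limit and position.
        int limit = value.limit();
        int position = value.position();
        value.limit(offset + length);
        value.position(offset);
        String ret = StandardCharsets.UTF_8.decode(value).toString();
        value.limit(limit);
        value.position(position);
        return ret;
    }

    static ByteBuffer encodeUTF8(String value) {
        // String.getBytes(Charset) replaces the pooled CharsetEncoder.
        return ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] bytes = "hello parquet".getBytes(StandardCharsets.UTF_8);
        // Heap buffer exercises the fast path.
        System.out.println(toStringUsingUTF8(ByteBuffer.wrap(bytes), 6, 7));
        // Direct buffer exercises the limit/position fallback.
        ByteBuffer direct = ByteBuffer.allocateDirect(bytes.length);
        direct.put(bytes).flip();
        System.out.println(toStringUsingUTF8(direct, 0, 5));
    }
}
```

Note that the fallback mutates and restores the buffer's limit and position, so like the original code it is not safe for concurrent readers of the same ByteBuffer instance.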
[jira] [Updated] (PARQUET-642) Improve performance of ByteBuffer based read / write paths
[ https://issues.apache.org/jira/browse/PARQUET-642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem updated PARQUET-642:
----------------------------------
    Assignee: Piyush Narang

> Improve performance of ByteBuffer based read / write paths
> ----------------------------------------------------------
>
>          Key: PARQUET-642
>          URL: https://issues.apache.org/jira/browse/PARQUET-642
>      Project: Parquet
>   Issue Type: Bug
>     Reporter: Piyush Narang
>     Assignee: Piyush Narang
>      Fix For: 1.9.0
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)