Piyush Narang created PARQUET-642:
-------------------------------------
Summary: Improve performance of ByteBuffer based read / write paths
Key: PARQUET-642
URL: https://issues.apache.org/jira/browse/PARQUET-642
Project: Parquet
Issue Type: Bug
Reporter: Piyush Narang
While trying out the newest Parquet version, we noticed that the changes to
start using ByteBuffers:
https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8
and
https://github.com/apache/parquet-mr/commit/6b24a1d1b5e2792a7821ad172a45e38d2b04f9b8
(mostly Avro, but with a couple of ByteBuffer changes) caused our jobs to slow
down:
Read overhead: 4-6% (in MB-millis)
Write overhead: 6-10% (in MB-millis)
This seems to be due to the encoding / decoding of Strings in the Binary
class
(https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java):
toStringUsingUTF8() - for reads
encodeUTF8() - for writes
In those methods we're using the nio Charsets to encode / decode:
{code}
private static ByteBuffer encodeUTF8(CharSequence value) {
  try {
    return ENCODER.get().encode(CharBuffer.wrap(value));
  } catch (CharacterCodingException e) {
    throw new ParquetEncodingException("UTF-8 not supported.", e);
  }
}
...
@Override
public String toStringUsingUTF8() {
  int limit = value.limit();
  value.limit(offset + length);
  int position = value.position();
  value.position(offset);
  // no corresponding interface to read a subset of a buffer; would have to slice it,
  // which creates another ByteBuffer object, or do what is done here to adjust the
  // limit/offset and set them back after
  String ret = UTF8.decode(value).toString();
  value.limit(limit);
  value.position(position);
  return ret;
}
{code}
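For readers without the Parquet source at hand, the limit/position bookkeeping above can be reproduced in a small standalone sketch. The class and method names (Utf8DecodeSketch, decodeViaCharset, decodeViaString) are ours for illustration, not Parquet's; the logic mirrors the two decode strategies being compared.
{code}
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class Utf8DecodeSketch {
  // Decode a [offset, offset+length) slice by temporarily adjusting the
  // buffer's limit and position, then restoring them (the current approach).
  static String decodeViaCharset(ByteBuffer value, int offset, int length) {
    int limit = value.limit();
    int position = value.position();
    value.limit(offset + length);
    value.position(offset);
    String ret = StandardCharsets.UTF_8.decode(value).toString();
    value.limit(limit);
    value.position(position);
    return ret;
  }

  // Decode via the String constructor when the buffer is heap-backed
  // (the proposed approach); fall back to the charset path otherwise.
  static String decodeViaString(ByteBuffer value, int offset, int length) {
    if (value.hasArray()) {
      return new String(value.array(), value.arrayOffset() + offset, length,
          StandardCharsets.UTF_8);
    }
    return decodeViaCharset(value, offset, length);
  }

  public static void main(String[] args) {
    ByteBuffer buf =
        ByteBuffer.wrap("xxhello worldxx".getBytes(StandardCharsets.UTF_8));
    String a = decodeViaCharset(buf, 2, 11);
    String b = decodeViaString(buf, 2, 11);
    System.out.println(a + " / " + b); // both "hello world"
    if (!a.equals(b)) throw new AssertionError("decodes differ");
  }
}
{code}
Both paths produce the same String; the difference is that the String-constructor path avoids the CharsetDecoder machinery and the buffer mutation entirely when a backing array is available.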
We tried some micro and macro benchmarks, and switching these over to the
String class for encoding / decoding improves performance:
{code}
@Override
public String toStringUsingUTF8() {
  String ret;
  if (value.hasArray()) {
    try {
      ret = new String(value.array(), value.arrayOffset() + offset, length, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new ParquetDecodingException("UTF-8 not supported");
    }
  } else {
    int limit = value.limit();
    value.limit(offset + length);
    int position = value.position();
    value.position(offset);
    // no corresponding interface to read a subset of a buffer; would have to slice it,
    // which creates another ByteBuffer object, or do what is done here to adjust the
    // limit/offset and set them back after
    ret = UTF8.decode(value).toString();
    value.limit(limit);
    value.position(position);
  }
  return ret;
}
...
private static ByteBuffer encodeUTF8(String value) {
  try {
    return ByteBuffer.wrap(value.getBytes("UTF-8"));
  } catch (UnsupportedEncodingException e) {
    throw new ParquetEncodingException("UTF-8 not supported.", e);
  }
}
{code}
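To reproduce the encode-side comparison outside Parquet, here is a rough standalone sketch (class and method names are ours; timings are illustrative only, a real comparison should use a harness like JMH):
{code}
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class Utf8EncodeBench {
  // Thread-local encoder, mirroring the ENCODER used in Binary.
  private static final ThreadLocal<CharsetEncoder> ENCODER =
      ThreadLocal.withInitial(StandardCharsets.UTF_8::newEncoder);

  // Current approach: nio CharsetEncoder.
  static ByteBuffer encodeWithCharset(String value) {
    try {
      return ENCODER.get().encode(CharBuffer.wrap(value));
    } catch (CharacterCodingException e) {
      throw new RuntimeException("UTF-8 not supported.", e);
    }
  }

  // Proposed approach: String.getBytes.
  static ByteBuffer encodeWithString(String value) {
    return ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    String s = "hello, parquet \u00e9\u00e8";
    // Sanity check: both strategies must produce identical bytes.
    if (!encodeWithCharset(s).equals(encodeWithString(s))) {
      throw new AssertionError("encodings differ");
    }
    int n = 200_000;
    long t0 = System.nanoTime();
    for (int i = 0; i < n; i++) encodeWithCharset(s);
    long charsetNs = System.nanoTime() - t0;
    t0 = System.nanoTime();
    for (int i = 0; i < n; i++) encodeWithString(s);
    long stringNs = System.nanoTime() - t0;
    System.out.println("charset: " + charsetNs + " ns, string: " + stringNs + " ns");
  }
}
{code}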
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)