Piyush Narang created PARQUET-642:
-------------------------------------

             Summary: Improve performance of ByteBuffer based read / write paths
                 Key: PARQUET-642
                 URL: https://issues.apache.org/jira/browse/PARQUET-642
             Project: Parquet
          Issue Type: Bug
            Reporter: Piyush Narang


While trying out the newest Parquet version, we noticed that the changes to 
start using ByteBuffers:
https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8
and
https://github.com/apache/parquet-mr/commit/6b24a1d1b5e2792a7821ad172a45e38d2b04f9b8
(mostly Avro, but with a couple of ByteBuffer changes) caused our jobs to slow down:
Read overhead: 4-6% (in MB_Millis)
Write overhead: 6-10% (in MB_Millis)

This seems to be due to the encoding / decoding of Strings in the Binary class
(https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java):
toStringUsingUTF8() - for reads
encodeUTF8() - for writes

In those methods we're using the nio Charset encoder / decoder:
{code}
    private static ByteBuffer encodeUTF8(CharSequence value) {
      try {
        return ENCODER.get().encode(CharBuffer.wrap(value));
      } catch (CharacterCodingException e) {
        throw new ParquetEncodingException("UTF-8 not supported.", e);
      }
    }
...
    @Override
    public String toStringUsingUTF8() {
      int limit = value.limit();
      value.limit(offset + length);
      int position = value.position();
      value.position(offset);
      // no corresponding interface to read a subset of a buffer, would have to slice it
      // which creates another ByteBuffer object or do what is done here to adjust the
      // limit/offset and set them back after
      String ret = UTF8.decode(value).toString();
      value.limit(limit);
      value.position(position);
      return ret;
    }
{code}
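
For reference, the UTF8 and ENCODER referenced above are, roughly, a shared Charset plus a per-thread CharsetEncoder. This is a sketch from memory of Binary.java, not the exact source:
{code}
// Approximate sketch of the surrounding fields in Binary.java (for context only):
private static final Charset UTF8 = Charset.forName("UTF-8");

// CharsetEncoder is stateful and not thread-safe, so one instance per thread.
private static final ThreadLocal<CharsetEncoder> ENCODER =
    new ThreadLocal<CharsetEncoder>() {
      @Override
      protected CharsetEncoder initialValue() {
        return UTF8.newEncoder();
      }
    };
{code}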

Tried out some micro / macro benchmarks, and switching these over to the String 
class's built-in encoding / decoding improves performance:
{code}
    @Override
    public String toStringUsingUTF8() {
      String ret;
      if (value.hasArray()) {
        try {
          // fast path: decode the backing array directly via the String constructor
          ret = new String(value.array(), value.arrayOffset() + offset, length, "UTF-8");
        } catch (UnsupportedEncodingException e) {
          throw new ParquetDecodingException("UTF-8 not supported", e);
        }
      } else {
        int limit = value.limit();
        value.limit(offset + length);
        int position = value.position();
        value.position(offset);
        // no corresponding interface to read a subset of a buffer, would have to slice it
        // which creates another ByteBuffer object or do what is done here to adjust the
        // limit/offset and set them back after
        ret = UTF8.decode(value).toString();
        value.limit(limit);
        value.position(position);
      }

      return ret;
    }
...
    private static ByteBuffer encodeUTF8(String value) {
      try {
        return ByteBuffer.wrap(value.getBytes("UTF-8"));
      } catch (UnsupportedEncodingException e) {
        throw new ParquetEncodingException("UTF-8 not supported.", e);
      }
    }
{code}
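
For reference, a minimal sketch of the kind of microbenchmark used to compare the two decode paths (assuming JMH; the class and method names here are illustrative, not the actual benchmark code):
{code}
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class Utf8DecodeBench {
  private static final Charset UTF8 = StandardCharsets.UTF_8;

  private ByteBuffer buffer;
  private int length;

  @Setup
  public void setup() {
    byte[] bytes = "some reasonably sized utf-8 string value".getBytes(UTF8);
    length = bytes.length;
    buffer = ByteBuffer.wrap(bytes);
  }

  @Benchmark
  public String decodeWithCharsetDecoder() {
    // Charset.decode() allocates an intermediate CharBuffer, and toString()
    // then copies it again into the resulting String.
    return UTF8.decode(buffer.duplicate()).toString();
  }

  @Benchmark
  public String decodeWithStringConstructor() {
    // The String constructor decodes the backing array directly, skipping
    // the intermediate CharBuffer.
    return new String(buffer.array(), buffer.arrayOffset(), length, UTF8);
  }
}
{code}
The win presumably comes from the String constructor avoiding the intermediate CharBuffer allocation and copy that Charset.decode() goes through; this only helps for heap buffers, hence the hasArray() check above.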


