[ https://issues.apache.org/jira/browse/PARQUET-642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348843#comment-15348843 ]
Piyush Narang commented on PARQUET-642:
---------------------------------------
Also tried out some benchmarks where I read / write a Thrift Parquet file:
{code}
Old Run:
Benchmark                                  Mode  Samples  Score    Error  Units
o.a.p.b.ThriftSignupAttempts.readFile_0   thrpt       50  0.467  ± 0.053  ops/s
o.a.p.b.ThriftSignupAttempts.writeFile_0  thrpt       50  2.152  ± 0.132  ops/s

New Run:
Benchmark                                  Mode  Samples  Score    Error  Units
o.a.p.b.ThriftSignupAttempts.readFile_0   thrpt       50  0.571  ± 0.047  ops/s
o.a.p.b.ThriftSignupAttempts.writeFile_0  thrpt       50  2.200  ± 0.113  ops/s
{code}
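Net of the error bars, that's roughly a 22% throughput gain on reads (0.571 / 0.467 ≈ 1.22) and about a 2% gain on writes (2.200 / 2.152 ≈ 1.02).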
{code}
// Imports added for completeness; MyStruct is the author's Thrift-generated
// class, and BenchmarkUtils is the parquet-benchmarks helper.
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.hadoop.thrift.ThriftParquetWriter;
import org.apache.parquet.thrift.ThriftParquetReader;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class TestParquetThriftReadWrite {
  public static final Path FILE_0 = new Path("parquet-perf/part-00000-m-00000.parquet");
  public static final Path WRITE_FILE_0 = new Path("parquet-perf/write-part-00000-m-00000.parquet");
  public static final int RECORD_COUNT_0 = 72048;

  // Read the input once up front so the write benchmark has records to write.
  public static List<MyStruct> structList = testRead(FILE_0);

  private static List<MyStruct> testRead(Path file) {
    List<MyStruct> structList = new ArrayList<MyStruct>(RECORD_COUNT_0);
    try {
      ThriftParquetReader<MyStruct> thriftParquetReader =
          new ThriftParquetReader<MyStruct>(file, MyStruct.class);
      // read() returns null at end of file; stop without adding the trailing null
      MyStruct row;
      while ((row = thriftParquetReader.read()) != null) {
        structList.add(row);
      }
      thriftParquetReader.close();
    } catch (Exception e) {
      e.printStackTrace();
    }
    return structList;
  }

  private static void testWrite(Path file, List<MyStruct> structList) throws Exception {
    ThriftParquetWriter<MyStruct> thriftParquetWriter = new ThriftParquetWriter<MyStruct>(
        file, MyStruct.class, CompressionCodecName.UNCOMPRESSED);
    for (MyStruct struct : structList) {
      thriftParquetWriter.write(struct);
    }
    thriftParquetWriter.close();
  }

  @Benchmark
  public void readFile_0(Blackhole blackhole) throws Exception {
    // consume the result so the JIT can't eliminate the read
    blackhole.consume(testRead(FILE_0));
  }

  @Benchmark
  public void writeFile_0(Blackhole blackhole) throws Exception {
    BenchmarkUtils.deleteIfExists(new Configuration(), WRITE_FILE_0);
    testWrite(WRITE_FILE_0, structList);
  }
}
{code}
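For completeness, here's a minimal sketch of how these benchmarks can be driven through the JMH runner API (the wrapper class name is made up; the parquet-benchmarks module may wire this up differently):
{code}
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunThriftBenchmarks {
  public static void main(String[] args) throws Exception {
    Options opts = new OptionsBuilder()
        .include(".*ThriftSignupAttempts.*") // match the benchmark class above
        .forks(1)
        .measurementIterations(50) // matches the 50 samples reported above
        .build();
    new Runner(opts).run();
  }
}
{code}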
> Improve performance of ByteBuffer based read / write paths
> ----------------------------------------------------------
>
> Key: PARQUET-642
> URL: https://issues.apache.org/jira/browse/PARQUET-642
> Project: Parquet
> Issue Type: Bug
> Reporter: Piyush Narang
>
> While trying out the newest Parquet version, we noticed that the changes to
> start using ByteBuffers:
> https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8
> and
> https://github.com/apache/parquet-mr/commit/6b24a1d1b5e2792a7821ad172a45e38d2b04f9b8
> (mostly Avro, but a couple of ByteBuffer changes) caused our jobs to slow down:
> Read overhead: 4-6% (in MB_Millis)
> Write overhead: 6-10% (in MB_Millis)
> This seems to be due to the encoding / decoding of Strings in the Binary class
> (https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java):
> - toStringUsingUTF8() for reads
> - encodeUTF8() for writes
> In those methods we're using the nio Charsets for encode / decode:
> {code}
> private static ByteBuffer encodeUTF8(CharSequence value) {
>   try {
>     return ENCODER.get().encode(CharBuffer.wrap(value));
>   } catch (CharacterCodingException e) {
>     throw new ParquetEncodingException("UTF-8 not supported.", e);
>   }
> }
> ...
> @Override
> public String toStringUsingUTF8() {
>   int limit = value.limit();
>   value.limit(offset + length);
>   int position = value.position();
>   value.position(offset);
>   // no corresponding interface to read a subset of a buffer; would have to
>   // slice it, which creates another ByteBuffer object, or do what is done
>   // here: adjust the limit/position and set them back after
>   String ret = UTF8.decode(value).toString();
>   value.limit(limit);
>   value.position(position);
>   return ret;
> }
> {code}
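> As an illustration (not part of the patch; class and variable names are made
> up), the two encode paths side by side produce identical bytes, so the
> change being proposed below is purely a performance swap:
> {code}
> import java.nio.ByteBuffer;
> import java.nio.CharBuffer;
> import java.nio.charset.CharsetEncoder;
> import java.nio.charset.StandardCharsets;
>
> public class Utf8EncodePaths {
>   public static void main(String[] args) throws Exception {
>     String value = "hello parquet";
>     // current path: an nio CharsetEncoder, as Binary uses today
>     CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
>     ByteBuffer viaEncoder = encoder.encode(CharBuffer.wrap(value));
>     // proposed path: String.getBytes
>     ByteBuffer viaString = ByteBuffer.wrap(value.getBytes("UTF-8"));
>     System.out.println(viaEncoder.equals(viaString)); // true: same bytes
>   }
> }
> {code}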
> Tried out some micro / macro benchmarks, and switching these over to the
> String class for encoding / decoding seems to improve performance:
> {code}
> @Override
> public String toStringUsingUTF8() {
>   String ret;
>   if (value.hasArray()) {
>     try {
>       ret = new String(value.array(), value.arrayOffset() + offset, length, "UTF-8");
>     } catch (UnsupportedEncodingException e) {
>       throw new ParquetDecodingException("UTF-8 not supported", e);
>     }
>   } else {
>     int limit = value.limit();
>     value.limit(offset + length);
>     int position = value.position();
>     value.position(offset);
>     // no corresponding interface to read a subset of a buffer; would have to
>     // slice it, which creates another ByteBuffer object, or do what is done
>     // here: adjust the limit/position and set them back after
>     ret = UTF8.decode(value).toString();
>     value.limit(limit);
>     value.position(position);
>   }
>   return ret;
> }
> ...
> private static ByteBuffer encodeUTF8(String value) {
>   try {
>     return ByteBuffer.wrap(value.getBytes("UTF-8"));
>   } catch (UnsupportedEncodingException e) {
>     throw new ParquetEncodingException("UTF-8 not supported.", e);
>   }
> }
> {code}
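> The hasArray() branch matters because only heap buffers expose a backing
> array; direct buffers keep taking the existing decode path (quick
> illustration, not from the patch):
> {code}
> ByteBuffer heap = ByteBuffer.wrap(new byte[16]);
> ByteBuffer direct = ByteBuffer.allocateDirect(16);
> System.out.println(heap.hasArray());   // true  -> new String(byte[], ...) fast path
> System.out.println(direct.hasArray()); // false -> falls back to UTF8.decode
> {code}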