[ https://issues.apache.org/jira/browse/PARQUET-642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348843#comment-15348843 ]
Piyush Narang commented on PARQUET-642:
---------------------------------------
Also tried out some benchmarks where I read / write a Thrift Parquet file:
{code}
Old Run:
Benchmark                                  Mode  Samples  Score    Error  Units
o.a.p.b.ThriftSignupAttempts.readFile_0   thrpt       50  0.467  ± 0.053  ops/s
o.a.p.b.ThriftSignupAttempts.writeFile_0  thrpt       50  2.152  ± 0.132  ops/s

New Run:
Benchmark                                  Mode  Samples  Score    Error  Units
o.a.p.b.ThriftSignupAttempts.readFile_0   thrpt       50  0.571  ± 0.047  ops/s
o.a.p.b.ThriftSignupAttempts.writeFile_0  thrpt       50  2.200  ± 0.113  ops/s
{code}
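Net of the error bars, that's roughly a 22% throughput gain on reads (0.571 / 0.467 ≈ 1.22) and about a 2% gain on writes (2.200 / 2.152 ≈ 1.02).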
{code}
// Imports added for completeness; MyStruct is the author's Thrift-generated
// class, and BenchmarkUtils is the parquet-benchmarks helper.
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.hadoop.thrift.ThriftParquetWriter;
import org.apache.parquet.thrift.ThriftParquetReader;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class TestParquetThriftReadWrite {
  public static final Path FILE_0 = new Path("parquet-perf/part-00000-m-00000.parquet");
  public static final Path WRITE_FILE_0 = new Path("parquet-perf/write-part-00000-m-00000.parquet");
  public static final int RECORD_COUNT_0 = 72048;

  // Read the input once up front so the write benchmark has records to write.
  public static List<MyStruct> structList = testRead(FILE_0);

  private static List<MyStruct> testRead(Path file) {
    List<MyStruct> structList = new ArrayList<MyStruct>(RECORD_COUNT_0);
    try {
      ThriftParquetReader<MyStruct> thriftParquetReader =
          new ThriftParquetReader<MyStruct>(file, MyStruct.class);
      // read() returns null at end of file; stop without adding the trailing null
      MyStruct row;
      while ((row = thriftParquetReader.read()) != null) {
        structList.add(row);
      }
      thriftParquetReader.close();
    } catch (Exception e) {
      e.printStackTrace();
    }
    return structList;
  }

  private static void testWrite(Path file, List<MyStruct> structList) throws Exception {
    ThriftParquetWriter<MyStruct> thriftParquetWriter = new ThriftParquetWriter<MyStruct>(
        file, MyStruct.class, CompressionCodecName.UNCOMPRESSED);
    for (MyStruct struct : structList) {
      thriftParquetWriter.write(struct);
    }
    thriftParquetWriter.close();
  }

  @Benchmark
  public void readFile_0(Blackhole blackhole) throws Exception {
    // consume the result so the JIT can't eliminate the read
    blackhole.consume(testRead(FILE_0));
  }

  @Benchmark
  public void writeFile_0(Blackhole blackhole) throws Exception {
    BenchmarkUtils.deleteIfExists(new Configuration(), WRITE_FILE_0);
    testWrite(WRITE_FILE_0, structList);
  }
}
{code}
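For completeness, here's a minimal sketch of how these benchmarks can be driven through the JMH runner API (the wrapper class name is made up; the parquet-benchmarks module may wire this up differently):
{code}
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunThriftBenchmarks {
  public static void main(String[] args) throws Exception {
    Options opts = new OptionsBuilder()
        .include(".*ThriftSignupAttempts.*") // match the benchmark class above
        .forks(1)
        .measurementIterations(50) // matches the 50 samples reported above
        .build();
    new Runner(opts).run();
  }
}
{code}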
> Improve performance of ByteBuffer based read / write paths
> ----------------------------------------------------------
>
> Key: PARQUET-642
> URL: https://issues.apache.org/jira/browse/PARQUET-642
> Project: Parquet
> Issue Type: Bug
> Reporter: Piyush Narang
>
> While trying out the newest Parquet version, we noticed that the changes to
> start using ByteBuffers:
> https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8
> and
> https://github.com/apache/parquet-mr/commit/6b24a1d1b5e2792a7821ad172a45e38d2b04f9b8
> (mostly Avro, but a couple of ByteBuffer changes) caused our jobs to slow down:
> Read overhead: 4-6% (in MB_Millis)
> Write overhead: 6-10% (in MB_Millis)
> This seems to be due to the encoding / decoding of Strings in the Binary class
> (https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java):
> - toStringUsingUTF8() for reads
> - encodeUTF8() for writes
> In those methods we're using the nio Charsets for encode / decode:
> {code}
> private static ByteBuffer encodeUTF8(CharSequence value) {
>   try {
>     return ENCODER.get().encode(CharBuffer.wrap(value));
>   } catch (CharacterCodingException e) {
>     throw new ParquetEncodingException("UTF-8 not supported.", e);
>   }
> }
> ...
> @Override
> public String toStringUsingUTF8() {
>   int limit = value.limit();
>   value.limit(offset + length);
>   int position = value.position();
>   value.position(offset);
>   // no corresponding interface to read a subset of a buffer; would have to
>   // slice it, which creates another ByteBuffer object, or do what is done
>   // here: adjust the limit/position and set them back after
>   String ret = UTF8.decode(value).toString();
>   value.limit(limit);
>   value.position(position);
>   return ret;
> }
> {code}
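> As an illustration (not part of the patch; class and variable names are made
> up), the two encode paths side by side produce identical bytes, so the
> change being proposed below is purely a performance swap:
> {code}
> import java.nio.ByteBuffer;
> import java.nio.CharBuffer;
> import java.nio.charset.CharsetEncoder;
> import java.nio.charset.StandardCharsets;
>
> public class Utf8EncodePaths {
>   public static void main(String[] args) throws Exception {
>     String value = "hello parquet";
>     // current path: an nio CharsetEncoder, as Binary uses today
>     CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
>     ByteBuffer viaEncoder = encoder.encode(CharBuffer.wrap(value));
>     // proposed path: String.getBytes
>     ByteBuffer viaString = ByteBuffer.wrap(value.getBytes("UTF-8"));
>     System.out.println(viaEncoder.equals(viaString)); // true: same bytes
>   }
> }
> {code}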
> Tried out some micro / macro benchmarks, and switching these over to the
> String class for encoding / decoding seems to improve performance:
> {code}
> @Override
> public String toStringUsingUTF8() {
>   String ret;
>   if (value.hasArray()) {
>     try {
>       ret = new String(value.array(), value.arrayOffset() + offset, length, "UTF-8");
>     } catch (UnsupportedEncodingException e) {
>       throw new ParquetDecodingException("UTF-8 not supported", e);
>     }
>   } else {
>     int limit = value.limit();
>     value.limit(offset + length);
>     int position = value.position();
>     value.position(offset);
>     // no corresponding interface to read a subset of a buffer; would have to
>     // slice it, which creates another ByteBuffer object, or do what is done
>     // here: adjust the limit/position and set them back after
>     ret = UTF8.decode(value).toString();
>     value.limit(limit);
>     value.position(position);
>   }
>   return ret;
> }
> ...
> private static ByteBuffer encodeUTF8(String value) {
>   try {
>     return ByteBuffer.wrap(value.getBytes("UTF-8"));
>   } catch (UnsupportedEncodingException e) {
>     throw new ParquetEncodingException("UTF-8 not supported.", e);
>   }
> }
> {code}
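> The hasArray() branch matters because only heap buffers expose a backing
> array; direct buffers keep taking the existing decode path (quick
> illustration, not from the patch):
> {code}
> ByteBuffer heap = ByteBuffer.wrap(new byte[16]);
> ByteBuffer direct = ByteBuffer.allocateDirect(16);
> System.out.println(heap.hasArray());   // true  -> new String(byte[], ...) fast path
> System.out.println(direct.hasArray()); // false -> falls back to UTF8.decode
> {code}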