Tobi Vollebregt created AVRO-2147:
-------------------------------------
Summary: Proto to Avro serialization is unnecessarily slow due to
repeated schema creation
Key: AVRO-2147
URL: https://issues.apache.org/jira/browse/AVRO-2147
Project: Avro
Issue Type: Improvement
Components: java
Affects Versions: 1.8.2, 1.8.1
Reporter: Tobi Vollebregt
Hi,
I discovered that proto to avro serialization is unnecessarily slow in certain
cases due to repeated schema creation. Specifically, this slowness shows when
serializing protocol buffer messages that contain nested protocol buffer
messages that contain enums with many possible values. Some profiling showed
this is due to the {{Schema}} objects for the nested message/enum not being
cached in this case.
An example that reproduces this is to add the following to {{test.proto}}:
{{message Foo {}}
{{ ...}}
{{ optional MessageWithLargeEnum bar = 21;}}
{{}}}
{{message MessageWithLargeEnum {}}
{{ optional LargeEnum enum = 1;}}
{{}}}
{{enum LargeEnum {}}
{{ AA = 1;}}
{{ AB = 2;}}
{{ AC = 3;}}
{{ ...}}
{{ ZZ = 676;}}
{{}}}
Then, a test like the following will exhibit the slow behavior:
{{@Test public void perf() throws Exception {}}
{{ Foo.Builder builder = Foo.newBuilder();}}
{{ builder.setInt32(0);}}
{{ builder.setInt64(2);}}
{{ builder.setUint32(3);}}
{{ builder.setUint64(4);}}
{{ builder.setSint32(5);}}
{{ builder.setSint64(6);}}
{{ builder.setFixed32(7);}}
{{ builder.setFixed64(8);}}
{{ builder.setSfixed32(9);}}
{{ builder.setSfixed64(10);}}
{{ builder.setFloat(1.0F);}}
{{ builder.setDouble(2.0);}}
{{ builder.setBool(true);}}
{{ builder.setString("foo");}}
{{ builder.setBytes(ByteString.copyFromUtf8("bar"));}}
{{ builder.setEnum(org.apache.avro.protobuf.Test.A.X);}}
{{ builder.addIntArray(27);}}
{{ builder.addSyms(org.apache.avro.protobuf.Test.A.Y);}}
{{ builder.setBar(MessageWithLargeEnum.newBuilder().setEnum(LargeEnum.AA));}}
{{ Foo objToConvert = builder.build();}}
{{ Schema schema = ProtobufData.get().getSchema(Foo.class);}}
{{ ByteArrayOutputStream bao = new ByteArrayOutputStream();}}
{{ Encoder e = EncoderFactory.get().binaryEncoder(bao, null);}}
{{ ProtobufDatumWriter<Foo> w = new ProtobufDatumWriter<Foo>(schema);}}
{{ GenericDatumReader gdr = new GenericDatumReader(schema, schema);}}
{{ BinaryDecoder d = null;}}
{{ long startTime = System.nanoTime();}}
{{ for (int i = 0; i < 1000000; ++i) {}}
{{ bao.reset();}}
{{ w.write(objToConvert, e);}}
{{ e.flush();}}
{{ d = DecoderFactory.get().binaryDecoder(bao.toByteArray(), d);}}
{{ gdr.read(null, d);}}
{{ }}}
{{ long endTime = System.nanoTime();}}
{{ System.out.println("Elapsed: " + (endTime - startTime) / 1000000 + " ms");}}
{{}}}
I will attach a patch that optimizes this.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)