[ 
https://issues.apache.org/jira/browse/AVRO-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tobi Vollebregt updated AVRO-2147:
----------------------------------
    Description: 
Hi,

I discovered that proto to avro serialization is unnecessarily slow in certain 
cases due to repeated schema creation. Specifically, this slowness shows when 
serializing protocol buffer messages that contain nested protocol buffer 
messages that contain enums with many possible values. Some profiling showed 
this is due to the {{Schema}} objects for the nested message/enum not being 
cached in this case.
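
To illustrate the general idea of caching derived schemas, here is a minimal, self-contained sketch. It is not the attached patch and does not use the Avro API: the class {{SchemaCacheSketch}}, its methods, and the string stand-in for a {{Schema}} are all hypothetical names chosen for this example. It only shows that memoizing an expensive per-type build, keyed by the type's full name, reduces a million lookups to a single build.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch: memoizes an expensive per-type build, keyed by full type name. */
public class SchemaCacheSketch {
  // A String stands in for a Schema object here (assumption for illustration only).
  private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
  private final AtomicInteger builds = new AtomicInteger();
  private final Function<String, String> builder;

  public SchemaCacheSketch(Function<String, String> builder) {
    this.builder = builder;
  }

  /** Returns the cached value for fullName, building it only on a cache miss. */
  public String getSchema(String fullName) {
    String schema = cache.get(fullName);
    if (schema == null) {
      // get/putIfAbsent (rather than computeIfAbsent) stays safe if the builder
      // re-enters getSchema for a nested type; a race may build twice but the
      // first stored value wins.
      schema = builder.apply(fullName);
      builds.incrementAndGet();
      String previous = cache.putIfAbsent(fullName, schema);
      if (previous != null) {
        schema = previous;
      }
    }
    return schema;
  }

  /** Number of times the builder function was actually invoked. */
  public int buildCount() {
    return builds.get();
  }

  /** Stand-in for an expensive build: enumerates the 676 symbols AA..ZZ. */
  public static String expensiveBuild(String name) {
    StringBuilder sb = new StringBuilder(name).append('[');
    for (char a = 'A'; a <= 'Z'; a++) {
      for (char b = 'A'; b <= 'Z'; b++) {
        sb.append(a).append(b).append(',');
      }
    }
    return sb.append(']').toString();
  }

  public static void main(String[] args) {
    SchemaCacheSketch cache = new SchemaCacheSketch(SchemaCacheSketch::expensiveBuild);
    for (int i = 0; i < 1000000; ++i) {
      cache.getSchema("LargeEnum");
    }
    System.out.println("builds: " + cache.buildCount()); // prints "builds: 1"
  }
}
```

Without the cache, the loop above would rebuild the 676-symbol structure on every iteration, which is the kind of repeated work the profiling pointed at.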

An example that reproduces this is to add the following to {{test.proto}}:

{{message Foo {}}
 {{  ...}}
 {{  optional MessageWithLargeEnum bar = 21;}}
 {{}}}
 {{message MessageWithLargeEnum {}}
 {{  optional LargeEnum enum = 1;}}
 {{}}}
 {{enum LargeEnum {}}
 {{  AA = 1;}}
 {{  AB = 2;}}
 {{  AC = 3;}}
 {{  ...}}
 {{  ZZ = 676;}}
 {{}}}

Then, a test like the following will exhibit the slow behavior:

{{@Test public void perf() throws Exception {}}
 {{  Foo.Builder builder = Foo.newBuilder();}}
 {{  builder.setInt32(0);}}
 {{  builder.setInt64(2);}}
 {{  builder.setUint32(3);}}
 {{  builder.setUint64(4);}}
 {{  builder.setSint32(5);}}
 {{  builder.setSint64(6);}}
 {{  builder.setFixed32(7);}}
 {{  builder.setFixed64(8);}}
 {{  builder.setSfixed32(9);}}
 {{  builder.setSfixed64(10);}}
 {{  builder.setFloat(1.0F);}}
 {{  builder.setDouble(2.0);}}
 {{  builder.setBool(true);}}
 {{  builder.setString("foo");}}
 {{  builder.setBytes(ByteString.copyFromUtf8("bar"));}}
 {{  builder.setEnum(org.apache.avro.protobuf.Test.A.X);}}
 {{  builder.addIntArray(27);}}
 {{  builder.addSyms(org.apache.avro.protobuf.Test.A.Y);}}
 {{  builder.setBar(MessageWithLargeEnum.newBuilder().setEnum(LargeEnum.AA));}}

{{  Foo objToConvert = builder.build();}}

{{  Schema schema = ProtobufData.get().getSchema(Foo.class);}}
 {{  ByteArrayOutputStream bao = new ByteArrayOutputStream();}}
 {{  Encoder e = EncoderFactory.get().binaryEncoder(bao, null);}}
 {{  ProtobufDatumWriter<Foo> w = new ProtobufDatumWriter<Foo>(schema);}}
 {{  GenericDatumReader gdr = new GenericDatumReader(schema, schema);}}
 {{  BinaryDecoder d = null;}}

{{  long startTime = System.nanoTime();}}
 {{  for (int i = 0; i < 1000000; ++i) {}}
 {{    bao.reset();}}
 {{    w.write(objToConvert, e);}}
 {{    e.flush();}}
 {{    d = DecoderFactory.get().binaryDecoder(bao.toByteArray(), d);}}
 {{    gdr.read(null, d);}}
 {{  }}}
 {{  long endTime = System.nanoTime();}}
 {{  System.out.println("Elapsed: " + (endTime - startTime) / 1000000 + " ms");}}
 {{}}}

I will attach a patch that optimizes this.

With the attached patch this test reports a runtime of about 4 seconds, while 
the runtime without the patch is 30+ seconds, so this is a 7.5-8x improvement 
for this particular test enum.

  was:
Hi,

I discovered that proto to avro serialization is unnecessarily slow in certain 
cases due to repeated schema creation. Specifically, this slowness shows when 
serializing protocol buffer messages that contain nested protocol buffer 
messages that contain enums with many possible values. Some profiling showed 
this is due to the {{Schema}} objects for the nested message/enum not being 
cached in this case.

An example that reproduces this is to add the following to {{test.proto}}:

{{message Foo {}}
 {{  ...}}
 {{  optional MessageWithLargeEnum bar = 21;}}
 {{}}}
 {{message MessageWithLargeEnum {}}
 {{  optional LargeEnum enum = 1;}}
 {{}}}
 {{enum LargeEnum {}}
 {{  AA = 1;}}
 {{  AB = 2;}}
 {{  AC = 3;}}
 {{  ...}}
 {{  ZZ = 676;}}
 {{}}}

Then, a test like the following will exhibit the slow behavior:

{{@Test public void perf() throws Exception {}}
 {{  Foo.Builder builder = Foo.newBuilder();}}
 {{  builder.setInt32(0);}}
 {{  builder.setInt64(2);}}
 {{  builder.setUint32(3);}}
 {{  builder.setUint64(4);}}
 {{  builder.setSint32(5);}}
 {{  builder.setSint64(6);}}
 {{  builder.setFixed32(7);}}
 {{  builder.setFixed64(8);}}
 {{  builder.setSfixed32(9);}}
 {{  builder.setSfixed64(10);}}
 {{  builder.setFloat(1.0F);}}
 {{  builder.setDouble(2.0);}}
 {{  builder.setBool(true);}}
 {{  builder.setString("foo");}}
 {{  builder.setBytes(ByteString.copyFromUtf8("bar"));}}
 {{  builder.setEnum(org.apache.avro.protobuf.Test.A.X);}}
 {{  builder.addIntArray(27);}}
 {{  builder.addSyms(org.apache.avro.protobuf.Test.A.Y);}}
 {{   builder.setBar(MessageWithLargeEnum.newBuilder().setEnum(LargeEnum.AA));}}

{{  Foo objToConvert = builder.build();}}

{{  Schema schema = ProtobufData.get().getSchema(Foo.class);}}
 {{  ByteArrayOutputStream bao = new ByteArrayOutputStream();}}
 {{  Encoder e = EncoderFactory.get().binaryEncoder(bao, null);}}
 {{  ProtobufDatumWriter<Foo> w = new ProtobufDatumWriter<Foo>(schema);}}
 {{  GenericDatumReader gdr = new GenericDatumReader(schema, schema);}}
 {{  BinaryDecoder d = null;}}

{{  long startTime = System.nanoTime();}}
{{  for (int i = 0; i < 1000000; ++i) {}}
{{    bao.reset();}}
{{    w.write(objToConvert, e);}}
{{    e.flush();}}
{{    d = DecoderFactory.get().binaryDecoder(bao.toByteArray(), d);}}
{{     gdr.read(null, d);}}
{{  }}}
{{  long endTime = System.nanoTime();}}
{{  System.out.println("Elapsed: " + (endTime - startTime) / 1000000 + " ms");}}
{{}}}

I will attach a patch that optimizes this.


> Proto to Avro serialization is unnecessarily slow due to repeated schema 
> creation
> ---------------------------------------------------------------------------------
>
>                 Key: AVRO-2147
>                 URL: https://issues.apache.org/jira/browse/AVRO-2147
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.8.1, 1.8.2
>            Reporter: Tobi Vollebregt
>            Priority: Major
>              Labels: java, optimization, protobuf
>         Attachments: AVRO-2147.patch
>
>
> Hi,
> I discovered that proto to avro serialization is unnecessarily slow in 
> certain cases due to repeated schema creation. Specifically, this slowness 
> shows when serializing protocol buffer messages that contain nested protocol 
> buffer messages that contain enums with many possible values. Some profiling 
> showed this is due to the {{Schema}} objects for the nested message/enum not 
> being cached in this case.
> An example that reproduces this is to add the following to {{test.proto}}:
> {{message Foo {}}
>  {{  ...}}
>  {{  optional MessageWithLargeEnum bar = 21;}}
>  {{}}}
>  {{message MessageWithLargeEnum {}}
>  {{  optional LargeEnum enum = 1;}}
>  {{}}}
>  {{enum LargeEnum {}}
>  {{  AA = 1;}}
>  {{  AB = 2;}}
>  {{  AC = 3;}}
>  {{  ...}}
>  {{  ZZ = 676;}}
>  {{}}}
> Then, a test like the following will exhibit the slow behavior:
> {{@Test public void perf() throws Exception {}}
>  {{  Foo.Builder builder = Foo.newBuilder();}}
>  {{  builder.setInt32(0);}}
>  {{  builder.setInt64(2);}}
>  {{  builder.setUint32(3);}}
>  {{  builder.setUint64(4);}}
>  {{  builder.setSint32(5);}}
>  {{  builder.setSint64(6);}}
>  {{  builder.setFixed32(7);}}
>  {{  builder.setFixed64(8);}}
>  {{  builder.setSfixed32(9);}}
>  {{  builder.setSfixed64(10);}}
>  {{  builder.setFloat(1.0F);}}
>  {{  builder.setDouble(2.0);}}
>  {{  builder.setBool(true);}}
>  {{  builder.setString("foo");}}
>  {{  builder.setBytes(ByteString.copyFromUtf8("bar"));}}
>  {{  builder.setEnum(org.apache.avro.protobuf.Test.A.X);}}
>  {{  builder.addIntArray(27);}}
>  {{  builder.addSyms(org.apache.avro.protobuf.Test.A.Y);}}
>  {{  builder.setBar(MessageWithLargeEnum.newBuilder().setEnum(LargeEnum.AA));}}
> {{  Foo objToConvert = builder.build();}}
> {{  Schema schema = ProtobufData.get().getSchema(Foo.class);}}
>  {{  ByteArrayOutputStream bao = new ByteArrayOutputStream();}}
>  {{  Encoder e = EncoderFactory.get().binaryEncoder(bao, null);}}
>  {{  ProtobufDatumWriter<Foo> w = new ProtobufDatumWriter<Foo>(schema);}}
>  {{  GenericDatumReader gdr = new GenericDatumReader(schema, schema);}}
>  {{  BinaryDecoder d = null;}}
> {{  long startTime = System.nanoTime();}}
>  {{  for (int i = 0; i < 1000000; ++i) {}}
>  {{    bao.reset();}}
>  {{    w.write(objToConvert, e);}}
>  {{    e.flush();}}
>  {{    d = DecoderFactory.get().binaryDecoder(bao.toByteArray(), d);}}
>  {{    gdr.read(null, d);}}
>  {{  }}}
>  {{  long endTime = System.nanoTime();}}
>  {{  System.out.println("Elapsed: " + (endTime - startTime) / 1000000 + " ms");}}
>  {{}}}
> I will attach a patch that optimizes this.
> With the attached patch this test reports a runtime of about 4 seconds, while 
> the runtime without the patch is 30+ seconds, so this is a 7.5-8x 
> improvement for this particular test enum.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
