Well, the code looks very inefficient and is not a good test of how Avro is 
recommended to be used.

My hunch is that the test is not measuring serialization / deserialization; it's 
measuring the cost of instantiating and initializing various objects.  Many of 
these are very expensive since they parse the string schema -- much like a 
regular expression in Java, you want to compile the expression once and run it 
many times -- in Avro you want to initialize the DatumWriters and DatumReaders 
for a Schema once and use them many times.
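
For example, a minimal sketch of that initialize-once, use-many-times pattern, 
reusing the names from the patch (ARangeSliceCommand / acmds are the patch's own 
identifiers) and the Avro 1.4-style BinaryEncoder constructor:

SpecificDatumWriter<ARangeSliceCommand> writer =
    new SpecificDatumWriter<ARangeSliceCommand>(ARangeSliceCommand.SCHEMA$); // expensive, do once
ByteArrayOutputStream bout = new ByteArrayOutputStream();
BinaryEncoder enc = new BinaryEncoder(bout);

for (ARangeSliceCommand cmd : acmds) {
  writer.write(cmd, enc); // cheap, do many times; no schema written per record
}
enc.flush();
byte[] buf = bout.toByteArray();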

In the benchmark I recommend using System.nanoTime(), since currentTimeMillis() 
is not accurate (it's essentially +/- 7ms on most OS's).
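
Something like this for the timing harness (just a sketch):

long start = System.nanoTime();
// ... run the serialization / deserialization loop ...
long elapsed = System.nanoTime() - start;
System.out.println("elapsed: " + (elapsed / 1000000.0) + " ms");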

The encoding loop in the patch looks like:

+        for (int j = 0; j < mainLoop; ++j) {
+               SpecificDatumWriter writer = new SpecificDatumWriter(ARangeSliceCommand.SCHEMA$);
+               ByteArrayOutputStream bout = new ByteArrayOutputStream();
+               BinaryEncoder enc = new BinaryEncoder(bout);
+               for (ARangeSliceCommand cmd : acmds)
+               {
+                       enc.writeString(new Utf8(cmd.getSchema().toString()));
+                       writer.write(cmd, enc);
+               }
+            bout.close();
+            buf = bout.toByteArray();
+            writer = null;
+            bout = null;
+            enc = null;


Honestly, without looking too far, I think this is benchmarking the toString() 
method of Schema.  I think all objects in acmds are the same, with the same 
schema, so that whole Utf8 instantiation can move out of that loop and be done 
once.  Furthermore, why is the schema being written for each record?  Avro is 
designed to avoid that -- store it once somewhere and then write multiple 
records with that schema.
The patch doesn't make it easy to see how big 'mainLoop' is compared to the 
size of 'acmds'; if 'mainLoop' is large and 'acmds' is small, then datum writer 
creation is measured heavily too.

You need to cache the SpecificDatumWriter and SpecificDatumReader and re-use 
them, one per schema.  These are meant to be re-used and kept per record type.
Caching the BinaryEncoder and BinaryDecoder will help too if 'mainLoop' is not 
small.  "new BinaryDecoder()" is deprecated; use the DecoderFactory and ideally 
re-use the instance when possible.
A ThreadLocal WeakHashMap (or WeakIdentityHashMap) of Class >> 
SpecificDatumWriter would work if you are concerned about thread safety or 
object lifetimes.
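
A sketch of what such a per-thread cache could look like -- writerFor() is a 
made-up helper name, and it assumes the SpecificDatumWriter(Class) constructor 
(otherwise build the writer from the generated class's SCHEMA$):

private static final ThreadLocal<Map<Class<?>, SpecificDatumWriter<?>>> WRITERS =
    new ThreadLocal<Map<Class<?>, SpecificDatumWriter<?>>>() {
      @Override
      protected Map<Class<?>, SpecificDatumWriter<?>> initialValue() {
        // weak keys so the cache does not pin generated record classes
        return new WeakHashMap<Class<?>, SpecificDatumWriter<?>>();
      }
    };

@SuppressWarnings("unchecked")
static <T extends SpecificRecord> SpecificDatumWriter<T> writerFor(Class<T> type) {
  Map<Class<?>, SpecificDatumWriter<?>> cache = WRITERS.get();
  SpecificDatumWriter<?> writer = cache.get(type);
  if (writer == null) {
    writer = new SpecificDatumWriter<T>(type); // schema handling cost paid once per class per thread
    cache.put(type, writer);
  }
  return (SpecificDatumWriter<T>) writer;
}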

The decoder loop is:
+        for (int j = 0; j < mainLoop; ++j) {
+               BinaryDecoder dec = new BinaryDecoder(new ByteArrayInputStream(buf));
+               SpecificDatumReader<ARangeSliceCommand> reader = new SpecificDatumReader<ARangeSliceCommand>(ARangeSliceCommand.SCHEMA$);
+               for (int i = 0; i < acmds.size(); i++)
+               {
+                       reader.setSchema(Schema.parse(dec.readString(new Utf8()).toString()));
+                       reader.read(new ARangeSliceCommand(), dec);
+               }

First of all, dec.readString() happily takes a null; either pass in a Utf8 you 
intend to reuse -- one created outside of the loop, for example -- or pass in 
null.
Next, this test is testing Schema.parse() performance.  This should be outside 
the loop.  The schema of the written data should be kept once, not stored with 
each record.  Storing the schema with every record is an Avro anti-pattern.  If 
you are storing a schema with every serialized record, you are either using 
Avro wrong or have a use case that is not suitable for Avro at all.  Avro has a 
verbose JSON schema and a very compact binary record representation.  When a 
record type is serialized repeatedly, the schema need only be persisted once, 
and the records with unique data persisted compactly.  There are many ways to 
do this, and they differ depending on the use case.  Think about an RDBMS -- 
the equivalent of a table definition is not stored with each row.  Storing an 
Avro schema with each Avro record is like storing a table definition with each 
row.  If you have schemas that change over time, you need to store only one 
schema per change somewhere, and have a mechanism to know which groups of 
records have which schema.  The canonical Avro example is the Avro Data File -- 
all records in it have the same schema, and the schema is stored in the header 
of the file.
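
As a sketch (the file name and loop are illustrative, not from the patch), the 
data file API stores the schema once in the header no matter how many records 
follow:

// writing: the schema goes into the file header exactly once
DataFileWriter<ARangeSliceCommand> fileWriter =
    new DataFileWriter<ARangeSliceCommand>(
        new SpecificDatumWriter<ARangeSliceCommand>(ARangeSliceCommand.SCHEMA$));
fileWriter.create(ARangeSliceCommand.SCHEMA$, new File("commands.avro"));
for (ARangeSliceCommand cmd : acmds) {
  fileWriter.append(cmd); // record bytes only, no per-record schema
}
fileWriter.close();

// reading: the schema is picked up from the header
DataFileReader<ARangeSliceCommand> fileReader =
    new DataFileReader<ARangeSliceCommand>(new File("commands.avro"),
        new SpecificDatumReader<ARangeSliceCommand>());
while (fileReader.hasNext()) {
  ARangeSliceCommand cmd = fileReader.next();
  // ... use cmd ...
}
fileReader.close();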

The decoder bit should look more like:

BinaryDecoder dec = DecoderFactory.defaultFactory().createBinaryDecoder(buf, null);
SpecificDatumReader<ARangeSliceCommand> reader =
    threadLocalReaders.get(ARangeSliceCommand.class); // a Class >> SpecificDatumReader mapping
reader.setSchema(Schema.parse(dec.readString(null).toString())); // schema stored once
ARangeSliceCommand record = new ARangeSliceCommand();
for (int j = 0; j < mainLoop; ++j) {
  for (int i = 0; i < acmds.size(); i++) {
    reader.read(record, dec); // or pass in null if there is no object to reuse; reusing is faster
  }
}


And that would be very fast, and additionally keep very few bytes per record on 
average, assuming mainLoop * acmds.size() is large (at least a thousand).


On Nov 27, 2010, at 10:59 AM, Eric Evans wrote:

> https://issues.apache.org/jira/browse/CASSANDRA-1765
