Well, the code looks very inefficient and is not a good test of the recommended way to use Avro.
My hunch is that the test is not measuring serialization / deserialization; it's
measuring the cost of instantiating and initializing various objects. Many of
these are very expensive since they parse the string schema. Much like a
regular expression in Java, which you want to compile once and run many times,
in Avro you want to initialize the DatumWriters and DatumReaders for a Schema
once and use them many times.
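For example, something along these lines (a rough sketch only; the static fields are illustrative, not from the patch):

    // Construct once per record type and keep around:
    static final SpecificDatumWriter<ARangeSliceCommand> WRITER =
        new SpecificDatumWriter<ARangeSliceCommand>(ARangeSliceCommand.SCHEMA$);
    static final SpecificDatumReader<ARangeSliceCommand> READER =
        new SpecificDatumReader<ARangeSliceCommand>(ARangeSliceCommand.SCHEMA$);
    // ... then re-use WRITER/READER for every record of that type.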
In the benchmark I recommend using System.nanoTime(), since currentTimeMillis()
is not accurate (it's essentially +- 7ms on most OSes).
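E.g. (sketch):

    long start = System.nanoTime();
    // ... run the full serialization loop ...
    long elapsedMillis = (System.nanoTime() - start) / 1000000L;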
The encoding loop in the patch looks like:
+    for (int j = 0; j < mainLoop; ++j) {
+        SpecificDatumWriter writer = new SpecificDatumWriter(ARangeSliceCommand.SCHEMA$);
+        ByteArrayOutputStream bout = new ByteArrayOutputStream();
+        BinaryEncoder enc = new BinaryEncoder(bout);
+        for (ARangeSliceCommand cmd : acmds)
+        {
+            enc.writeString(new Utf8(cmd.getSchema().toString()));
+            writer.write(cmd, enc);
+        }
+        bout.close();
+        buf = bout.toByteArray();
+        writer = null;
+        bout = null;
+        enc = null;
Honestly, without looking too far, I think this is benchmarking the toString()
method of Schema. I think all objects in acmds are the same, with the same
schema, so that whole Utf8 instantiation can move out of the loop and be done
once. Furthermore, why is the schema being written for each record? Avro is
designed to avoid that: store the schema once somewhere and then write multiple
records with that schema.
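A sketch of what the encoding loop could look like with those changes (names are taken from the patch; whether the schema string belongs in the output at all depends on the use case):

    SpecificDatumWriter<ARangeSliceCommand> writer =
        new SpecificDatumWriter<ARangeSliceCommand>(ARangeSliceCommand.SCHEMA$);
    ByteArrayOutputStream bout = new ByteArrayOutputStream();
    BinaryEncoder enc = new BinaryEncoder(bout);
    enc.writeString(new Utf8(ARangeSliceCommand.SCHEMA$.toString()));  // schema written once
    for (int j = 0; j < mainLoop; ++j) {
        for (ARangeSliceCommand cmd : acmds) {
            writer.write(cmd, enc);
        }
    }
    enc.flush();
    bout.close();
    buf = bout.toByteArray();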
The patch doesn't make it easy to see how big 'mainLoop' is compared to the
size of 'acmds'; if it is large and 'acmds' is small, then datum writer creation
is measured heavily too.
You need to cache the SpecificDatumWriter and SpecificDatumReader and re-use them
for each schema. These are meant to be re-used and kept per record type.
Caching the BinaryEncoder and BinaryDecoder will help too if 'mainLoop' is not
small. "new BinaryDecoder()" is deprecated; use the DecoderFactory and ideally
re-use the instance when possible.
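For example (sketch):

    BinaryDecoder dec = null;
    for (int j = 0; j < mainLoop; ++j) {
        // passing the previous instance back in lets the factory re-use it
        dec = DecoderFactory.defaultFactory().createBinaryDecoder(buf, dec);
        // ... read records from dec ...
    }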
A ThreadLocal WeakHashMap (or WeakIdentityHashMap) of Class ->
SpecificDatumWriter would work if you are concerned about thread safety or the
object lifetimes.
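Something like this would do it (a sketch, not from the patch; WRITERS and writerFor are made-up names):

    // One writer cache per thread, weakly keyed by the record class.
    static final ThreadLocal<Map<Class<?>, SpecificDatumWriter<?>>> WRITERS =
        new ThreadLocal<Map<Class<?>, SpecificDatumWriter<?>>>() {
            @Override protected Map<Class<?>, SpecificDatumWriter<?>> initialValue() {
                return new WeakHashMap<Class<?>, SpecificDatumWriter<?>>();
            }
        };

    @SuppressWarnings("unchecked")
    static <T> SpecificDatumWriter<T> writerFor(Class<T> c, Schema schema) {
        Map<Class<?>, SpecificDatumWriter<?>> cache = WRITERS.get();
        SpecificDatumWriter<T> w = (SpecificDatumWriter<T>) cache.get(c);
        if (w == null) {
            w = new SpecificDatumWriter<T>(schema);
            cache.put(c, w);
        }
        return w;
    }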
The decoder loop is:
+    for (int j = 0; j < mainLoop; ++j) {
+        BinaryDecoder dec = new BinaryDecoder(new ByteArrayInputStream(buf));
+        SpecificDatumReader<ARangeSliceCommand> reader =
+            new SpecificDatumReader<ARangeSliceCommand>(ARangeSliceCommand.SCHEMA$);
+        for (int i = 0; i < acmds.size(); i++)
+        {
+            reader.setSchema(Schema.parse(dec.readString(new Utf8()).toString()));
+            reader.read(new ARangeSliceCommand(), dec);
+        }
First of all, dec.readString() happily takes null; either pass in a Utf8 you
intend to reuse (one created outside of the loop, for example), or pass in
null.
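E.g. (sketch):

    Utf8 scratch = new Utf8();               // created once, outside the loop
    for (int i = 0; i < acmds.size(); i++) {
        Utf8 s = dec.readString(scratch);    // re-uses scratch; dec.readString(null) works too
    }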
Next, this test is testing Schema.parse() performance. This should be outside
the loop. The schema of the written data should be kept once, not stored
once per record. Storing the schema with each record is an Avro
anti-pattern. If you are storing a schema with every serialized record, you are
either using Avro wrong or have a use case that is not suitable for Avro at
all. Avro has a verbose JSON schema and a very compact binary record
representation. When a record type is serialized repeatedly, the schema need
only be persisted once, and the records with unique data persisted compactly.
There are many ways to do this that differ depending on the use case. Think
about an RDBMS: the equivalent of a table definition is not stored with each
row. Storing an Avro schema with each Avro record is like storing a table
definition with each row. If you have changing schemas over time, you need to
store only one schema per change somewhere, and have a mechanism to know which
groups of records have which schemas. The canonical Avro example is the Avro
Data File: all records in it have the same schema, and the schema is stored in
the header of the file.
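For reference, the data file API does exactly that (sketch; the file name is made up):

    // Write: the schema goes into the file header exactly once.
    DataFileWriter<ARangeSliceCommand> dfw = new DataFileWriter<ARangeSliceCommand>(
        new SpecificDatumWriter<ARangeSliceCommand>(ARangeSliceCommand.SCHEMA$));
    dfw.create(ARangeSliceCommand.SCHEMA$, new File("commands.avro"));
    for (ARangeSliceCommand cmd : acmds)
        dfw.append(cmd);
    dfw.close();

    // Read: the schema comes from the header, not from each record.
    DataFileReader<ARangeSliceCommand> dfr = new DataFileReader<ARangeSliceCommand>(
        new File("commands.avro"), new SpecificDatumReader<ARangeSliceCommand>());
    while (dfr.hasNext())
        dfr.next();
    dfr.close();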
The decoder bit should look more like:
    BinaryDecoder dec = DecoderFactory.defaultFactory().createBinaryDecoder(buf, null);
    SpecificDatumReader<ARangeSliceCommand> reader =
        threadLocalReaders.get(ARangeSliceCommand.class);  // a Class -> SpecificDatumReader mapping
    reader.setSchema(Schema.parse(dec.readString(null).toString()));  // schema stored once
    ARangeSliceCommand record = new ARangeSliceCommand();
    for (int j = 0; j < mainLoop; ++j) {
        for (int i = 0; i < acmds.size(); i++) {
            reader.read(record, dec);  // or pass in null if there is no object to reuse; reusing is faster
        }
    }
And that would be very fast, and would additionally keep very few bytes per record on
average, assuming mainLoop * acmds.size() is large (at least a thousand).
On Nov 27, 2010, at 10:59 AM, Eric Evans wrote:
> https://issues.apache.org/jira/browse/CASSANDRA-1765