Re: codec updates

Rafael Schloming Wed, 28 May 2014 07:03:59 -0700

On Tue, May 27, 2014 at 5:15 PM, Robbie Gemmell <[email protected]>wrote:


> I haven't had time to run any of this or look through it properly yet, but
> after a very very quick scroll through...
>
> - Should we not use LinkedHashMap instances, due to the ordering
> restriction on equality checks for amqp maps?
>

Probably, although I should say that I didn't intend for POJOBuilder to
retain 100% fidelity of AMQP type information, but rather to transform the
data stream into something that is natural to process in Java. The idea
being that this gives you a high degree of convenience (as well as avoiding
overhead) most of the time when full fidelity shouldn't matter.

As a general note, the overall design is intended to make transformation
efficient and relatively easy, so it should be possible to go directly from
the input byte stream to the domain specific objects you want to work with
in your program without unnecessary intermediate representations. This also
means that there isn't one single canonical transformer, but multiple
transformers depending on what you want to do. For example an application
might use a POJO style transformer or even a domain specific transformer
for decoding and processing an application message, but a broker might use
a transformer that goes directly from input message bytes to output message
bytes (modulo updated ttl), or the protocol engine might use a transformer
that directly executes performatives.


> - It looked like it only uses list32, array32, string32 for encoding...do
> you envisage supporting the others?
>

Yes, however there is a time/space-on-the-wire trade-off here. One of the
reasons this codec is faster on encode than the existing codec is that it
doesn't attempt to compute the encoded size of a given value in advance. It
simply reserves the space on the wire and then goes and backfills it when
it is done encoding the value. This alone turns into something close to a
factor of two in speedup because computing the size has approximately the
same cost as performing the encoding. The trade-off of course is that it
doesn't know in advance what width to reserve for the size and count, so it
needs to assume the wider width. There may be some clever things we can do
here, e.g. there may be an inexpensive way to make a mostly accurate guess
and then shift stuff around if it turns out to be wrong, however those
kinds of things would require a more representative sample of data to work
with in order to validate.

Ultimately I suspect what is optimal for speed may never be optimal for
size, so the fallback here would be to simply provide two flavors of
implementation of the Encoder interface, one tailored for speed, and one
tailored for space-on-the-wire.


> - Glancing at the example, this made me do a double take - is that an array
> of string with a url in it, or somehow an array of url?
>   (I guess I need to look at what the code actually does with this :P)
>
>         enc.putArray(Type.STRING);
>         enc.putDescriptor();
>         enc.putSymbol("url");
>         enc.putString("http://one";);
>         enc.putString("http://two";);
>         enc.putString("http://three";);
>         enc.end();
>

This is actually an array of described strings where the descriptor is the
symbol "url".


>
> - For the maps, it feels a little weird adding the keys and values
> independently, though I guess thats to let us away without additional
> map-specific methods?
>
>         enc.putMap();
>         enc.putString("key");
>         enc.putString("value");
>         enc.putString("pi");
>         enc.putDouble(3.14159265359);
>         enc.end();
>

I'm not sure how map specific methods would work. I can't think of a way to
do it offhand that wouldn't result in a fairly ugly combinatorial explosion
of signatures. I think there are two key things about this aspect of the
design. First, the API deals in stuff that is presented in serial order, so
the implementation can simply directly encode values onto the wire when
they are presented and only needs to keep around minimal internal state.
Secondly, the API is designed to be composable, e.g. imagine you have some
DomainType that knows how to serialize itself. With this interface you can
do things like:

  enc.putMap();
  enc.putString("key");
  domainType.encodeYourself(enc);
  enc.putString("next-key");
  enc.putString("next-value");
  enc.end();

You can also do:

  enc.putList();
  ...
  domainType.encodeYourself(enc);
  ...
  enc.end();

The general idea is to allow domain specific encoding pretty much the same
way that people write toString() methods, and so you need a given value to
render the same way regardless of whether it appears in a list or in a map
or even as a key in a map vs a value in a map.

I think this model becomes more natural if you think of this as an API for
expressing map/list literals rather than as an API for constructing
map/list objects, e.g.:

 { "key": "value", "pi": 3.14159265359 } -> enc.putMap();
enc.putString("key"); enc.putString("value"); enc.putString("pi");
enc.putDouble(3.14159265359); enc.end();



> - Out of interest, did you try larger heap sizes to see how much difference
> (if any) there was in their relative performance?
>

I did, and it didn't seem to make a difference. I think isolating/measuring
the overhead of "garbage" is likely to be pretty tricky outside of a real
application. In these micro benchmarks there is very little live data at
any given time, so the cost of traversing the entire set of live objects is
likely quite cheap compared to what it would be in a real application with
a larger set of live objects. Hypothetically I suppose we could try
creating artificial data sets, but I didn't bother.


> - Well into personal preference terriroty, can't say I'm a huge fan of the
> decoder converting nulls/booleans/etc into integers (albeit, if asked to).
>

That isn't intended to be the one and only way of implementing the decoder
interface. But I definitely wanted the design to be able to support that
level of coersion/conversion without significant penalties. I do believe it
is a very important use case because when you are automatically serializing
based on introspection and you're attempting to interop between languages
with different type systems, it's really kind of a crap shoot when it comes
to what type you will get. Javascript doesn't distinguish between floating
point and integral types, perl doesn't always distinguish well between
numbers and strings-that-look-like-numbers, java doesn't have unsigned
types, python doesn't really do fixed width types, so pretty much every
language that serializes based on introspection has the potential to encode
a number into an AMQP type that some other language doesn't natively
understand. So for reasonable interop based on introspective encode/decode
I think we really need to follow the robustness principal, i.e. be precise
in how we encode types and be liberal in how we decode them. The level of
conversion I put in is really geared at ensuring that we can be
sufficiently liberal. For something like the protocol engine which wouldn't
be doing encode/decode based on introspection, but instead based on a
shared schema, we could certainly use a stricter implementation of the
decoder interface that doesn't perform the same conversions.

--Rafael

Re: codec updates

Reply via email to