I have been involved in a big Avro prototyping phase recently (closing in on implementation rapidly). I'm sure I'll be contributing to its improvement myself down the road, but the first task is to use it.
My current largest *.avsc file is 3800 bytes and growing. I'm primarily using the Specific API on trunk (looking forward to 1.3!), and looking at writing quick-and-dirty wrappers for Pig, Cascading, and M/R to read/write just these static types until general solutions are available or we have time to make one. Various topics are on my mind.

* Tutorial

The tutorial only works with protocols. I want schemas and probably won't touch protocols for a long time. I had to make my own Maven plugin that handles both; maybe there should be an official avro-maven-plugin. Building out a broader tutorial would help developers get up to speed a lot faster, and more complicated sample schemas in the test suite would be useful too.

* Naming, JSON schema definition, and schema migration

A big concern of mine is long-term schema migration, and how that relates to how I design my schemas. Unions quickly become necessary, but their restrictions as discussed in https://issues.apache.org/jira/browse/AVRO-248 are both good and bad. I ran into one case where I wanted:

{"name": "ipAddr", "type": [
    {"name": "IPv4", "type": "fixed", "size": 4},
    {"name": "IPv6", "type": "fixed", "size": 16}
]}

This won't work, even though these have their own names and generate their own classes. I suppose it is trickier outside of the specific compiler and Java. I could make the union one of records that wrap these two, but then the in-memory representation of this data would be inefficient. I settled on a variable-length "bytes", as it is just as efficient serialized and more efficient in memory; wrapper classes will have to enforce the sizes. Writing wrapper classes around unions right now is annoying -- there's no way I'm letting client code cast stuff, so I'm using inheritance, factories, and generics to get something off the ground that encapsulates the usage of Avro and doesn't expose serialization details.
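For illustration, the two workarounds look roughly like this (the wrapper record and field names here are hypothetical, just to sketch the shape):

```json
{"name": "ipAddr", "type": [
    {"type": "record", "name": "IPv4Wrapper", "fields": [
        {"name": "addr", "type": {"type": "fixed", "name": "IPv4", "size": 4}}
    ]},
    {"type": "record", "name": "IPv6Wrapper", "fields": [
        {"name": "addr", "type": {"type": "fixed", "name": "IPv6", "size": 16}}
    ]}
]}
```

versus the version I settled on, where the size check lives in hand-written wrapper code instead of the schema:

```json
{"name": "ipAddr", "type": "bytes"}
```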
I was also confused at first about naming, and how much of it seems unnecessary from a "just look at the text" view. For example:

{"name": "Thing", "type": "record", "fields": [
    {"name": "foo", "type": "string"},
    {"name": "bars", "type": "array", "items": "int"}
]}

is invalid, but

{"name": "Thing", "type": "record", "fields": [
    {"name": "foo", "type": "string"},
    {"name": "bars", "type": {"type": "array", "items": "int"}}
]}

is, even though from a human-readable point of view the extra information is redundant ("I told it it was an array of ints, but it demands to be an array of ints!" :D). I understand what it is doing and why, for the most part, but the spec didn't make it clear that some things CAN'T have names, even when they appear where a name is required (fields must have names, arrays can't have names, therefore arrays can't be fields? -- half true). Some examples with counter-examples would be a plus. Also, the error message from the first snippet above is very confusing for someone new: "invalid name 'array'" -- um, I named it "bars"; "array" is the type. I think the first snippet could be shorthand for the second, although any ambiguity is bad for making schema resolution robust. Oddly, adding naming to unions, arrays, etc. has the potential to _reduce_ the verbosity of the JSON: yes, one would "have to" name the array, but one wouldn't be forced to create an anonymous record inside a field, since fields must be named and almost everything in a *.avsc is a field.

All of this, including AVRO-248, makes me a bit concerned about data migration. If I have a data file serialized with a schema where unions are unnamed, and then later upgrade, how am I to resolve those? Should an Avro schema always contain a "version": "1.3" or something similar in the record definition? Like namespaces, it could be assumed that the version propagates down to children unless overridden. I will want future code to be able to read old schemas, or at minimum break very reliably.
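To sketch the versioning idea: "version" is not an attribute Avro defines, and nothing would enforce it today, but if the parser preserves unknown attributes as schema properties, something like this could ride along with the record definition:

```json
{"type": "record", "name": "Thing", "version": "1.3", "fields": [
    {"name": "foo", "type": "string"},
    {"name": "bars", "type": {"type": "array", "items": "int"}}
]}
```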
* In-memory data representation

Avro is very good at reducing serialized size, but doesn't optimize memory footprint. None of this is a big deal for the typical Hadoop use case, but for my use cases, where I want to serialize these things into BDBs or some other key/value store, in-memory footprint is critical. Extra nested object references can easily consume a lot of memory and reduce the effectiveness of in-memory caching for key/value stores. Another time you would want to make sure minimum memory is used is in a map-side join. Although something can be done to trim up the Specific API, I think that an annotation-based approach (with ASM) will be the most flexible and powerful in the long run. For example, a fixed-size object can just be a byte[] with an annotation, so that ASM knows how to decorate the setter/getter to enforce the size, what the Avro properties for the field are, etc. -- rather than having to be its own object that inherits from an abstract fixed type. Much other naming can collapse from objects to methods/annotations this way when generating classes from schemas, and vice versa when generating schemas from classes without otherwise altering them.

* Schema re-use

Schema re-use is a challenge. Since all the types have to be available in the same JSON parse, some things get duplicated. I have duplicate GUID and IpAddr named records inside different *.avsc files, for example. Some built-in way to get re-use will be helpful. I noticed that some unit tests seem to pre-process some includes; that looks a bit clunky. What else is there? It would be useful if the Specific compiler took a set of files and compiled them all, looking across the file set for named items that don't yet exist in the one currently being processed. I suppose I could put them all into one *.avsc in an array, but that won't take long to end up at 50K-plus of text in one file if Avro is used a lot.
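The single-file approach would look something like this, assuming the tools accept a top-level JSON array (a union of named types) and generate a class per named type -- GUID and IpAddr are the shared types mentioned above, and "Server" is a hypothetical consumer of them:

```json
[
    {"type": "fixed", "name": "GUID", "size": 16},
    {"type": "record", "name": "IpAddr", "fields": [
        {"name": "addr", "type": "bytes"}
    ]},
    {"type": "record", "name": "Server", "fields": [
        {"name": "id", "type": "GUID"},
        {"name": "ip", "type": "IpAddr"}
    ]}
]
```

Later types in the array can reference earlier ones by name, which is exactly the cross-file lookup I'd want the Specific compiler to do across a file set.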
* Future format ideas

One thing I noticed was that there is a fairly big temptation in some cases to have a union of one type with null. I have been able to essentially treat an empty string, empty array, empty byte[], or a sentinel value as equivalent to null for most cases, but that brought me to an idea: for many types, Avro serializes the number of entries first. Why not treat a serialized length of -1 as equivalent to null? Then ["null", "string"] serializes to no more space than just "string", and the same goes for arrays, bytes, etc. One problem with that would be that null would always sort to one side of the union types, but it does make null "free" for any type with a size. Obviously, this is a rather incompatible change, not for consideration any time soon -- and it might be a lot of work for one byte.

Another idea is something union-esque. I'm not sure what to call it; maybe a multi-union, maybe a sparse-record. The idea is that you define a bunch of optional fields in a record/union-like container, and any, all, or none of the items may be present. Right now you can simulate this with a bunch of nested unions with nulls (which is horrible to actually use with either API), but if it were intrinsic, then a bitmask or other more compact serialization could be used. For some types of data with a lot of sparse fields, this is helpful. A map isn't great because the keys are big and the types must all be the same; a union has only one branch and restrictions on duplication of types. The in-memory footprint of this can also be optimized in ways you can't achieve by simulating it with other types. This is the only gap I sense when defining schemas:

"It's either one of these, or one of those" >>> does either contain data? union : enum
"It's not always there" >>> union with null, or encode null as a value (-1, "", [], etc.)
"It's got a bunch of fields, but only a few of them are ever set" >>> ???
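To make the -1-length idea concrete: Avro writes lengths as zigzag-encoded varints, so a length of -1 would cost exactly one byte (0x01). A minimal sketch, not real Avro API -- the encode_nullable_bytes helper is hypothetical, but the zigzag/varint arithmetic matches Avro's long encoding:

```python
def zigzag(n):
    """Zigzag-encode a signed 64-bit int, as Avro does for longs."""
    return (n << 1) ^ (n >> 63)

def write_varint(n):
    """Encode a non-negative int as a little-endian base-128 varint."""
    out = bytearray()
    while True:
        lo = n & 0x7F
        n >>= 7
        if n:
            out.append(lo | 0x80)
        else:
            out.append(lo)
            return bytes(out)

def encode_nullable_bytes(value):
    """Hypothetical encoding where length -1 means null, so a
    ["null", "bytes"] union costs no more space than plain "bytes"."""
    if value is None:
        return write_varint(zigzag(-1))  # a single byte: 0x01
    return write_varint(zigzag(len(value))) + value
```

So encode_nullable_bytes(None) is the one byte b"\x01", while encode_nullable_bytes(b"hi") is b"\x04hi" -- the same size a bare "bytes" value would be, with null squeezed into the sign of the length.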
Granted, the third one is usually a sign of some very unstructured data, but it does happen on occasion. I don't have a need for it any time soon.

What about field-groups? Sometimes records are large and one wants to project out only a few fields, but it can take a while to find the 30th field in a complex record. Often, certain columns are frequently accessed together, so many systems have the ability to group them. For a single Avro record, there could be length markers to allow efficient skipping over column groups within records. This would increase the serialized size, but for large records, 3 or 4 markers to break it up could be a huge performance win when projecting out subsets of the data.

* Building and Packaging Java

Published Maven artifacts including source would be useful. It's nice to pull up code in Eclipse, click through on an Avro class, and see the source without having to configure anything -- or for the debugger to chase source code into other packages without manually finding the source. Especially when helping others on their machines (since I have the Avro source :D).

Dependencies need to be documented somewhere fairly visible. What does Avro need for runtime usage of only the Specific API? Generic? What extra does it need to generate source code from schemas/protocols? What extra does it need to generate protocols/schemas from classes via reflection? The full "avroj" jar file is huge -- 4MB -- and most of it is only needed at build time.

I think that is enough for now ... Looking forward to more Avro!