[orientdb] Schema Driven Binary Serialization - draft spec

Steve Sun, 06 Apr 2014 20:13:27 -0700

*Schema Driven Binary Serialization*
(see issue #1890
<https://github.com/orientechnologies/orientdb/issues/1890>)


Code on Github
<https://github.com/shadders/orientdb/tree/binary-serialization/binary>

*Status*

Note that current code on Github does not yet integrate with OrientDB. 
There are some issues to resolve with a structural change to the way
OClass is represented before it can work properly (see 'Things to Deal
With' section).  Until then it is simply demonstrated by
serializing/deserializing directly with arrays.

I will make a separate post with a list of questions I need to answer to
progress.  But there are some obvious questions that arise from this
post so if anyone wants to answer them feel free.

*Definitions*

*Varint*: An integer that is represented by any number of 1 byte
segments.  The first bit of each segment is set to 1 if there is a
subsequent segment.  A number is constructed by concatenating the last 7
bits of each byte.  This allows for the following value ranges:
1 byte : 127
2 bytes: 16k
3 bytes: 2m
4 bytes: 268m
The number of bytes is not bounded: by reading into a BigInteger you could
represent arbitrarily large integers, although in this specification we
only use them to represent Java int and long types.
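
A minimal sketch of the varint described above. The definition does not say which 7-bit group comes first, so this assumes most significant group first; it also assumes only non-negative values are written. The class and method names are hypothetical and not taken from the Github code.

```java
final class VarintSketch {

    // Encodes a non-negative long as 7-bit groups, most significant group first
    // (the group order is an assumption, the spec text does not state it).
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        // Count the 7-bit groups needed; zero still takes one byte.
        int groups = 1;
        for (long v = value >>> 7; v != 0; v >>>= 7) {
            groups++;
        }
        byte[] out = new byte[groups];
        for (int i = groups - 1; i >= 0; i--) {
            int bits = (int) (value & 0x7F);
            // Every byte except the last has its first (high) bit set to 1.
            out[i] = (byte) (i == groups - 1 ? bits : bits | 0x80);
            value >>>= 7;
        }
        return out;
    }

    // Decodes a varint starting at offset; stops at the first byte whose high bit is 0.
    // No overflow or bounds checking is done for malformed input.
    static long decode(byte[] buf, int offset) {
        long result = 0;
        int pos = offset;
        while (true) {
            int b = buf[pos++] & 0xFF;
            result = (result << 7) | (b & 0x7F);
            if ((b & 0x80) == 0) {
                return result;
            }
        }
    }
}
```

With this layout 127 fits in one byte, 128 to 16383 take two bytes and 16384 is the first value that needs a third.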

*nameId*: an int (or long) index from an array of field names.  This
index is proposed to be one per class.  This may result in some
redundancy of names but by keeping the number per class low we can
represent field names in each record using fewer bytes.  Getting the
field name using the nameId is a single array lookup.  This is stored on
disk as a varint allowing 127 names in a single byte and 16k names per
class before we need to use a 3rd byte for name storage per record.
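
A rough sketch of a per-class name table, to show why the lookup is a single array access. It assumes ids are handed out in order of first use and never reused; the class name NameTableSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NameTableSketch {

    // nameId -> field name; the nameId is simply the position in this list.
    private final List<String> names = new ArrayList<>();
    // Reverse index so that writing a record does not scan the list.
    private final Map<String, Integer> ids = new HashMap<>();

    // Returns the existing nameId for a field name, or assigns the next free one.
    int idFor(String name) {
        Integer existing = ids.get(name);
        if (existing != null) {
            return existing;
        }
        int id = names.size();
        names.add(name);
        ids.put(name, id);
        return id;
    }

    // Single array lookup from nameId back to the field name.
    String nameFor(int nameId) {
        return names.get(nameId);
    }
}
```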

*Format Specification*

The 'required' field indicates whether a header field is necessary to
make this proposal work.  If 'no' then it's an optional feature that we
can choose to incorporate or not.

*Header*

*Field*
        *Type*
        *Required?*
        *Implemented*
        *Description*

HEADER_HEADER
The first 4 fields of this section are padded to 6 bytes.  This is
because headerLength cannot be known until the entire header is written
and as it is a varint its own length in bytes is unknown.  By padding
to 6 bytes in the majority of cases headerLength will fit in the padded
section.  On the rare exceptions the header bytes to the right will need to
be shifted.

format
        byte
        no
        yes
        Specifies both the header format and the data serialization format. 
The 4 most significant bits indicate header format and the 4 LSBs
indicate data serialization format.
e.g. if we wish to make use of the optional compressedbits header field
we could specify this as a different header format.  Alternately this field
could be used to indicate padding or data hole policy.  For
testing/debug it may be convenient to use a string data serialization
format.  Or if using compressedbits this field may specify which
compression algorithm or settings to use.
classId
        varint
        yes
        yes
        A unique id for the class this record represents.  This gives a key to
locating the OClassSet.  0 is reserved for the static instance
SCHEMALESS_SET which has a single OClassVersion: SCHEMALESS and contains
no fields.
version
        varint
        yes
        yes
        The schema version within the class represented by OClassVersion which
can be looked up from OClassSet.  Versioning schemas allows us to
update the schema without necessarily having to update all records on
disk.  In the case where a constraint has narrowed (i.e. adding NOT
NULL to a property) this may still be necessary, but this is an issue
OrientDB has to contend with now so it can be handled using the same policy.
headerLength
        varint
        yes*
        yes
        Length (in bytes) of the complete header excluding dataLength. 
dataLength can be read by setting offset to headerLength.  This avoids
needing to parse the entire header to read dataLength and calculate the
dataOffset.  *If we accept that the full header must be parsed this
field is not required.
fieldCount
        varint
        yes
        yes
        The number of fields in the header.  This doesn't include fixed length
fields (as all metadata for those is available from the schema).  It
does include null fields, which in the case of a variable length schema
declared field means there isn't actually any information in the header.
nullbitsLength
        varint
        yes
        yes
        The number of bytes to read for the nullbits array
nullbits
        byte[]
        yes
        yes
        An array of bytes in which each bit (starting from LSB) is set to 1 if
the corresponding internally ordered field is null.  i.e. the first
fixed length field is index 0 which corresponds to the rightmost bit.
compressedbits
        byte[]
        no
        no
        A set of flag bits similar to nullbits but indicating whether a
field's data is compressed.  Additional field flags like this could be
added for other attributes if needed.  Note that length is not needed
because it will be the same length as nullbits.
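
To make the bit layout of the format and nullbits fields concrete, here is a small sketch. It assumes nullbits covers every internally ordered field (fixed length ones included, since the first fixed length field is index 0) and that field 8 and above continue in the next byte of the array; the byte order across the array is my assumption, as are all the names used.

```java
final class HeaderBitsSketch {

    // Packs the header format into the 4 MSBs and the data serialization format into the 4 LSBs.
    static byte packFormat(int headerFormat, int dataFormat) {
        if ((headerFormat & ~0xF) != 0 || (dataFormat & ~0xF) != 0) {
            throw new IllegalArgumentException("each format must fit in 4 bits");
        }
        return (byte) ((headerFormat << 4) | dataFormat);
    }

    static int headerFormat(byte format) {
        return (format >> 4) & 0xF;
    }

    static int dataFormat(byte format) {
        return format & 0xF;
    }

    // Bytes needed to hold one flag bit per field.
    static int nullbitsLength(int totalFields) {
        return (totalFields + 7) / 8;
    }

    // Field 0 is the rightmost (least significant) bit of byte 0.
    static void setNull(byte[] nullbits, int fieldIndex) {
        nullbits[fieldIndex >>> 3] |= (byte) (1 << (fieldIndex & 7));
    }

    static boolean isNull(byte[] nullbits, int fieldIndex) {
        return (nullbits[fieldIndex >>> 3] & (1 << (fieldIndex & 7))) != 0;
    }
}
```

A compressedbits array, if adopted, could reuse setNull/isNull unchanged since it has the same length and indexing as nullbits.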

SCHEMA DECLARED FIXED LENGTH FIELD HEADER ENTRIES
There are no entries in the header for these fields as all required
metadata is immutable per record and contained within the schema.


SCHEMA DECLARED VARIABLE LENGTH FIELD HEADER ENTRIES
These entries do not require nameId or dataType as they are immutable per
record and contained within the schema.
Note: fields with a null value (indicated by nullbits) do not require an
offset or length entry so in this case there will be no entry at all. 
However it is still counted in the 'fieldCount' header field.

offset
        varint
        yes
        yes
        The offset within the data portion of the record. i.e. offset +
dataOffset = real offset.
length
        varint
        yes
        yes
        Length in bytes of the serialized field.  Note that by keeping length
in the header rather than prepending it to the data itself we gain the
ability to scan for data holes without having to read the data itself.

SCHEMALESS FIELD HEADER ENTRIES
Fields that are not declared in the schema require additional metadata.
Note: fields with a null value (indicated by nullbits) do not require an
offset or length entry so they will only have a nameId and a dataType
(arguable whether dataType is needed either?)

nameId
        varint
        yes
        yes
        Index of the actual field name String within the class.  This can be
retrieved by a simple ArrayList.get(nameId).
dataType
        byte
        yes
        yes
        Equivalent to OType.id which is currently equivalent to OType.ordinal()
although current code does not guarantee this will always be the case (I
think it should).
offset
        varint
        yes
        yes
        The offset within the data portion of the record. i.e. offset +
dataOffset = real offset.
length
        varint
        yes
        yes
        Length in bytes of the serialized field.

END OF FIELD ENTRIES

dataLength
        uint32
        yes
        yes
        Length of data (which may optionally include some reserved space). 
This field is written as a uint32 rather than a varint to simplify
calculating the dataOffset internally.  dataOffset is not written in the
header as it is simply headerLength + 4 bytes (for the uint32).
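
The offset arithmetic can be sketched as follows. This assumes the uint32 is written big-endian (the spec above does not state a byte order) and that headerLength has already been decoded from its varint; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

final class RecordOffsetsSketch {

    // Size of the fixed-width dataLength field.
    static final int DATA_LENGTH_BYTES = 4;

    // dataLength sits immediately after the header, so headerLength is its offset.
    static long readDataLength(byte[] record, int headerLength) {
        // ByteBuffer reads big-endian by default; mask to treat the int as unsigned.
        return ByteBuffer.wrap(record).getInt(headerLength) & 0xFFFFFFFFL;
    }

    // Start of the data portion: header plus the 4 bytes of dataLength.
    static int dataOffset(int headerLength) {
        return headerLength + DATA_LENGTH_BYTES;
    }

    // A field's header entry stores an offset relative to the data portion.
    static long realOffset(int headerLength, long fieldOffset) {
        return dataOffset(headerLength) + fieldOffset;
    }
}
```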



*Data*

The header format is agnostic to the data serialization format.  All
that it requires is that the serializer be able to receive an offset to
write to and return a dataLength once written.  Currently we are using
the existing serializers retrieved from OBinarySerializerFactory. 
However not all OTypes are covered and some introduce redundant length
fields. 

Embedded data types could in theory use the same header format and
serialize the same way as a parent type.

*Things to Deal With*

*Class inheritance*

Currently no account is taken of parent classes.  The code needs to
consolidate parent class fields and sort them all internally the same
way a single class does.

*Embedded (and collections/map of embedded)*

Embedded data types could in theory use the same header format and
serialize the same way as a parent type within the parent's data
section.  Need to look at whether to use this approach (simple) or
whether there is any benefit in consolidating fields of embedded
documents into the parent header.

*Links (and collections/maps)*

This hasn't been visited yet.

*OType.ANY*

No binary serializer currently exists that can handle OType.ANY.  I need
to find out if there is existing code to determine type from untyped
input.  I assume there must be because there is an
ODocument.field(fieldName, value) method.

*Persisting additional class metadata*

There is a fundamental mismatch between the way that OrientDB persists
classes and this scheme.  Namely that each OClassVersion (the current
equivalent of OClassImpl) is a member of an OClassSet.  Each OClassSet
shares a table of nameId -> name mappings between all of its child
OClassVersions.  The logical way to persist this would be:

OClassSet {
    int classId;
    Map<Integer, String> nameIdMap;
    List<OClassVersion> versions;
}

Piggybacking OClassSet on top of OClassImpl doesn't seem the right way
to do this.

Additionally there will need to be persisted a database global map of
classId -> OClassSet.

I'm open to suggestions as to how to achieve this.  These special
documents probably cannot be persisted themselves in the binary format
(without some ugly hacking) as the OBinarySerializer is dependent on
looking up the OClassSet and nameIds.

*Removing bytes after deserialization*

Lazy serialization/deserialization is quite feasible by overriding the
various ODocument.field() methods.  i.e. when we read a record we only
parse the header (in fact only need to parse the first section of the
header initially).  Then if a field is requested that hasn't been
retrieved yet we scan the header entry and deserialize.  The question is
then raised, under what circumstances is it too expensive to hold on to
the backing byte array rather than just deserializing the remaining
fields and releasing it.  It would be useful if there was some mechanism
to determine if the record is part of a large query, or if the
OBinDocument itself provided a method to initiate this so that OrientDB
can manage it at a lower level.
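
As a rough illustration of the lazy approach and of a method that releases the backing array, here is a sketch. It is not the OBinDocument code: it assumes every field is a UTF-8 string and that the header has already been parsed into absolute offset/length pairs, and all names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class LazyRecordSketch {

    // Backing byte array; set to null once released.
    private byte[] bytes;
    // Field name -> {absolute offset, length}, as read from the header.
    private final Map<String, int[]> entries;
    // Fields that have already been deserialized.
    private final Map<String, String> decoded = new HashMap<>();

    LazyRecordSketch(byte[] bytes, Map<String, int[]> entries) {
        this.bytes = bytes;
        this.entries = entries;
    }

    // Deserializes a field the first time it is requested and caches the result.
    String field(String name) {
        String value = decoded.get(name);
        if (value == null && bytes != null && entries.containsKey(name)) {
            int[] entry = entries.get(name);
            value = new String(bytes, entry[0], entry[1], StandardCharsets.UTF_8);
            decoded.put(name, value);
        }
        return value;
    }

    // Deserializes whatever is left, then drops the byte array so it can be collected.
    void releaseBytes() {
        for (String name : entries.keySet()) {
            field(name);
        }
        bytes = null;
    }

    boolean holdsBytes() {
        return bytes != null;
    }
}
```

Something like releaseBytes() is the hook the database could call when it knows the record is part of a large result set.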

*Current Code Cleanup*

*Bring back OBinHeaderEntry*

OBinProperty (extends OProperty) and OBinHeaderEntry both implement
IBinHeaderEntry.

OBinHeaderEntry was merged into OBinProperty in an effort to simplify. 
For an OBinRecordHeader we clone the schema declared OBinProperties from
the schema then add additional OBinProperties for any other fields that
exist (which is ugly because the properties are never added to the
schema).  OBinHeaderEntry is lighter weight, easier to object pool
and makes a clear distinction between mutable and immutable.

*Tighten up the API*

Nothing public unless it needs to be exposed.
