*Schema Driven Binary Serialization* (see issue #1890 <https://github.com/orientechnologies/orientdb/issues/1890>)
Code on Github <https://github.com/shadders/orientdb/tree/binary-serialization/binary>

*Status*

Note that the current code on Github does not yet integrate with OrientDB. There are some issues to resolve with a structural change to the way OClass is represented before it can work properly (see the 'Things to Deal With' section). Until then it is simply demonstrated by serializing/deserializing directly with arrays. I will make a separate post with a list of questions I need to answer to progress. But there are some obvious questions that arise from this post, so if anyone wants to answer them feel free.

*Definitions*

*Varint*: An integer that is represented by any number of 1-byte segments. The first bit of each segment is set to 1 if there is a subsequent segment. The number is constructed by concatenating the last 7 bits of each byte. This allows for the following maximum values:

1 byte : 127
2 bytes: 16k (16,383)
3 bytes: 2m (2,097,151)
4 bytes: 268m (268,435,455)

The number of bytes is not bounded. By reading into a BigInteger you could represent arbitrarily large integers, although in this specification we only use varints to represent Java int and long types.

*nameId*: An int (or long) index into an array of field names. The proposal is to keep one such index per class. This may result in some redundancy of names, but by keeping the number of names per class low we can represent field names in each record using fewer bytes. Getting the field name from the nameId is a single array lookup. The nameId is stored on disk as a varint, allowing 127 names in a single byte and 16k names per class before we need to use a 3rd byte for name storage per record.

*Format Specification*

The 'required' field indicates whether a header field is necessary to make this proposal work. If 'no' then it's an optional feature that we can choose to incorporate or not.

*Header*

Columns: *Field*, *Type*, *Required?*, *Implemented*, *Description*

HEADER_HEADER

The first 4 fields of this section are padded to 6 bytes.
This is because headerLength cannot be known until the entire header is written, and as it is a varint its own length in bytes is unknown. By padding to 6 bytes, headerLength will fit in the padded section in the majority of cases. In the rare exceptions the header bytes to the right will need to be shifted.

format (type: byte, required: no, implemented: yes)
Specifies both the header format and the data serialization format. The 4 most significant bits indicate the header format and the 4 least significant bits indicate the data serialization format. E.g. if we wish to make use of the optional compressedbits header field we could specify this as a different header format. Alternatively this field could be used to indicate padding or data hole policy. For testing/debug it may be convenient to use a string data serialization format. Or, if using compressedbits, this field may specify which compression algorithm or settings to use.

classId (type: varint, required: yes, implemented: yes)
A unique id for the class this record represents. This gives a key for locating the OClassSet. 0 is reserved for the static instance SCHEMALESS_SET, which has a single OClassVersion, SCHEMALESS, and contains no fields.

version (type: varint, required: yes, implemented: yes)
The schema version within the class, represented by an OClassVersion which can be looked up from the OClassSet. Versioning schemas allows us to update the schema without necessarily having to update all records on disk. In the case where a constraint on a property has narrowed (e.g. adding NOT NULL) this may still be necessary, but this is an issue OrientDB has to contend with now, so it can be handled using the same policy.

headerLength (type: varint, required: yes*, implemented: yes)
Length (in bytes) of the complete header excluding dataLength. dataLength can be read by setting the offset to headerLength. This avoids needing to parse the entire header to read dataLength and calculate the dataOffset.
*If we accept that the full header must be parsed, this field is not required.

fieldCount (type: varint, required: yes, implemented: yes)
The number of fields in the header.
This doesn't include fixed length fields (as all metadata for those is available from the schema). It does include null fields, which in the case of a variable length schema declared field means there isn't actually any information in the header.

nullbitsLength (type: varint, required: yes, implemented: yes)
The number of bytes to read for the nullbits array.

nullbits (type: byte[], required: yes, implemented: yes)
An array of bytes in which each bit (starting from the LSB) is set to 1 if the corresponding internally ordered field is null. I.e. the first fixed length field is index 0, which corresponds to the rightmost bit.

compressedbits (type: byte[], required: no, implemented: no)
A set of flag bits similar to nullbits but indicating whether a field's data is compressed. Additional field flags like this could be added for other attributes if needed. Note that a length is not needed because it will be the same length as nullbits.

SCHEMA DECLARED FIXED LENGTH FIELD HEADER ENTRIES

There are no entries in the header for these fields, as all required metadata is immutable per record and contained within the schema.

SCHEMA DECLARED VARIABLE LENGTH FIELD HEADER ENTRIES

These entries do not require nameId or dataType, as they are immutable per record and contained within the schema.

Note: fields with a null value (indicated by nullbits) do not require an offset or length entry, so in this case there will be no entry at all. However, the field is still counted in the 'fieldCount' header field.

offset (type: varint, required: yes, implemented: yes)
The offset within the data portion of the record. I.e. offset + dataOffset = real offset.

length (type: varint, required: yes, implemented: yes)
Length in bytes of the serialized field. Note that by keeping the length in the header rather than prepending it to the data itself we gain the ability to scan for data holes without having to read the data itself.

SCHEMALESS FIELD HEADER ENTRIES

Fields that are not declared in the schema require additional metadata.
Note: fields with a null value (indicated by nullbits) do not require an offset or length entry, so they will only have a nameId and a dataType (it is arguable whether dataType is needed either).

nameId (type: varint, required: yes, implemented: yes)
Index of the actual field name String within the class. This can be retrieved by a simple ArrayList.get(nameId).

dataType (type: byte, required: yes, implemented: yes)
Equivalent to OType.id, which is currently equivalent to OType.ordinal(), although the current code does not guarantee this will always be the case (I think it should).

offset (type: varint, required: yes, implemented: yes)
The offset within the data portion of the record. I.e. offset + dataOffset = real offset.

length (type: varint, required: yes, implemented: yes)
Length in bytes of the serialized field.

END OF FIELD ENTRIES

dataLength (type: uint32, required: yes, implemented: yes)
Length of the data (which may optionally include some reserved space). This field is written as a uint32 rather than a varint to simplify calculating the dataOffset internally. dataOffset is not written in the header as it is simply headerLength + 4 bytes (for the uint32).

*Data*

The header format is agnostic to the data serialization format. All that it requires is that the serializer be able to receive an offset to write to and return a dataLength once written. Currently we are using the existing serializers retrieved from OBinarySerializerFactory. However, not all OTypes are covered and some introduce redundant length fields. Embedded data types could in theory use the same header format and serialize the same way as a parent type.

*Things to Deal With*

*Class inheritance*

Currently no account is taken of parent classes. The code needs to consolidate parent class fields and sort them all internally the same way a single class does.

*Embedded (and collections/map of embedded)*

Embedded data types could in theory use the same header format and serialize the same way as a parent type within the parent's data section.
We need to look at whether to use this approach (simple) or whether there is any benefit in consolidating fields of embedded documents into the parent header.

*Links (and collections/maps)*

This hasn't been visited yet.

*OType.ANY*

No binary serializer currently exists that can handle OType.ANY. I need to find out if there is existing code to determine type from untyped input. I assume there must be because there is an ODocument.field(fieldName, value) method.

*Persisting additional class metadata*

There is a fundamental mismatch between the way that OrientDB persists classes and this scheme. Namely, each OClassVersion (the current equivalent of OClassImpl) is a member of an OClassSet. Each OClassSet shares a table of nameId -> name mappings between all of its child OClassVersions. The logical way to persist this would be:

    OClassSet {
        int classId;
        Map<Integer, String> nameIdMap;
        List<OClassVersion> versions;
    }

Piggybacking OClassSet on top of OClassImpl doesn't seem the right way to do this. Additionally, a database-global map of classId -> OClassSet will need to be persisted. I'm open to suggestions as to how to achieve this. These special documents probably cannot be persisted themselves in the binary format (without some ugly hacking) as the OBinarySerializer is dependent on looking up the OClassSet and nameIds.

*Removing bytes after deserialization*

Lazy serialization/deserialization is quite feasible by overriding the various ODocument.field() methods. I.e. when we read a record we only parse the header (in fact we only need to parse the first section of the header initially). Then if a field is requested that hasn't been retrieved yet, we scan for its header entry and deserialize it. The question is then raised: under what circumstances is it too expensive to hold on to the backing byte array rather than just deserializing the remaining fields and releasing it? It would be useful if there was some mechanism to determine if the record is part of a large query.
Alternatively, the OBinDocument itself could provide a method to initiate this so that OrientDB can manage it at a lower level.

*Current Code Cleanup*

*Bring back OBinHeaderEntry*

OBinProperty (extends OProperty) and OBinHeaderEntry both implement IBinHeaderEntry. OBinHeaderEntry was merged into OBinProperty in an effort to simplify. For an OBinRecordHeader we clone the schema declared OBinProperties from the schema, then add additional OBinProperties for any other fields that exist (which is ugly because the properties are never added to the schema). OBinHeaderEntry is lighter weight, easier to object pool, and makes a clear distinction between mutable and immutable.

*Tighten up the API*

Nothing public unless it needs to be exposed.
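
*Illustration: varint encoding*

The varint described under Definitions can be sketched roughly as follows. This is only an illustration and not the code from the Github branch: the class name VarintSketch and its methods are made up for this example, and it assumes the least significant 7-bit group is written first (the post does not specify the group order). Values are treated as unsigned.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of OrientDB or the linked branch.
final class VarintSketch {

    // Encodes a value using 7 payload bits per byte. The high bit of a byte
    // is set to 1 when another byte follows.
    // ASSUMPTION: the least significant 7-bit group is written first.
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // continuation bit set
            value >>>= 7;                             // unsigned shift
        }
        out.write((int) value);                       // last byte, high bit clear
        return out.toByteArray();
    }

    // Decodes a varint starting at the given offset, reading until a byte
    // with a clear high bit is found.
    static long decode(byte[] buf, int offset) {
        long result = 0;
        int shift = 0;
        for (int i = offset; i < buf.length; i++) {
            byte b = buf[i];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
            if (shift > 63) {
                throw new IllegalArgumentException("varint too long for a long");
            }
        }
        throw new IllegalArgumentException("truncated varint");
    }

    public static void main(String[] args) {
        byte[] encoded = encode(300L);
        System.out.println(encoded.length + " bytes, decoded = " + decode(encoded, 0));
    }
}
```

Under these assumptions 127 fits in one byte, 128 needs two, and 300 is written as the two bytes 0xAC 0x02.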

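
*Illustration: format byte and nullbits*

The format byte and the nullbits array from the header specification could be handled along these lines. Again this is a sketch under assumptions rather than the actual implementation: HeaderBitsSketch and its method names are invented here, and it assumes field index i lives in byte i / 8 of the nullbits array (the post only states that index 0 is the rightmost bit).

```java
// Hypothetical helper, not part of OrientDB or the linked branch.
final class HeaderBitsSketch {

    // Packs the header format into the 4 most significant bits and the data
    // serialization format into the 4 least significant bits.
    static byte packFormat(int headerFormat, int dataFormat) {
        return (byte) (((headerFormat & 0x0F) << 4) | (dataFormat & 0x0F));
    }

    static int headerFormat(byte format) {
        return (format >> 4) & 0x0F;
    }

    static int dataFormat(byte format) {
        return format & 0x0F;
    }

    // Builds the nullbits array: bit (i % 8), counted from the LSB, of byte
    // (i / 8) is set to 1 when field i is null.
    // ASSUMPTION: fields beyond index 7 continue in the next byte of the array.
    static byte[] buildNullbits(boolean[] isNull) {
        byte[] bits = new byte[(isNull.length + 7) / 8];
        for (int i = 0; i < isNull.length; i++) {
            if (isNull[i]) {
                bits[i / 8] |= (byte) (1 << (i % 8));
            }
        }
        return bits;
    }

    static boolean isNull(byte[] nullbits, int fieldIndex) {
        return (nullbits[fieldIndex / 8] & (1 << (fieldIndex % 8))) != 0;
    }
}
```

With this layout nullbitsLength is simply the number of internally ordered fields divided by 8, rounded up.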