Just a bit of an update on this.  I've had a chance to work on it over
the last couple of days and I'll be pushing an update tomorrow or Saturday.

I was getting rather stuck on the compounded complexity of dealing
with schema versioning alongside binary serialization and integration
with the Orient API, so I've decided to drop schema versioning
temporarily.  It's only very loosely coupled with binary serialization,
so aside from maintaining the version field in the header format it's
not something that needs to be done all at once.  I'll get binary
serialization functional, and then schema versioning can be revisited
later without any doubling up.
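A rough sketch of what I mean by keeping the version field reserved in
the header (names and field widths here are illustrative only, not the
final format):

```java
import java.nio.ByteBuffer;

// Illustrative sketch only: a record header that reserves a schema-version
// byte even while schema versioning itself is shelved. Not the actual
// OrientDB header layout.
public class RecordHeader {
    public final byte version;  // schema version, reserved for later use
    public final int classId;   // id of the class the record belongs to

    public RecordHeader(byte version, int classId) {
        this.version = version;
        this.classId = classId;
    }

    public byte[] toBytes() {
        // 1-byte version + 4-byte classId
        return ByteBuffer.allocate(5).put(version).putInt(classId).array();
    }

    public static RecordHeader fromBytes(byte[] raw) {
        ByteBuffer buf = ByteBuffer.wrap(raw);
        return new RecordHeader(buf.get(), buf.getInt());
    }
}
```

This way binary serialization can ship now, and a later schema-versioning
pass only has to start honouring the byte that's already there.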

I'm a hair's breadth away from actual Orient persistence.  Last update
it was only serializing/deserializing to byte arrays; now we are able
to save binary records inside Orient.  This means I've finally had to
take the plunge and start modifying orient-core, though the changes are
fairly minimal.  The current sticking point is deserializing schema.
There are a few changes needed to the schema structure, which means
subclassing OClassImpl and finding all the references where that class
is created.  So I can serialize schema to the DB, but the remaining
issues are with deserializing it.  Once I find a solid few hours to
work on it I think I can sort it out.

Then it's on to writing field-level serializers to handle all the more
esoteric OTypes.

I've also managed to abstract field serialization, so for debugging we
can switch to JSON or another plain-text field serializer with a single
line of code.  This has been quite helpful in the process.
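Roughly what that abstraction looks like (class names here are made up
for illustration; they're not the actual ones in the patch):

```java
import java.nio.charset.StandardCharsets;

// Field serialization behind a strategy interface: the document serializer
// delegates each field, so swapping binary for a plain-text form when
// debugging is a one-line change. Names are illustrative only.
interface FieldSerializer {
    byte[] serialize(String name, Object value);
}

class BinaryFieldSerializer implements FieldSerializer {
    public byte[] serialize(String name, Object value) {
        // the real version dispatches on OType; plain UTF-8 for brevity
        return value.toString().getBytes(StandardCharsets.UTF_8);
    }
}

class DebugJsonFieldSerializer implements FieldSerializer {
    public byte[] serialize(String name, Object value) {
        String json = "\"" + name + "\":\"" + value + "\"";
        return json.getBytes(StandardCharsets.UTF_8);
    }
}

class DocumentSerializer {
    // the single line to change for debugging:
    private final FieldSerializer fields = new BinaryFieldSerializer();

    byte[] writeField(String name, Object value) {
        return fields.serialize(name, value);
    }
}
```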

I've also realised that per-record compression will be rather easy to
do.  But that's in the extras bucket, so I'll leave it as a bonus prize
once the core functions are sorted and stable.
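To show why it's easy: once a record is just a byte[], something like
java.util.zip can wrap it transparently.  A sketch under that assumption
(a flag bit in the header would mark compressed records, not shown; this
isn't part of the actual patch):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch only: per-record compression as a thin wrapper around the
// serialized byte[] of a record.
public class RecordCompressor {
    public static byte[] compress(byte[] record) {
        Deflater deflater = new Deflater();
        deflater.setInput(record);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished())
            out.write(buf, 0, deflater.deflate(buf));
        deflater.end();
        return out.toByteArray();
    }

    public static byte[] decompress(byte[] compressed) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!inflater.finished())
            out.write(buf, 0, inflater.inflate(buf));
        inflater.end();
        return out.toByteArray();
    }
}
```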

On 09/04/14 19:53, Luca Garulli wrote:
> On 8 April 2014 02:47, Steve <[email protected]
> <mailto:[email protected]>> wrote:
>
>
>     On 08/04/14 10:11, Luca Garulli wrote:
>>     I very much like the idea of Class versioning, to avoid the
>>     massive database updates that RDBMSs do.
>
>     How do you currently handle schema changes?  Do you do a full
>     update by default?  There are a few cases here.
>
>     1/ Expansive - i.e. a new field is added to the schema without
>     constraints, so existing records remain valid under the new
>     schema.
>
>     2/ Contractive - i.e. a new field is added with a constraint (like
>     NOT NULL) or an existing field has a constraint added.
>
>     3/ Not sure how to classify this one - an existing field has a
>     default value added to it or changed.
>
>     For case 1 we don't need to touch existing records.
>
>     For case 2 we need to check and update every record.  Or in the
>     NOT NULL case (typically you would also add a default value at
>     the same time) this *could* be done lazily, but with a caveat: if
>     you change the default value a second time and a record hasn't
>     been updated since the default value was first added, the record
>     will end up with the second default value, whereas the user
>     probably reasonably expects it to have the first default value.
>
>     For case 3, see case 2.
>
>
> Good analysis; we should classify the operations that need a full
> update and the operations that can work in mode (1).
>  
>
>>         *Persisting additional class metadata*
>>
>>         There is a fundamental mismatch between the way that OrientDB
>>         persists classes and this scheme.  Namely that each
>>         OClassVersion (the current equivalent of OClassImpl) is a
>>         member of an OClassSet.  Each OClassSet shares a table of
>>         nameId -> name mappings between all of its child
>>         OClassVersions.  The logical way to persist this would be:
>>
>>         OClassSet {
>>             int classId;
>>             Map<Integer, String> nameIdMap;
>>             List<OClassVersion> versions;
>>         }
>>
>>
>>     What's the content of nameIdMap? What nameId stands for?
>
>     Actually it's a List<String>; the nameId is the string's index in
>     the list.  But conceptually it's used like a map: read the nameId
>     from the record header, then look up the string field name using
>     the nameId as the index.
>
>
> So it's the map for field names. Why a Map and not just an array?
>  
>
>>         Piggybacking OClassSet on top of OClassImpl doesn't seem the
>>         right way to do this.
>>
>>         Additionally, a database-global map of classId -> OClassSet
>>         will need to be persisted.
>>
>>         I'm open to suggestions as to how to achieve this.  These
>>         special documents probably cannot be persisted themselves in
>>         the binary format (without some ugly hacking) as the
>>         OBinarySerializer is dependent on looking up the OClassSet
>>         and nameIds.
>>
>>
>>     We have a Schema that can manage this. The Schema record is
>>     marshalled like the others, so we can add what we want.
>
>     I noticed OClass has a backing document.  I guess the issue is
>     that OClassSet doesn't have properties so making it inherit from
>     OClassImpl doesn't really make sense.  What we need to do is
>     embed OClassImpls in an OClassSet.
>
>
> Ok.
>  
>
>>         *Removing bytes after deserialization*
>>
>>         Lazy serialization/deserialization is quite feasible by
>>         overriding the various ODocument.field() methods: when we
>>         read a record we only parse the header (in fact we initially
>>         only need to parse the first section of the header).  Then,
>>         if a field is requested that hasn't been retrieved yet, we
>>         scan the header entry and deserialize.  The question this
>>         raises is under what circumstances it is too expensive to
>>         hold on to the backing byte array rather than just
>>         deserializing the remaining fields and releasing it.  It
>>         would be useful if there were some mechanism to determine
>>         whether the record is part of a large query, or if the
>>         OBinDocument itself provided a method to initiate this so
>>         that OrientDB could manage it at a lower level.
>>
>>
>>     I'd like to explore the road of completely avoiding the
>>     Map<String,Object> of ODocument's _fieldValues. In fact, with
>>     efficient marshalling/unmarshalling we could do it on the fly.
>>
>>     PROS:
>>     - Less RAM used and fewer objects for the Garbage Collector
>>     (have you ever seen tons of Map.Entry?)
>>     - Fewer buffer copies: the byte[] could be the same one read
>>     from the OStorage layer
>>     - No need for the Level2 cache anymore: DiskCache keeps pages,
>>     so storing the unmarshalled Document no longer makes sense
>>
>>     CONS:
>>     - Slower access when reading the same field multiple times, but
>>     in this case developers could call field() once and store the
>>     content in a local variable
>>
>>     WDYT?
>
>     I had thought of adding that to the existing implementation by
>     overriding field(), but there's one major gotcha: every call for
>     the same field will return a different object, so o1 != o2, while
>     o1.equals(o2) depends on whether o1's class has implemented
>     equals().  This could be messy for the user.
>
>     With regard to excess Map.Entry objects, I think the Trove
>     library could help here.  It doesn't create Entry objects
>     internally, and it's faster and more efficient than HashMap.
>
>     Potentially we could optimise the internal representation using
>     arrays though.  For schema-declared fields we'd only need one map
>     per class to map field name -> array index.  For schemaless
>     fields we'd still need a map, though I'll ponder this; there may
>     be another way.
>
>
> I like the array solution more than using Trove.
>  
>
>>     We could also use a hybrid approach or a different
>>     implementation of ODocument to let the developer decide what to
>>     use.
>
>     Good idea.  If they are using predominantly schemaless classes
>     then it might be advantageous.
>
>
> We could have OSchemaFullDocument and OSchemaLessDocument impls. I
> don't know if it makes sense.
>  
>
>>     *Partial serialization*
>>
>>     I'd like also to explore the partial serialization case.
>>
>>     I mean the case when a user executes a query, browses the
>>     result set, changes a document field and sends it back to the
>>     database to be saved.
>>
>>     Now we keep track of changes in the ODocument (also used by
>>     indexes to stay aligned), so we could marshall and overwrite
>>     only the changed field in the byte[].
>>
>>     This feature must go together with abandoning the use of the
>>     Map to store field values and using only the byte[].
>
>     My original thoughts on this were much the same.  However, since
>     we first explored the issue I've been building custom disk
>     persistence mechanisms for bitcoin.  One was basically a
>     disk-backed ArrayList.  When I switched that implementation to
>     grouping entries into 4k blocks (to match the underlying disk
>     subsystem) I noticed it didn't affect performance at all.  So I
>     question whether, for most use cases, there is any benefit to
>     partial updates rather than rewriting the whole record.  Partial
>     updates add a lot of complexity (i.e. potential bugs) as you have
>     to handle data holes, possibly shift fields around, etc.  Even in
>     a large record that spans multiple disk blocks the advantage is
>     not so great if the blocks are contiguous on disk, as the
>     bottleneck is seek time, not write time.  You would presumably
>     read the whole byte array for a record into memory when reading,
>     as you don't know where in the record the wanted field is until
>     you've parsed the record header; and if you try to be cute and
>     parse the header first, then retrieve the data, you potentially
>     incur another disk seek, which rather nullifies the benefit.  So
>     we don't gain on disk access time, and we don't gain on byte
>     array allocation; the only real saving is the cost of
>     deserializing the record in memory.
>
>     For very large records however this does change the dynamics quite
>     a bit and probably has a valid use case.  Perhaps we need multiple
>     internal implementations?
>
>
> My goal was only to avoid marshalling/unmarshalling the entire
> record when we just set an integer field. With fixed-length fields
> we could update fields at very low cost; for variable-size fields we
> would have to shift the following content and update the pointers.
> Or, in that case, we could marshall the entire record for now.
>
> Lvc@
>
> -- 
>
> ---
> You received this message because you are subscribed to the Google
> Groups "OrientDB" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to [email protected]
> <mailto:[email protected]>.
> For more options, visit https://groups.google.com/d/optout.
