Hi Luca,

I'll give it a go with the real ODB code.  The reason I didn't is
that I'm actually quite new to ODB, even as an end user, but your
instructions will set me in the right direction.  Most of my experience
with data serialization formats has been with Bitcoin, which was mostly
for network protocol use cases rather than big-data storage.  But that
was also a high-performance scenario, so I guess there are a lot of
parallels.

On 19/02/14 21:33, Luca Garulli wrote:
> Hi Steve,
> sorry for such delay.
>
> I like your ideas, I think this is the right direction. varint8 and
> varint16 could be a good way to save space, but we should consider
> when this slows down some use cases, like partial field loading.
>
> About the POC you created: I think it would be much more useful if
> you played with real documents. It's easy, and you could push it to a
> separate branch to let us and other developers contribute & test.
> WDYT?
>
> Follow these steps:
>
> (1) create your serializer
>
> This is the skeleton of the class to implement:
>
> public class BinaryDocumentSerializer implements ORecordSerializer {
>   public static final String NAME = "binarydoc";
>
>   // UN-MARSHALLING
>   public ORecordInternal<?> fromStream(final byte[] iSource) {
>     return null; // TODO: decode the whole record from its binary form
>   }
>
>   // PARTIAL UN-MARSHALLING
>   public ORecordInternal<?> fromStream(final byte[] iSource,
>       final ORecordInternal<?> iRecord, String[] iFields) {
>     return null; // TODO: decode only the fields requested in iFields
>   }
>
>   // MARSHALLING
>   public byte[] toStream(final ORecordInternal<?> iSource, boolean iOnlyDelta) {
>     return null; // TODO: encode the record's fields to binary form
>   }
> }
>
> (2) register your implementation
>
> ORecordSerializerFactory.instance().register(
>     BinaryDocumentSerializer.NAME, new BinaryDocumentSerializer());
>
> (3) create a new ODocument subclass
>
> Then create a new class that extends ODocument but uses your
> implementation:
>
> public class BinaryDocument extends ODocument {
>   protected void setup() {
>     super.setup();
>     _recordFormat = ORecordSerializerFactory.instance()
>         .getFormat(BinaryDocumentSerializer.NAME);
>   }
> }
>
> (4) Try it!
>
> And now try to create a BinaryDocument, set fields and call .save().
> The method BinaryDocumentSerializer.toStream() will be called. 
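>
> For example, a minimal sketch of step (4) (the field names and values
> here are just placeholders, and a database is assumed to be open):
>
> BinaryDocument doc = new BinaryDocument();
> doc.field("name", "Jay");
> doc.field("age", 42);
> doc.save(); // BinaryDocumentSerializer.toStream() runs here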
>
>
>
> Lvc@
>
>
>
> On 18 February 2014 06:08, Steve <[email protected]> wrote:
>
>
>>      The point is: why should I store the field name when I've
>>     declared that a class has such names?
>
>     Precisely.  But I don't think you need to limit it to the
>     declarative case, i.e. schema-full.  By using a numbered
>     field_id you cover the schema-full, schema-mixed and schema-free
>     cases with a single solution.  There are two issues here:
>     performance and storage space.  Arguably, improving storage space
>     also improves performance in a big-data context because it allows
>     caches to retain more logical units in memory.
>
>
>     I've been having a good think about this and I think I've come up
>     with a viable plan that solves a few problems.  It requires schema
>     versioning.
>
>     I was hesitant to make this suggestion as it introduces more
>     complexity in order to improve compactness and avoid unnecessary
>     reading of metadata.  However, I see from your original proposal
>     that the problem exists there as well:
>
>     Cons:
>
>       * Every time the schema changes, a full scan and update of
>         records is needed
>
>     The proposal is that record metadata is made of 3 parts plus a
>     meta-header (which in most cases would be 2-3 bytes): fixed-length
>     schema-declared fields, variable-length schema-declared fields,
>     and schema-less fields.  The problem, as you point out, with a
>     single schema per class is that if you change the schema you have
>     to update every record.  If you insert a field before the last
>     field you would likely have to rewrite every record from scratch.
>
>     First, a couple of definitions:
>
>     varint8: a standard varint that is built from any number of 1 byte
>     segments.  The first bit of each segment is set to 1 if there is a
>     subsequent segment.  A number is constructed by concatenating the
>     last 7 bits of each byte.  This allows for the following value ranges:
>     1 byte : 127
>     2 bytes: 16k
>     3 bytes: 2m
>     4 bytes: 268m
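>
>     A minimal sketch of that scheme (using java.nio.ByteBuffer; the
>     method names are illustrative, and I'm assuming least-significant
>     7-bit group first, protobuf-style -- the segment order isn't fixed
>     by the definition above):
>
>     static void writeVarint8(ByteBuffer out, int value) {
>       while ((value & ~0x7F) != 0) {
>         out.put((byte) ((value & 0x7F) | 0x80)); // first bit 1: more follows
>         value >>>= 7;
>       }
>       out.put((byte) value); // first bit 0: last segment
>     }
>
>     static int readVarint8(ByteBuffer in) {
>       int result = 0, shift = 0;
>       byte b;
>       do {
>         b = in.get();
>         result |= (b & 0x7F) << shift; // concatenate the low 7 bits
>         shift += 7;
>       } while ((b & 0x80) != 0);
>       return result;
>     }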
>
>     varint16: same as varint8 but the first segment is 16 bits and all
>     subsequent are 8 bits
>     2 bytes: 32k
>     3 bytes: 4m
>     4 bytes: 536m
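>
>     Decoding varint16 would differ only in the first segment (a
>     big-endian 16-bit first segment is an assumption here; its first
>     bit is again the continuation flag):
>
>     static int readVarint16(ByteBuffer in) {
>       int first = in.getShort() & 0xFFFF;   // 16-bit first segment
>       int result = first & 0x7FFF;          // low 15 bits carry value
>       int shift = 15;
>       boolean more = (first & 0x8000) != 0; // first bit: continuation
>       while (more) {
>         byte b = in.get();
>         result |= (b & 0x7F) << shift;
>         shift += 7;
>         more = (b & 0x80) != 0;
>       }
>       return result;
>     }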
>
>     nameId: an int (or long) index from a field name array.  This
>     index could be one per JVM or one per class.  Getting the field
>     name using the nameId is a single array lookup.  This is stored on
>     disk as a varint16 allowing 32k names before we need to use a 3rd
>     byte for name storage.
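>
>     i.e. something like (the table name is hypothetical):
>
>     String fieldName = classFieldNames[nameId]; // single array lookup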
>
>     I propose a record header that looks like this:
>
>     version:varint8|header_length:varint8|variable_length_declared_field_headers|undeclared_field_headers
>
>     Version is the schema version and would in most cases be only 1
>     byte.  You would need 128 schema changes to make it 2 bytes.  This
>     proposal would require a cleanup tool that could scan all records
>     and reset them all to the most recent schema version (at which
>     point version is reset to 0).  But it would not be necessary on
>     every schema change: the user could choose if and when to run it.
>     The only time you would need to do a full scan would be if you
>     were introducing some sort of constraint and needed to validate
>     that existing records don't violate the constraint.
>
>     When a new schema is generated, the user-defined order of fields
>     is stored in each field's schema entry.  Internally the fields are
>     rearranged so that all fixed-length fields come first.  Because
>     the order and length of those fields is known from the schema,
>     there is no need to store offset/length for them in the record
>     header (see the sketch below).
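>
>     A sketch of how the offset of a fixed-length declared field could
>     be derived purely from the schema (SchemaVersion/FieldDef are
>     hypothetical types standing in for whatever the schema registry
>     actually holds):
>
>     int fixedFieldOffset(SchemaVersion schema, int nameId, int dataStart) {
>       int offset = dataStart;
>       for (FieldDef f : schema.fixedFields()) { // schema-defined order
>         if (f.nameId == nameId)
>           return offset; // no per-record header entry needed
>         offset += f.fixedLength;
>       }
>       throw new IllegalArgumentException("not a fixed field: " + nameId);
>     }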
>
>     Variable-length declared fields need only a length and offset; the
>     rest of the field metadata is determined by the schema.
>
>     Finally undeclared (schema-less) fields require additional header
>     data:
>     nameId:varint16|dataType:byte?|offset:varint8|length:varint8
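>
>     Reading that section could look roughly like this (reusing the
>     varint readers sketched above; hdr is assumed to be positioned at
>     the start of the undeclared-field headers):
>
>     int findUndeclaredOffset(ByteBuffer hdr, int wantedNameId) {
>       while (hdr.hasRemaining()) {
>         int nameId  = readVarint16(hdr);
>         byte type   = hdr.get();        // dataType:byte
>         int offset  = readVarint8(hdr);
>         int length  = readVarint8(hdr);
>         if (nameId == wantedNameId)
>           return offset; // caller reads `length` bytes at `offset`
>       }
>       return -1; // field not present in this record
>     }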
>
>     I've attached a very rough partial implementation to try and
>     demonstrate the concept.  It won't run because a number of
>     low-level functions aren't implemented, but if you start at the
>     Record class you should be able to follow the code through from
>     the read(int nameId) method.  It demonstrates how you would read a
>     schema/fixed, schema/variable and non-schema field from the record
>     using random access.
>
>     I think I've made one significant mistake in the demo code.  I've
>     used varints to store offset/length for schema-variable-length
>     fields.  This means you cannot find the header for one of those
>     fields without scanning that entire section of the header.  The
>     same is true for schema-less fields, but there it doesn't matter:
>     since we don't know from the schema what fields are there (or
>     their order), we have no option but to scan that part of the
>     header to find the field metadata we are looking for.
>
>     The advantage, though, of storing length as a varint is that in
>     perhaps a majority of cases the field length is going to be less
>     than 127 bytes, which means you can store it in a single byte
>     rather than 4 or 8 for an int or long.
>
>     We have a couple of potential tradeoffs to consider here (only
>     relevant to the schema-declared variable-length fields).  By doing
>     a full scan of the header we can use varints with impunity and
>     gain storage benefits from it.  We can also dispense with storing
>     the offset field altogether, as it can be calculated during the
>     header scan, potentially reducing the header entry for each field
>     from 8 bytes (if you use ints) to as little as 1.  It also removes
>     a potential constraint on maximum field length.  On the other
>     hand, if we use fixed-length fields (like int or long) to store
>     offset/length, we gain random access in the header.
>
>     I can see two edge cases where this sort of scheme would run into
>     difficulties or potentially create a storage penalty: 1) a dataset
>     that has a vast number of different fields, perhaps where the user
>     is for some reason using the field name as a kind of metadata,
>     which would increase the in-memory field_name table; and 2) where
>     a user has adopted the (rather hideous) mongoDB solution of
>     abbreviating field names and taken it to the extreme of
>     single-character field names.  In this case my proposed 16-bit
>     minimum nameIndex size would be 8 bits over what could be achieved.
>
>     The first issue could be dealt with by making the tokenised field
>     name feature available only where the field is declared in the
>     schema (basically your proposal).  But that would also require a
>     flag on the internally stored field_name token to indicate whether
>     it's a schema token or a schema-less full field name.  It could be
>     mitigated by offering an option for full field_name storage (I
>     would imagine this would be a rare use case).
>
>     The second issue (if deemed important enough to address) could
>     also be dealt with by a separate implementation of something like
>     IFieldNameDecoder that uses an 8-bit segment, asking the user to
>     declare a cluster/class as using it if they have a use case for it.
>

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"OrientDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.
