Hi Steve, your previous email shows your skill in this area, so I'm confident you could make a big contribution toward a faster and more efficient release 2.0 ;-)
Lvc@

On 19 February 2014 12:53, Steve <[email protected]> wrote:

> Hi Luca,
>
> I'll give it a go with the real ODB code. The reason I didn't is that I'm
> actually quite new to ODB, even as an end user, but your instructions will
> set me in the right direction. Most of my experience with data
> serialization formats has been with Bitcoin, which was mostly for network
> protocol use cases rather than big-data storage. But that was also a
> high-performance scenario, so I guess there are a lot of parallels.
>
> On 19/02/14 21:33, Luca Garulli wrote:
>
> Hi Steve,
> sorry for the delay.
>
> I like your ideas; I think this is the right direction. varint8 and
> varint16 could be a good way to save space, but we should consider whether
> this slows down some use cases, like partial field loading.
>
> About the POC you created, I think it would be much more useful if you
> played with real documents. It's easy, and you could push it to a separate
> branch to let us and other developers contribute & test. WDYT?
>
> Follow these steps:
>
> (1) Create your serializer
>
> This is the skeleton of the class to implement:
>
> public class BinaryDocumentSerializer implements ORecordSerializer {
>   public static final String NAME = "binarydoc";
>
>   // UN-MARSHALLING
>   public ORecordInternal<?> fromStream(final byte[] iSource) {
>     return null; // TODO: decode the byte[] into a record
>   }
>
>   // PARTIAL UN-MARSHALLING
>   public ORecordInternal<?> fromStream(final byte[] iSource,
>       final ORecordInternal<?> iRecord, String[] iFields) {
>     return null; // TODO: decode only the requested fields
>   }
>
>   // MARSHALLING
>   public byte[] toStream(final ORecordInternal<?> iSource, boolean iOnlyDelta) {
>     return null; // TODO: encode the record into a byte[]
>   }
> }
>
> (2) Register your implementation
>
> ORecordSerializerFactory.instance().register(BinaryDocumentSerializer.NAME,
>     new BinaryDocumentSerializer());
>
> (3) Create a new ODocument subclass
>
> Then create a new class that extends ODocument but uses your implementation:
>
> public class BinaryDocument extends ODocument {
>   protected void setup() {
>     super.setup();
>     _recordFormat = ORecordSerializerFactory.instance()
>         .getFormat(BinaryDocumentSerializer.NAME);
>   }
> }
>
> (4) Try it!
>
> Now try to create a BinaryDocument, set some fields, and call .save(). The
> method BinaryDocumentSerializer.toStream() will be called.
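> For example, a quick smoke test could look like this (a minimal sketch,
> assuming the registration from step 2; the database URL and field values
> are just placeholders):
>
> ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/tmp/bindoctest").create();
> try {
>   BinaryDocument doc = new BinaryDocument();
>   doc.field("name", "Jay");      // set a couple of sample fields
>   doc.field("surname", "Miner");
>   doc.save();                    // invokes BinaryDocumentSerializer.toStream()
> } finally {
>   db.close();
> }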
> Lvc@
>
> On 18 February 2014 06:08, Steve <[email protected]> wrote:
>
>>> The point is: why should I store the field name when I've declared
>>> that a class has such names?
>>
>> Precisely. But I don't think you need to limit it to the declarative
>> case, i.e. schema-full. By using a numbered field_id you cover
>> schema-full, schema-mixed and schema-free cases with a single solution.
>> There are two issues here: performance and storage space. Arguably,
>> improving storage space also improves performance in a big-data context,
>> because it allows caches to retain more logical units in memory.
>>
>> I've been having a good think about this, and I believe I've come up with
>> a viable plan that solves a few problems. It requires schema versioning.
>>
>> I was hesitant to make this suggestion, as it introduces more complexity
>> in order to improve compactness and avoid unnecessary reading of
>> metadata. However, I see from your original proposal that the problem
>> exists there as well:
>>
>> *Cons:*
>>
>> - *Every time the schema changes, a full scan and update of records is
>> needed*
>>
>> The proposal is that record metadata is made of 3 parts plus a
>> meta-header (which in most cases would be 2-3 bytes): fixed-length
>> schema-declared fields, variable-length schema-declared fields, and
>> schema-less fields. The problem with a single schema per class, as you
>> point out, is that if you change the schema you have to update every
>> record. If you insert a field before the last field, you would likely
>> have to rewrite every record from scratch.
>>
>> First, a couple of definitions:
>>
>> varint8: a standard varint built from any number of 1-byte segments. The
>> first bit of each segment is set to 1 if there is a subsequent segment. A
>> number is constructed by concatenating the last 7 bits of each byte. This
>> allows for the following value ranges:
>>   1 byte : 127
>>   2 bytes: 16k
>>   3 bytes: 2m
>>   4 bytes: 268m
>>
>> varint16: same as varint8, but the first segment is 16 bits and all
>> subsequent segments are 8 bits:
>>   2 bytes: 32k
>>   3 bytes: 4m
>>   4 bytes: 536m
>>
>> nameId: an int (or long) index into a field-name array. This index could
>> be one per JVM or one per class. Getting the field name from a nameId is
>> a single array lookup. It is stored on disk as a varint16, allowing 32k
>> names before we need a 3rd byte for name storage.
>>
>> I propose a record header that looks like this:
>>
>> version:varint8|header_length:varint8|variable_length_declared_field_headers|undeclared_field_headers
>>
>> Version is the schema version and would in most cases be only 1 byte; you
>> would need 128 schema changes to make it 2 bytes. This proposal would
>> require a cleanup tool that could scan all records and reset them to the
>> most recent schema version (at which point the version is reset to 0),
>> but it would not be necessary to run it on every schema change. The user
>> could choose if and when to run it. The only time you would need to do a
>> full scan would be if you were introducing some sort of constraint and
>> needed to validate that existing records don't violate it.
>>
>> When a new schema is generated, the user-defined order of fields is
>> stored in each field's schema entry. Internally the fields are rearranged
>> so that all fixed-length fields come first. Because the order and length
>> of the fixed-length fields are known from the schema, there is no need to
>> store offset/length for them in the record header.
>>
>> Variable-length declared fields need only a length and offset; the rest
>> of the field metadata is determined by the schema.
>>
>> Finally, undeclared (schema-less) fields require additional header data:
>>
>> nameId:varint16|dataType:byte?|offset:varint8|length:varint8
>>
>> I've attached a very rough partial implementation to try to demonstrate
>> the concept. It won't run, because a number of low-level functions aren't
>> implemented, but if you start at the Record class you should be able to
>> follow the code through from the read(int nameId) method. It demonstrates
>> how you would read a schema/fixed, schema/variable and non-schema field
>> from the record using random access.
>>
>> I think I've made one significant mistake in the demo code: I've used
>> varints to store offset/length for schema-variable-length fields. This
>> means you cannot find the header entry for one of those fields without
>> scanning that entire section of the header. The same is true for the
>> schema-less section; however, there it doesn't matter: since we don't
>> know from the schema which fields are present (or their order), we have
>> no option but to scan that part of the header to find the field metadata
>> we are looking for.
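>> To make the varint8 definition concrete, here's roughly what the read and
>> write paths look like (a sketch of the scheme described above, not the
>> code from my attachment; the names are just illustrative):
>>
>> // Decode a varint8: the high bit of each byte flags a following segment,
>> // and the low 7 bits of each segment are concatenated, first segment
>> // most significant. The one-element cursor array is advanced as we read.
>> static int readVarint8(byte[] buf, int[] cursor) {
>>   int value = 0;
>>   int b;
>>   do {
>>     b = buf[cursor[0]++] & 0xFF;
>>     value = (value << 7) | (b & 0x7F);  // append the 7 payload bits
>>   } while ((b & 0x80) != 0);            // continuation bit set
>>   return value;
>> }
>>
>> // Encode a varint8, most significant 7-bit group first; returns the new
>> // write position. One byte encodes up to 127, two up to 16k, as above.
>> static int writeVarint8(int value, byte[] buf, int pos) {
>>   int groups = 1;
>>   for (int v = value >>> 7; v != 0; v >>>= 7)
>>     groups++;
>>   for (int i = groups - 1; i >= 0; i--) {
>>     int bits = (value >>> (7 * i)) & 0x7F;
>>     buf[pos++] = (byte) (i > 0 ? bits | 0x80 : bits);
>>   }
>>   return pos;
>> }
>>
>> A varint16 reader would be the same, except the first segment is read as
>> two bytes with a 15-bit payload.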
>> The advantage, though, of storing length as a varint is that perhaps in
>> a majority of cases the field length is going to be less than 127 bytes,
>> which means you can store it in a single byte rather than 4 or 8 for an
>> int or long.
>>
>> We have a couple of potential tradeoffs to consider here (only relevant
>> to the schema-declared variable-length fields). By doing a full scan of
>> the header we can use varints with impunity and gain storage benefits
>> from them. We can also dispense with storing the offset field altogether,
>> as it can be calculated during the header scan, potentially reducing the
>> header entry for each field from 8 bytes (if you use int) to as little as
>> 1. We also remove a potential constraint on maximum field length. On the
>> other hand, if we use fixed-length fields (like int or long) to store
>> offset/length, we gain random access within the header.
>>
>> I can see two edge cases where this sort of scheme would run into
>> difficulties or potentially create a storage penalty: 1) a dataset that
>> has a vast number of different fields, perhaps where the user is for some
>> reason using the field name as a kind of metadata, which would inflate
>> the in-memory field_name table; and 2) where a user has adopted the
>> (rather hideous) MongoDB practice of abbreviating field names and taken
>> it to the extreme of single-character field names, in which case my
>> proposed 16-bit minimum nameId size would be 8 bits over what could be
>> achieved.
>>
>> The first issue could be dealt with by making the tokenised field-name
>> feature available only where the field is declared in the schema
>> (basically your proposal). That would also require a flag on the
>> internally stored field-name token to indicate whether it's a schema
>> token or a schema-less full field name. It could be mitigated by giving
>> an option for full field_name storage (I would imagine this would be a
>> rare use case).
>>
>> The second issue (if deemed important enough to address) could also be
>> dealt with by a separate implementation of something like
>> IFieldNameDecoder that uses an 8-bit segment, asking the user to declare
>> a cluster/class as using it if they have a use case for it.
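>> To make the offset/length tradeoff above concrete, here is a rough
>> sketch of the two lookup styles (the entry layouts and names are my
>> assumptions, not the attached demo; readVarint8 is the sketch from
>> earlier):
>>
>> // Scan variant: each header entry stores only length:varint8, and the
>> // field's data offset is accumulated while scanning. Smallest header,
>> // but O(n) to reach the n-th declared variable-length field.
>> static int offsetByScan(byte[] header, int sectionStart, int fieldIndex) {
>>   int[] cursor = { sectionStart };
>>   int offset = 0;
>>   for (int i = 0; i < fieldIndex; i++)
>>     offset += readVarint8(header, cursor);  // skip earlier fields' data
>>   return offset;
>> }
>>
>> // Fixed variant: 4-byte big-endian offset + 4-byte length per entry.
>> // Costs 8 bytes per field, but the n-th entry is random access.
>> static int offsetByFixedEntry(byte[] header, int sectionStart, int fieldIndex) {
>>   int p = sectionStart + fieldIndex * 8;
>>   return ((header[p] & 0xFF) << 24) | ((header[p + 1] & 0xFF) << 16)
>>        | ((header[p + 2] & 0xFF) << 8) | (header[p + 3] & 0xFF);
>> }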
