:-)
On 19 February 2014 13:13, Steve Coughlan <[email protected]> wrote:

> Flattery will get you everywhere.. lol :)
>
> On Feb 19, 2014 10:11 PM, "Luca Garulli" <[email protected]> wrote:
>
>> Hi Steve,
>> your previous email shows me your skill in this area, so I'm confident
>> you could give us a big contribution towards a faster and more efficient
>> release 2.0 ;-)
>>
>> Lvc@
>>
>> On 19 February 2014 12:53, Steve <[email protected]> wrote:
>>
>>> Hi Luca,
>>>
>>> I'll give it a go with the real ODB code. The reason I didn't is
>>> because I'm actually quite new to ODB, even as an end user, but your
>>> instructions will set me in the right direction. Most of my experience
>>> with data serialization formats has been with Bitcoin, which was mostly
>>> for network protocol use cases rather than big-data storage. But that
>>> was also a high-performance scenario, so I guess there are a lot of
>>> parallels.
>>>
>>> On 19/02/14 21:33, Luca Garulli wrote:
>>>
>>> Hi Steve,
>>> sorry for the delay.
>>>
>>> I like your ideas; I think this is the right direction. varint8 and
>>> varint16 could be a good way to save space, but we should consider
>>> whether this slows down some use cases, like partial field loading.
>>>
>>> About the POC you created: I think it would be much more useful if you
>>> played with real documents. It's easy, and you could push it to a
>>> separate branch to let us and other developers contribute & test. WDYT?
>>>
>>> Follow these steps:
>>>
>>> (1) create your serializer
>>>
>>> This is the skeleton of the class to implement:
>>>
>>> public class BinaryDocumentSerializer implements ORecordSerializer {
>>>   public static final String NAME = "binarydoc";
>>>
>>>   // UN-MARSHALLING
>>>   public ORecordInternal<?> fromStream(final byte[] iSource) {
>>>     return null; // TODO: decode the record
>>>   }
>>>
>>>   // PARTIAL UN-MARSHALLING
>>>   public ORecordInternal<?> fromStream(final byte[] iSource,
>>>       final ORecordInternal<?> iRecord, String[] iFields) {
>>>     return null; // TODO: decode only the requested fields
>>>   }
>>>
>>>   // MARSHALLING
>>>   public byte[] toStream(final ORecordInternal<?> iSource,
>>>       boolean iOnlyDelta) {
>>>     return null; // TODO: encode the record
>>>   }
>>> }
>>>
>>> (2) register your implementation
>>>
>>> ORecordSerializerFactory.instance().register(BinaryDocumentSerializer.NAME,
>>>     new BinaryDocumentSerializer());
>>>
>>> (3) create a new ODocument subclass
>>>
>>> Then create a new class that extends ODocument but uses your
>>> implementation:
>>>
>>> public class BinaryDocument extends ODocument {
>>>   protected void setup() {
>>>     super.setup();
>>>     _recordFormat = ORecordSerializerFactory.instance()
>>>         .getFormat(BinaryDocumentSerializer.NAME);
>>>   }
>>> }
>>>
>>> (4) Try it!
>>>
>>> And now try to create a BinaryDocument, set fields and call .save().
>>> The method BinaryDocumentSerializer.toStream() will be called.
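>>>
>>> For example, a rough, untested sketch (the plocal URL is just an
>>> example):
>>>
>>> import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
>>>
>>> ODatabaseDocumentTx db =
>>>     new ODatabaseDocumentTx("plocal:/tmp/bindoc-test").create();
>>> try {
>>>   BinaryDocument doc = new BinaryDocument();
>>>   doc.field("name", "Jay");  // set a couple of fields
>>>   doc.field("age", 33);
>>>   doc.save();                // invokes BinaryDocumentSerializer.toStream()
>>> } finally {
>>>   db.close();
>>> }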
>>>
>>> Lvc@
>>>
>>> On 18 February 2014 06:08, Steve <[email protected]> wrote:
>>>
>>>>> The point is: why should I store the field name when I've declared
>>>>> that a class has such names?
>>>>
>>>> Precisely. But I don't think you need to limit it to the declarative
>>>> case, i.e. schema-full. By using a numbered field_id you cover the
>>>> schema-full, schema-mixed and schema-free cases with a single
>>>> solution. There are two issues here: performance and storage space.
>>>> Arguably, improving storage space also improves performance in a
>>>> big-data context because it allows caches to retain more logical units
>>>> in memory.
>>>>
>>>> I've been having a good think about this and I think I've come up with
>>>> a viable plan that solves a few problems. It requires schema
>>>> versioning.
>>>>
>>>> I was hesitant to make this suggestion as it introduces more
>>>> complexity in order to improve compactness and avoid unnecessary
>>>> reading of metadata. However, I see from your original proposal that
>>>> the problem exists there as well:
>>>>
>>>> Cons:
>>>> - Every time the schema changes, a full scan and update of records is
>>>>   needed
>>>>
>>>> The proposal is that record metadata is made of 3 parts plus a
>>>> meta-header (which in most cases would be 2-3 bytes): fixed-length
>>>> schema-declared fields, variable-length schema-declared fields, and
>>>> schema-less fields. The problem with a single schema per class, as you
>>>> point out, is that if you change the schema you have to update every
>>>> record. If you insert a field before the last field you would likely
>>>> have to rewrite every record from scratch.
>>>>
>>>> First, a couple of definitions:
>>>>
>>>> varint8: a standard varint built from any number of 1-byte segments.
>>>> The first bit of each segment is set to 1 if there is a subsequent
>>>> segment. A number is constructed by concatenating the last 7 bits of
>>>> each byte. This allows for the following value ranges:
>>>> 1 byte : 127
>>>> 2 bytes: 16k
>>>> 3 bytes: 2m
>>>> 4 bytes: 268m
>>>>
>>>> varint16: same as varint8, but the first segment is 16 bits and all
>>>> subsequent segments are 8 bits:
>>>> 2 bytes: 32k
>>>> 3 bytes: 4m
>>>> 4 bytes: 536m
>>>>
>>>> nameId: an int (or long) index into a field-name array. This index
>>>> could be one per JVM or one per class. Getting the field name using
>>>> the nameId is a single array lookup. It is stored on disk as a
>>>> varint16, allowing 32k names before we need to use a 3rd byte for name
>>>> storage.
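>>>>
>>>> In code the two encodings would look roughly like this (a sketch only;
>>>> it assumes the low 7 bits are written first, and the names are made
>>>> up, not taken from the attached demo):
>>>>
>>>> import java.io.ByteArrayInputStream;
>>>> import java.io.ByteArrayOutputStream;
>>>>
>>>> public class Varints {
>>>>   // varint8: 1-byte segments, high bit = "another segment follows".
>>>>   public static void writeVarint8(ByteArrayOutputStream out, int value) {
>>>>     while ((value & ~0x7F) != 0) {
>>>>       out.write((value & 0x7F) | 0x80); // continuation bit set
>>>>       value >>>= 7;
>>>>     }
>>>>     out.write(value); // final segment, continuation bit clear
>>>>   }
>>>>
>>>>   public static int readVarint8(ByteArrayInputStream in) {
>>>>     int value = 0, shift = 0, b;
>>>>     do {
>>>>       b = in.read();
>>>>       value |= (b & 0x7F) << shift;
>>>>       shift += 7;
>>>>     } while ((b & 0x80) != 0);
>>>>     return value;
>>>>   }
>>>>
>>>>   // varint16: a 16-bit first segment (15 value bits + continuation
>>>>   // bit), then 8-bit segments as above: 2 bytes = 32k, 3 = 4m, 4 = 536m.
>>>>   public static int readVarint16(ByteArrayInputStream in) {
>>>>     int first = (in.read() << 8) | in.read();
>>>>     int value = first & 0x7FFF;
>>>>     if ((first & 0x8000) == 0)
>>>>       return value;
>>>>     int shift = 15, b;
>>>>     do {
>>>>       b = in.read();
>>>>       value |= (b & 0x7F) << shift;
>>>>       shift += 7;
>>>>     } while ((b & 0x80) != 0);
>>>>     return value;
>>>>   }
>>>> }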
>>>>
>>>> I propose a record header that looks like this:
>>>>
>>>> version:varint8|header_length:varint8|variable_length_declared_field_headers|undeclared_field_headers
>>>>
>>>> Version is the schema version and would in most cases be only 1 byte;
>>>> you would need 128 schema changes to make it 2 bytes. This proposal
>>>> would require a cleanup tool that could scan all records and reset
>>>> them to the most recent schema version (at which point the version is
>>>> reset to 0), but it would not be necessary on every schema change: the
>>>> user could choose if and when to run it. The only time you would need
>>>> to do a full scan would be if you were introducing some sort of
>>>> constraint and needed to validate that existing records don't violate
>>>> it.
>>>>
>>>> When a new schema is generated, the user-defined order of fields is
>>>> stored in each field's schema entry. Internally the fields are
>>>> rearranged so that all fixed-length fields come first. Because the
>>>> order and length of those fields is known from the schema, there is no
>>>> need to store offset/length in the record header.
>>>>
>>>> Variable-length declared fields need only a length and an offset; the
>>>> rest of the field metadata is determined by the schema.
>>>>
>>>> Finally, undeclared (schema-less) fields require additional header
>>>> data:
>>>>
>>>> nameId:varint16|dataType:byte?|offset:varint8|length:varint8
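>>>>
>>>> Reading one of those entries back would look something like this
>>>> (sketch only, reusing the varint readers above; the helper name is
>>>> made up, and it assumes the stream holds just the schema-less section
>>>> of the header):
>>>>
>>>> // Scan the schema-less header entries for a given nameId and return
>>>> // { dataType, offset, length }, or null if the field isn't present.
>>>> static int[] findUndeclaredField(ByteArrayInputStream header,
>>>>                                  int wantedNameId) {
>>>>   while (header.available() > 0) {
>>>>     int nameId = Varints.readVarint16(header);
>>>>     int dataType = header.read();
>>>>     int offset = Varints.readVarint8(header);
>>>>     int length = Varints.readVarint8(header);
>>>>     if (nameId == wantedNameId)
>>>>       return new int[] { dataType, offset, length };
>>>>   }
>>>>   return null;
>>>> }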
>>>>
>>>> I've attached a very rough partial implementation to try and
>>>> demonstrate the concept. It won't run because a number of low-level
>>>> functions aren't implemented, but if you start at the Record class you
>>>> should be able to follow the code through from the read(int nameId)
>>>> method. It demonstrates how you would read a schema/fixed,
>>>> schema/variable and non-schema field from the record using random
>>>> access.
>>>>
>>>> I think I've made one significant mistake in the demo code: I've used
>>>> varints to store offset/length for schema-variable-length fields. This
>>>> means you cannot find the header for one of those fields without
>>>> scanning that entire section of the header. The same is true for the
>>>> schema-less section; however, in that case it doesn't matter: since we
>>>> don't know from the schema what fields are there (or their order), we
>>>> have no option but to scan that part of the header to find the field
>>>> metadata we are looking for.
>>>>
>>>> The advantage, though, of storing length as a varint is that in
>>>> perhaps a majority of cases field length is going to be no more than
>>>> 127 bytes, which means you can store it in a single byte rather than 4
>>>> or 8 for an int or long.
>>>>
>>>> We have a couple of potential tradeoffs to consider here (only
>>>> relevant to the schema-declared variable-length fields). By doing a
>>>> full scan of the header we can use varints with impunity and gain
>>>> storage benefits from it. We can also dispense with storing the offset
>>>> field altogether, as it can be calculated during the header scan (see
>>>> the sketch at the end of this mail), potentially reducing the header
>>>> entry for each field from 8 bytes (an int each for offset and length)
>>>> to as little as 1. We also remove a potential constraint on maximum
>>>> field length. On the other hand, if we use fixed-length fields (int or
>>>> long) to store offset/length, we gain random access within the header.
>>>>
>>>> I can see two edge cases where this sort of scheme would run into
>>>> difficulties or potentially create a storage penalty:
>>>>
>>>> 1) a dataset that has a vast number of different fields, perhaps where
>>>> the user is for some reason using the field name as a kind of
>>>> metadata, which would increase the in-memory field_name table; and
>>>>
>>>> 2) where a user has adopted the (rather hideous) mongoDB solution of
>>>> abbreviating field names and taken it to the extreme of a
>>>> single-character field name. In this case my proposed 16-bit minimum
>>>> nameIndex size would be 8 bits over what could be achieved.
>>>>
>>>> The first issue could be dealt with by making the tokenised field-name
>>>> feature available only where the field is declared in the schema
>>>> (basically your proposal). That would also require a flag on the
>>>> internally stored field_name token to indicate whether it's a schema
>>>> token or a schema-less full field name. It could be mitigated by
>>>> giving an option for full field_name storage (I would imagine this
>>>> would be a rare use case).
>>>>
>>>> The second issue (if deemed important enough to address) could be
>>>> dealt with by a separate implementation of something like
>>>> IFieldNameDecoder that uses an 8-bit first segment, and asking the
>>>> user to declare a cluster/class as using it if they have a use case
>>>> for it.
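>>>>
>>>> PS: the scan-and-accumulate variant for the schema-declared
>>>> variable-length section would look roughly like this (a sketch; the
>>>> names are made up, and it assumes the schema supplies the field count
>>>> and order for the record's version, with dataStart pointing just past
>>>> the fixed-length data):
>>>>
>>>> // Header stores lengths only; offsets are accumulated while scanning.
>>>> // Returns { offset, length } for the wanted field index, or null.
>>>> static int[] findDeclaredVariableField(ByteArrayInputStream header,
>>>>     int dataStart, int fieldCount, int wantedIndex) {
>>>>   int offset = dataStart;
>>>>   for (int i = 0; i < fieldCount; i++) {
>>>>     int length = Varints.readVarint8(header); // 1 byte when < 128
>>>>     if (i == wantedIndex)
>>>>       return new int[] { offset, length };
>>>>     offset += length;
>>>>   }
>>>>   return null;
>>>> }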
