Hi Steve,
It's great that you are going to help us.
A few additional points:
1.  We already have binary serialization support; you can see it in
com.orientechnologies.common.serialization.types.OBinarySerializer, so
obviously we should not have several versions of the same thing. I also think
it will be interesting for you to look at this issue and discussion:
https://github.com/orientechnologies/orientdb/issues/681#issuecomment-28466948.
There we discussed serialization of a single record (sorry, I had no time to
analyze it deeply because of a lot of other events), but in the case of an
SQL query you have to process millions of them.
2.  We are also working on binary compatibility mechanics (I mean
compatibility between storage formats); without it, current users will not
be able to adopt new features, especially the new binary serialization.
3.  We have many third-party drivers for the binary protocol (which pass
serialized records to the client side), so we have to think about how not to
break the functionality of those drivers.



On Wed, Feb 19, 2014 at 1:53 PM, Steve <shadders....@gmail.com> wrote:

>  Hi Luca,
>
> I'll give it a go with the real ODB code.  The reason I didn't is because
> I'm actually quite new to ODB even as an end user but your instructions
> will set me in the right direction.  Most of my experience with data
> serialization formats has been with Bitcoin which was mostly for network
> protocol use cases rather than big-data storage.  But that was also a high
> performance scenario so I guess there are a lot of parallels.
>
>
> On 19/02/14 21:33, Luca Garulli wrote:
>
>  Hi Steve,
>  sorry for the delay.
>
>  I like your ideas; I think this is the right direction. varint8 and
> varint16 could be a good way to save space, but we should consider whether
> this slows down some use cases, like partial field loading.
>
>  Regarding the POC you created, I think it would be much more useful if you
> played with real documents. It's easy, and you could push it to a separate
> branch to let us and other developers contribute & test. WDYT?
>
>  Follow these steps:
>
>   (1) create your serializer
>
>  This is the skeleton of the class to implement:
>
>  public class BinaryDocumentSerializer implements ORecordSerializer {
>    public static final String NAME = "binarydoc";
>
>    // UN-MARSHALLING
>    public ORecordInternal<?> fromStream(final byte[] iSource) {
>    }
>
>    // PARTIAL UN-MARSHALLING
>    public ORecordInternal<?> fromStream(final byte[] iSource,
>        final ORecordInternal<?> iRecord, String[] iFields) {
>    }
>
>    // MARSHALLING
>    public byte[] toStream(final ORecordInternal<?> iSource, boolean iOnlyDelta) {
>    }
>  }
>
>  (2) register your implementation
>
>  ORecordSerializerFactory.instance().register(BinaryDocumentSerializer.NAME,
> new BinaryDocumentSerializer());
>
>  (3) create a new ODocument subclass
>
>  Then create a new class that extends ODocument but uses your
> implementation:
>
>  public class BinaryDocument extends ODocument {
>   protected void setup() {
>     super.setup();
>     _recordFormat =
> ORecordSerializerFactory.instance().getFormat(BinaryDocumentSerializer.NAME);
>   }
>  }
>
>  (4) Try it!
>
>  And now try to create a BinaryDocument, set fields and call .save(). The
> method BinaryDocumentSerializer.toStream() will be called.
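>
>  For example, a minimal sketch of such a test (the "plocal:" URL is just a
> placeholder, and it assumes the registration from step (2) has already run):
>
>  ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/tmp/binarydoc-test").create();
>  try {
>    BinaryDocument doc = new BinaryDocument();
>    doc.field("name", "Steve");
>    doc.field("age", 42);
>    // BinaryDocumentSerializer.toStream() is invoked by save()
>    doc.save();
>  } finally {
>    db.close();
>  }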
>
>
>
>  Lvc@
>
>
>
> On 18 February 2014 06:08, Steve <shadders....@gmail.com> wrote:
>
>>
>>   The point is: why should I store the field name when I've declared
>> that a class has such names?
>>
>>
>>  Precisely.  But I don't think you need to limit it to the declarative
>> case... i.e. schema-full.  By using a numbered field_id you cover
>> schema-full, schema-mixed and schema-free cases with a single solution.
>> There are two issues here: performance and storage space.  Arguably,
>> improving storage space also improves performance in a big-data context
>> because it allows caches to retain more logical units in memory.
>>
>>
>> I've been having a good think about this and I think I've come up with a
>> viable plan that solves a few problems.  It requires schema versioning.
>>
>> I was hesitant to make this suggestion as it introduces more complexity in
>> order to improve compactness and avoid unnecessary reading of metadata.
>> However, I see from your original proposal that the problem exists there as
>> well:
>>
>> *Cons:*
>>
>>    - *Every time the schema changes, a full scan and update of record is
>>    needed*
>>
>> The proposal is that record metadata is made of 3 parts plus a meta-header
>> (which in most cases would be 2-3 bytes): fixed-length schema-declared
>> fields, variable-length schema-declared fields, and schema-less fields.
>> The problem, as you point out, with a single schema per class is that if
>> you change the schema you have to update every record; if you insert a
>> field before the last field you would likely have to rewrite every record
>> from scratch.
>>
>> First, a couple of definitions:
>>
>> varint8: a standard varint that is built from any number of 1 byte
>> segments.  The first bit of each segment is set to 1 if there is a
>> subsequent segment.  A number is constructed by concatenating the last 7
>> bits of each byte.  This allows for the following value ranges:
>> 1 byte : 127
>> 2 bytes: 16k
>> 3 bytes: 2m
>> 4 bytes: 268m
>>
>> varint16: same as varint8 but the first segment is 16 bits and all
>> subsequent are 8 bits
>> 2 bytes: 32k
>> 3 bytes: 4m
>> 4 bytes: 536m
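>>
>> To make the encoding concrete, here's a rough Java sketch of varint8
>> write/read against a byte array (illustrative only, not existing ODB code;
>> it assumes the least-significant 7-bit group is written first, and a
>> varint16 would follow the same pattern with a 16-bit first segment):
>>
>> // Writes value as a varint8: 7 data bits per byte, high bit set
>> // on every byte that is followed by another segment.
>> static int writeVarInt8(byte[] buf, int pos, int value) {
>>   while ((value & ~0x7F) != 0) {
>>     buf[pos++] = (byte) ((value & 0x7F) | 0x80);
>>     value >>>= 7;
>>   }
>>   buf[pos++] = (byte) value;
>>   return pos;                  // new write position
>> }
>>
>> // Reads a varint8; pos[0] is advanced past the bytes consumed.
>> static int readVarInt8(byte[] buf, int[] pos) {
>>   int value = 0, shift = 0;
>>   byte b;
>>   do {
>>     b = buf[pos[0]++];
>>     value |= (b & 0x7F) << shift;
>>     shift += 7;
>>   } while ((b & 0x80) != 0);
>>   return value;
>> }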
>>
>> nameId: an int (or long) index from a field name array.  This index could
>> be one per JVM or one per class.  Getting the field name using the nameId
>> is a single array lookup.  This is stored on disk as a varint16 allowing
>> 32k names before we need to use a 3rd byte for name storage.
>>
>> I propose a record header that looks like this:
>>
>> version:varint8|header_length:varint8|variable_length_declared_field_headers|undeclared_field_headers
>>
>> Version is the schema version and would in most cases be only 1 byte; you
>> would need 128 schema changes to make it 2 bytes.  This proposal would
>> require a cleanup tool that could scan all records and reset them to the
>> most recent schema version (at which point the version is reset to 0), but
>> it would not be necessary on every schema change.  The user could choose if
>> and when to run it.  The only time you would need to do a full scan would
>> be if you were introducing some sort of constraint and needed to validate
>> that existing records don't violate the constraint.
>>
>> When a new schema is generated, the user-defined order of fields is stored
>> in each field's Schema entry.  Internally the fields are rearranged so that
>> all fixed-length fields come first.  Because the order and length of these
>> fields are known from the schema, there is no need to store offset/length
>> in the record header.
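>>
>> To illustrate that, a tiny sketch of how a fixed-length declared field's
>> offset could be derived from the schema alone (the names here are made up;
>> fixedFieldSizes would come from the schema, in the rearranged order):
>>
>> // No header entry is needed: the offset is just the sum of the sizes
>> // of the fixed-length fields that precede this one.
>> static int fixedFieldOffset(int[] fixedFieldSizes, int fieldIndex, int fixedSectionStart) {
>>   int offset = fixedSectionStart;
>>   for (int i = 0; i < fieldIndex; i++)
>>     offset += fixedFieldSizes[i];
>>   return offset;
>> }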
>>
>> Variable-length declared fields need only a length and an offset; the rest
>> of the field metadata is determined by the schema.
>>
>> Finally undeclared (schema-less) fields require additional header data:
>> nameId:varint16|dataType:byte?|offset:varint8|length:varint8
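>>
>> As a rough sketch of how a lookup against that layout might work
>> (FieldHeader and readVarInt16 are made-up helpers, readVarInt16 following
>> the same pattern as the varint8 reader above):
>>
>> // Scans the schema-less section of the header for the wanted nameId;
>> // returns null if the record does not contain that field.
>> static FieldHeader findUndeclaredField(byte[] record, int sectionStart,
>>                                        int sectionEnd, int wantedNameId) {
>>   int[] pos = { sectionStart };
>>   while (pos[0] < sectionEnd) {
>>     int nameId = readVarInt16(record, pos);
>>     byte dataType = record[pos[0]++];
>>     int offset = readVarInt8(record, pos);
>>     int length = readVarInt8(record, pos);
>>     if (nameId == wantedNameId)
>>       return new FieldHeader(nameId, dataType, offset, length);
>>   }
>>   return null;
>> }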
>>
>> I've attached a very rough partial implementation to try to demonstrate
>> the concept.  It won't run because a number of low-level functions aren't
>> implemented, but if you start at the Record class you should be able to
>> follow the code through from the read(int nameId) method.  It demonstrates
>> how you would read a schema/fixed, schema/variable and non-schema field
>> from the record using random access.
>>
>> I think I've made one significant mistake in the demo code: I've used
>> varints to store offset/length for schema-variable-length fields.  This
>> means you cannot find the header for one of those fields without scanning
>> that entire section of the header.  The same is true for schema-less
>> fields; however, in that case it doesn't matter: since we don't know from
>> the schema what fields are there (or their order), we have no option but to
>> scan that part of the header to find the field metadata we are looking for.
>>
>> The advantage, though, of storing the length as a varint is that in the
>> majority of cases the field length is probably going to be no more than 127
>> bytes, which means you can store it in a single byte rather than the 4 or 8
>> bytes of an int or long.
>>
>> We have a couple of potential tradeoffs to consider here (only relevant
>> to the schema-declared variable-length fields).  By doing a full scan of
>> the header we can use varints with impunity and gain storage benefits from
>> them.  We can also dispense with storing the offset field altogether, as it
>> can be calculated during the header scan, potentially reducing the header
>> entry for each field from 8 bytes (if you use ints) to as little as 1.
>> It also removes a potential constraint on maximum field length.  On the
>> other hand, if we use fixed-length fields (like int or long) to store
>> offset/length, we gain random access within the header.
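>>
>> A sketch of the scan-based variant for the declared variable-length
>> section, where only the length is stored and the offset is accumulated
>> while walking the header (again, the helper names are made up):
>>
>> // Only each field's length is stored (as a varint); offsets are computed
>> // on the fly, so no offset bytes are kept in the header at all.
>> static FieldHeader findDeclaredVariableField(byte[] record, int headerStart,
>>                                              int dataStart, int fieldCount,
>>                                              int wantedIndex) {
>>   int[] pos = { headerStart };
>>   int offset = dataStart;      // data of the first variable-length field
>>   for (int i = 0; i < fieldCount; i++) {
>>     int length = readVarInt8(record, pos);
>>     if (i == wantedIndex)
>>       return new FieldHeader(i, (byte) 0 /* type from schema */, offset, length);
>>     offset += length;
>>   }
>>   return null;
>> }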
>>
>> I can see two edge cases where this sort of scheme would run into
>> difficulties or potentially create a storage penalty: 1) a dataset that has
>> a vast number of different fields, perhaps where the user is for some
>> reason using the field name as a kind of metadata, which would grow the
>> in-memory field_name table; and 2) where a user has adopted the (rather
>> hideous) MongoDB practice of abbreviating field names and taken it to the
>> extreme of single-character field names.  In that case my proposed 16-bit
>> minimum nameIndex size would be 8 bits over what could be achieved.
>>
>> The first issue could be dealt with by making the tokenised field name
>> feature available only where the field is declared in the schema (basically
>> your proposal), but that would also require a flag on the internally stored
>> field_name token to indicate whether it's a schema token or a schema-less
>> full field name.  It could also be mitigated by giving an option for full
>> field_name storage (I would imagine this would be a rare use case).
>>
>> The second issue (if deemed important enough to address) could also be
>> dealt with by a separate implementation of something like IFieldNameDecoder
>> that uses an 8-bit first segment, and by asking the user to declare a
>> cluster/class as using it if they have a use case for that.
>>



-- 
Best regards,
Andrey Lomakin.

Orient Technologies
the Company behind OrientDB
