Hi,

Just wanted to chime in and provide additional encouragement. This would matter a great deal to us.
Regards,
-Stefan

On Wednesday, 19 February 2014 12:37:06 UTC, Lvc@ wrote:

:-)

On 19 February 2014 13:13, Steve Coughlan <[email protected]> wrote:

Flattery will get you everywhere.. lol :)

On Feb 19, 2014 10:11 PM, "Luca Garulli" <[email protected]> wrote:

Hi Steve,
your previous email shows me your skill on this, so I'm confident you could give us a big contribution towards a faster and more efficient release 2.0 ;-)

Lvc@

On 19 February 2014 12:53, Steve <[email protected]> wrote:

Hi Luca,

I'll give it a go with the real ODB code. The reason I didn't is that I'm actually quite new to ODB, even as an end user, but your instructions will set me in the right direction. Most of my experience with data serialization formats has been with Bitcoin, which was mostly for network-protocol use cases rather than big-data storage. But that was also a high-performance scenario, so I guess there are a lot of parallels.

On 19/02/14 21:33, Luca Garulli wrote:

Hi Steve,
sorry for such a delay.

I like your ideas; I think this is the right direction. varint8 and varint16 could be a good way to save space, but we should consider when this slows down some use cases, like partial field loading.

About the POC you created, I think it would be much more useful if you played with real documents. It's easy, and you could push it to a separate branch to let us and other developers contribute and test. WDYT?
Follow these steps:

(1) Create your serializer

This is the skeleton of the class to implement:

    public class BinaryDocumentSerializer implements ORecordSerializer {
      public static final String NAME = "binarydoc";

      // UN-MARSHALLING
      public ORecordInternal<?> fromStream(final byte[] iSource) {
        return null; // TODO: implement
      }

      // PARTIAL UN-MARSHALLING
      public ORecordInternal<?> fromStream(final byte[] iSource,
          final ORecordInternal<?> iRecord, String[] iFields) {
        return null; // TODO: implement
      }

      // MARSHALLING
      public byte[] toStream(final ORecordInternal<?> iSource, boolean iOnlyDelta) {
        return null; // TODO: implement
      }
    }

(2) Register your implementation

    ORecordSerializerFactory.instance().register(BinaryDocumentSerializer.NAME,
        new BinaryDocumentSerializer());

(3) Create a new ODocument subclass

Then create a new class that extends ODocument but uses your implementation:

    public class BinaryDocument extends ODocument {
      protected void setup() {
        super.setup();
        _recordFormat = ORecordSerializerFactory.instance()
            .getFormat(BinaryDocumentSerializer.NAME);
      }
    }

(4) Try it!

Now create a BinaryDocument, set some fields and call .save(). The method BinaryDocumentSerializer.toStream() will be called.

Lvc@

On 18 February 2014 06:08, Steve <[email protected]> wrote:

> The point is: why should I store the field name when I've declared that a class has such names?

Precisely. But I don't think you need to limit it to the declarative case, i.e. schema-full. By using a numbered field_id you cover the schema-full, schema-mixed and schema-free cases with a single solution. There are two issues here: performance and storage space.
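For illustration, the numbered field_id could be as simple as an interning table; here's a rough standalone sketch with made-up names (FieldNameRegistry, idFor, nameFor), not real ODB code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a per-class (or per-JVM) field-name registry backing the
// numbered field_id idea. All names here are hypothetical, not ODB API.
public class FieldNameRegistry {
  private final Map<String, Integer> idByName = new HashMap<String, Integer>();
  private final List<String> nameById = new ArrayList<String>();

  // Returns the existing id for a field name, or assigns the next free one.
  // Works the same whether or not the field was declared in the schema,
  // which is what lets one mechanism cover schema-full/mixed/free.
  public int idFor(final String name) {
    final Integer id = idByName.get(name);
    if (id != null)
      return id;
    final int next = nameById.size();
    idByName.put(name, next);
    nameById.add(name);
    return next;
  }

  // Single list lookup to go back from the stored id to the name.
  public String nameFor(final int id) {
    return nameById.get(id);
  }
}
```

Records would then store only the small integer id instead of the repeated name string.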
Arguably, improving storage space also improves performance in a big-data context, because it allows caches to retain more logical units in memory.

I've been having a good think about this and I think I've come up with a viable plan that solves a few problems. It requires schema versioning.

I was hesitant to make this suggestion, as it introduces more complexity in order to improve compactness and reduce unnecessary reading of metadata. However, I see from your original proposal that the problem exists there as well:

Cons:
- Every time the schema changes, a full scan and update of records is needed

The proposal is that record metadata is made of 3 parts plus a meta-header (which in most cases would be 2-3 bytes): fixed-length schema-declared fields, variable-length schema-declared fields and schema-less fields. The problem, as you point out, with a single schema per class is that if you change the schema you have to update every record. If you insert a field before the last field you would likely have to rewrite every record from scratch.

First, a couple of definitions:

varint8: a standard varint built from any number of 1-byte segments. The first bit of each segment is set to 1 if there is a subsequent segment. A number is constructed by concatenating the last 7 bits of each byte. This allows for the following value ranges:
1 byte : 127
2 bytes: 16k
3 bytes: 2m
4 bytes: 268m

varint16: same as varint8, but the first segment is 16 bits and all subsequent segments are 8 bits:
2 bytes: 32k
3 bytes: 4m
4 bytes: 536m

nameId: an int (or long) index into a field-name array. This index could be one per JVM or one per class. Getting the field name from the nameId is a single array lookup.
This is stored on disk as a varint16, allowing 32k names before we need a 3rd byte for name storage.

I propose a record header that looks like this:

version:varint8|header_length:varint8|variable_length_declared_field_headers|undeclared_field_headers

version is the schema version and would in most cases be only 1 byte; you would need 128 schema changes to make it 2 bytes. This proposal would require a cleanup tool that could scan all records and reset them to the most recent schema version (at which point version is reset to 0), but it wouldn't be necessary on every schema change. The user could choose if and when to run it. The only time you would need a full scan would be if you are introducing some sort of constraint and needed to validate that existing records don't violate it.

When a new schema is generated, the user-defined order of fields is stored in each field's schema entry. Internally the fields are rearranged so that all fixed-length fields come first. Because the order and length of these fields is known by the schema, there is no need to store offset/length in the record header.

Variable-length declared fields need only a length and offset; the rest of the field metadata is determined by the schema.

Finally, undeclared (schema-less) fields require additional header data:

nameId:varint16|dataType:byte?|offset:varint8|length:varint8

I've attached a very rough partial implementation to try to demonstrate the concept. It won't run, because a number of low-level functions aren't implemented, but if you start at the Record class you should be able to follow the code through from the read(int nameId) method.
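As an aside, the varint8/varint16 encodings above might look something like this in plain Java; a rough standalone sketch, not the attached code, and the least-significant-segment-first byte order is an assumption of the sketch:

```java
import java.io.ByteArrayOutputStream;

// Standalone sketch of the varint8/varint16 encodings described above.
// Segment order (least-significant 7 bits first) is an assumption.
public class VarintSketch {

  // varint8: 7 payload bits per byte, high bit = "another segment follows".
  public static byte[] writeVarint8(long value) {
    final ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((value & ~0x7FL) != 0) {
      out.write((int) ((value & 0x7F) | 0x80)); // continuation bit set
      value >>>= 7;
    }
    out.write((int) (value & 0x7F)); // final segment, continuation bit clear
    return out.toByteArray();
  }

  public static long readVarint8(final byte[] buf, int offset) {
    long result = 0;
    int shift = 0;
    while (true) {
      final byte b = buf[offset++];
      result |= (long) (b & 0x7F) << shift;
      if ((b & 0x80) == 0)
        return result;
      shift += 7;
    }
  }

  // varint16: first segment is 16 bits (15 payload bits + continuation bit),
  // subsequent segments are varint8-style bytes.
  public static byte[] writeVarint16(long value) {
    final ByteArrayOutputStream out = new ByteArrayOutputStream();
    int first = (int) (value & 0x7FFF);
    value >>>= 15;
    if (value != 0)
      first |= 0x8000;          // continuation bit in the 16-bit segment
    out.write(first >>> 8);
    out.write(first & 0xFF);
    while (value != 0) {
      final int chunk = (int) (value & 0x7F);
      value >>>= 7;
      out.write(value != 0 ? chunk | 0x80 : chunk);
    }
    return out.toByteArray();
  }

  public static long readVarint16(final byte[] buf, final int offset) {
    final int first = ((buf[offset] & 0xFF) << 8) | (buf[offset + 1] & 0xFF);
    long result = first & 0x7FFF;
    if ((first & 0x8000) == 0)
      return result;
    int shift = 15;
    int i = offset + 2;
    while (true) {
      final byte b = buf[i++];
      result |= (long) (b & 0x7F) << shift;
      if ((b & 0x80) == 0)
        return result;
      shift += 7;
    }
  }
}
```

With this layout a length under 128 costs 1 byte as a varint8, and a nameId under 32k always stays at 2 bytes as a varint16, matching the ranges listed above.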
It demonstrates how you would read a schema/fixed, schema/variable and non-schema field from the record using random access.

I think I've made one significant mistake in the demo code: I've used varints to store offset/length for schema-variable-length fields. This means you cannot find the header for one of those fields without scanning that entire section of the header. The same is true for schema-less fields; however, in that case it doesn't matter, since we don't know what fields are there (or their order) from the schema, so we have no option but to scan that part of the header to find the field metadata we are looking for.

The advantage of storing length as a varint, though, is that in perhaps a majority of cases the field length is going to be less than 127 bytes, which means you can store it in a single byte rather than 4 or 8 for an int or long.

We have a couple of potential tradeoffs to consider here (only relevant to the schema-declared variable-length fields). By doing a full scan of the header we can use varints with impunity and gain storage benefits from it. We can also dispense with storing the offset field altogether, as it can be calculated during the header scan, potentially reducing the header entry for each field from 8 bytes (if you use int) to as little as 1. We also remove a potential constraint on maximum field length. On the other hand, if we use fixed-length fields (like int or long) to store offset/length, we gain random access within the header.

I can see two edge cases where this sort of scheme would run into difficulties or potentially create a storage penalty: 1) a dataset that has a vast number of different fields, perhaps where the user is for some reason using the field name as a kind of metadata, which would increase the in-memory field_name table; and 2) where a user has adopted the (rather hideous) MongoDB solution of abbreviating field names and taken it to the extreme of single-character field names. In this case my proposed 16-bit minimum nameId size would be 8 bits over what could be achieved.

The first issue could be dealt with by making the tokenised field-name feature available only where the field is declared in the schema (basically your proposal), but that would also require a flag on the internally stored field_name token to indicate whether it's a schema token or a schema-less full field name. It could be mitigated by giving an option for full field_name storage (I would imagine this would be a rare use case).

The second issue (if deemed important enough to address) could also be dealt with by a separate implementation of something like IFieldNameDecoder that uses an 8-bit segment, and asking the user to declare a cluster/class as using that if they have a use case for it.
--
You received this message because you are subscribed to the Google Groups "OrientDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.
