> The point is: why should I store the field name when I've declared
> that a class has such names?
Precisely. But I don't think you need to limit it to the declarative
case... i.e. schema-full. By using a numbered field_id you cover
schema-full, schema-mixed and schema-free cases with a single solution.
There are two issues here: performance and storage space. Arguably
improving storage space also improves performance in a big-data context
because it allows caches to retain more logical units in memory.
I've been having a good think about this and I believe I've come up with
a viable plan that solves a few problems. It requires schema versioning.
I was hesitant to make this suggestion as it introduces more complexity
in order to improve compactness and avoid unnecessary reading of
metadata. However, I see from your original proposal that the problem
exists there as well:
Cons:

* Every time the schema changes, a full scan and update of records is
needed
The proposal is that record metadata is made of 3 parts plus a
meta-header (which in most cases would be 2-3 bytes): fixed-length
schema-declared fields, variable-length schema-declared fields and
schema-less fields.
The problem, as you point out, with a single schema per class is that if
you change the schema you have to update every record. If you insert a
field before the last field you would likely have to rewrite every
record from scratch.
First, a couple of definitions:
varint8: a standard varint built from any number of 1-byte segments.
The first bit of each segment is set to 1 if there is a subsequent
segment. A number is constructed by concatenating the last 7 bits of
each byte. This allows for the following value ranges:
1 byte : 127
2 bytes: 16k
3 bytes: 2m
4 bytes: 268m
varint16: same as varint8 but the first segment is 16 bits and all
subsequent are 8 bits
2 bytes: 32k
3 bytes: 4m
4 bytes: 536m
nameId: an int (or long) index into a field-name array. This index
could be one per JVM or one per class. Getting the field name from the
nameId is a single array lookup. It is stored on disk as a varint16,
allowing 32k names before we need a 3rd byte for name storage.
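To make the two encodings above concrete, here's a minimal sketch. One assumption on my part: I've written the groups most-significant first, which the definitions above don't actually pin down (protobuf, for comparison, uses least-significant first).

```java
import java.util.Arrays;

// Sketch of the varint8/varint16 encodings defined above.
// Assumption: groups are written most-significant first.
public final class VarIntSketch {

    // varint8: 7 payload bits per byte; the high bit of every byte
    // except the last is 1, signalling a subsequent segment.
    static byte[] encodeVarint8(int value) {
        int groups = 1;
        for (int v = value >>> 7; v != 0; v >>>= 7) groups++;
        byte[] out = new byte[groups];
        for (int i = groups - 1; i >= 0; i--) {
            out[i] = (byte) (value & 0x7F);
            value >>>= 7;
        }
        for (int i = 0; i < groups - 1; i++) out[i] |= (byte) 0x80; // continuation bits
        return out;
    }

    static int decodeVarint8(byte[] in, int off) {
        int result = 0, b;
        do {
            b = in[off++] & 0xFF;
            result = (result << 7) | (b & 0x7F);
        } while ((b & 0x80) != 0);
        return result;
    }

    // varint16: the first segment is 16 bits (15 payload bits plus a
    // continuation bit); subsequent segments are 8-bit groups as above.
    static byte[] encodeVarint16(int value) {
        int extra = 0;
        for (int v = value >>> 15; v != 0; v >>>= 7) extra++;
        byte[] out = new byte[2 + extra];
        int v = value;
        for (int i = out.length - 1; i >= 2; i--) { // trailing 7-bit groups
            out[i] = (byte) (v & 0x7F);
            v >>>= 7;
        }
        out[0] = (byte) ((v >>> 8) & 0x7F);
        out[1] = (byte) (v & 0xFF);
        if (extra > 0) {
            out[0] |= (byte) 0x80;
            for (int i = 2; i < out.length - 1; i++) out[i] |= (byte) 0x80;
        }
        return out;
    }

    static int decodeVarint16(byte[] in, int off) {
        int b0 = in[off] & 0xFF, b1 = in[off + 1] & 0xFF;
        int result = ((b0 & 0x7F) << 8) | b1;
        int more = b0 & 0x80;
        off += 2;
        while (more != 0) {
            int b = in[off++] & 0xFF;
            result = (result << 7) | (b & 0x7F);
            more = b & 0x80;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encodeVarint8(300)));      // 2 bytes
        System.out.println(decodeVarint16(encodeVarint16(40000), 0)); // round-trips
    }
}
```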
I propose a record header that looks like this:
version:varint8|header_length:varint8|variable_length_declared_field_headers|undeclared_field_headers
Version is the schema version and would in most cases be only 1 byte.
You would need 128 schema changes to make it 2 bytes. This proposal
would require a cleanup tool that could scan all records and reset them
all to the most recent schema version (at which point the version is
reset to 0). But it would not be necessary on every schema change; the
user could choose if and when to run it. The only time you would need
to do a full scan would be if you were introducing some sort of
constraint and needed to validate that existing records don't violate it.
When a new schema is generated the user defined order of fields is
stored in each field's Schema entry. Internally the fields are
rearranged so that all fixed length fields come first. Because the
order and length of fields is known by the schema there is no need to
store offset/length in the record header.
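As a sketch of what that buys us: reading a fixed-length declared field then needs no per-record metadata at all, only the schema-derived offset and the header length from the meta-header. All class and field names below are invented for illustration (they're not from the attached demo):

```java
// Hypothetical sketch: with fixed-length fields rearranged to the front
// and their order/length known from the schema, reading one needs no
// per-record metadata.
final class FixedFieldReader {

    // Computed once per schema version, never stored in the record.
    static final class FixedField {
        final int offsetInFixedSection; // sum of lengths of preceding fixed fields
        final int length;
        FixedField(int offsetInFixedSection, int length) {
            this.offsetInFixedSection = offsetInFixedSection;
            this.length = length;
        }
    }

    // The fixed-length section begins immediately after the record
    // header, whose size the header's own meta-header declares.
    static byte[] read(byte[] record, int headerLength, FixedField f) {
        byte[] out = new byte[f.length];
        System.arraycopy(record, headerLength + f.offsetInFixedSection,
                         out, 0, f.length);
        return out;
    }
}
```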
Variable-length declared fields need only a length and an offset; the
rest of the field metadata is determined by the schema.
Finally undeclared (schema-less) fields require additional header data:
nameId:varint16|dataType:byte?|offset:varint8|length:varint8
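Because those entries are variable-width, a lookup in this section has to walk entries in order rather than jump to one. A sketch of that walk (names are mine, and the most-significant-first varint group order is my assumption):

```java
// Hypothetical sketch of looking up a schema-less field's metadata.
// Each entry is nameId:varint16 | dataType:byte | offset:varint8 |
// length:varint8, so entries must be scanned in order.
final class SchemalessHeaderScanner {

    // Returns {dataType, offset, length} for wantedNameId, or null
    // if this record doesn't store that field.
    static int[] find(byte[] h, int start, int end, int wantedNameId) {
        int[] pos = {start}; // mutable cursor shared with the decoders
        while (pos[0] < end) {
            int nameId = varint16(h, pos);
            int dataType = h[pos[0]++] & 0xFF;
            int offset = varint8(h, pos);
            int length = varint8(h, pos);
            if (nameId == wantedNameId) return new int[]{dataType, offset, length};
        }
        return null;
    }

    // 7 payload bits per byte; high bit set means another byte follows.
    static int varint8(byte[] h, int[] pos) {
        int r = 0, b;
        do {
            b = h[pos[0]++] & 0xFF;
            r = (r << 7) | (b & 0x7F);
        } while ((b & 0x80) != 0);
        return r;
    }

    // 16-bit first segment (15 payload bits), 8-bit segments after.
    static int varint16(byte[] h, int[] pos) {
        int b0 = h[pos[0]++] & 0xFF, b1 = h[pos[0]++] & 0xFF;
        int r = ((b0 & 0x7F) << 8) | b1;
        int more = b0 & 0x80;
        while (more != 0) {
            int b = h[pos[0]++] & 0xFF;
            r = (r << 7) | (b & 0x7F);
            more = b & 0x80;
        }
        return r;
    }
}
```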
I've attached a very rough partial implementation to try and demonstrate
the concept. It won't run because a number of low level functions
aren't implemented but if you start at the Record class you should be
able to follow the code through from the read(int nameId) method. It
demonstrates how you would read a schema/fixed, schema/variable and
non-schema field from the record using random access.
I think I've made one significant mistake in the demo code. I've used
varints to store offset/length for schema-variable-length fields. This
means you cannot find the header for one of those fields without scanning
that entire section of the header. The same is true for schema-less
fields; however, in that case it doesn't matter: since we don't know from
the schema what fields are there (or their order), we have no option but
to scan that part of the header to find the field metadata we are looking
for.
The advantage, though, of storing length as a varint is that in perhaps
a majority of cases the field length is going to be 127 bytes or less,
which means you can store it in a single byte rather than 4 or 8 for an
int or long.
We have a couple of potential tradeoffs to consider here (only relevant
to the schema-declared variable-length fields). By doing a full scan of
the header we can use varints with impunity and gain storage benefits
from them. We can also dispense with storing the offset field
altogether, as it can be calculated during the header scan, potentially
reducing the header entry for each field from 8 bytes (if you use an int
for each of offset and length) to as little as 1. We also remove a
potential constraint on maximum field length. On the other hand, if we
use fixed-length fields (like int or long) to store offset/length, we
gain random access within the header.
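To put rough numbers on that tradeoff, here's a small illustrative calculation (the 100-byte field length is just an example):

```java
// Illustrative arithmetic only: header-entry bytes per variable-length
// declared field under the two schemes discussed above.
public final class HeaderCost {

    // varint scheme: offset omitted (recomputed during the header scan),
    // length stored as a varint8 (7 payload bits per byte).
    static int varintEntryBytes(int fieldLength) {
        int groups = 1;
        for (int v = fieldLength >>> 7; v != 0; v >>>= 7) groups++;
        return groups;
    }

    // fixed scheme: int offset + int length, random access preserved.
    static int fixedEntryBytes() {
        return 4 + 4;
    }

    public static void main(String[] args) {
        // a 100-byte field: 1 byte vs 8 bytes per header entry
        System.out.println(varintEntryBytes(100) + " vs " + fixedEntryBytes());
    }
}
```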
I can see two edge cases where this sort of scheme would run into
difficulties or potentially create a storage penalty: 1) a dataset that
has a vast number of different fields, perhaps where the user is for
some reason using the field name as a kind of metadata, which would
bloat the in-memory field_name table; and 2) where a user has adopted
the (rather hideous) MongoDB solution of abbreviating field names and
taken it to the extreme of single-character field names. In this case
my proposed 16-bit minimum nameId size would be 8 bits over what could
be achieved.
The first issue could be dealt with by making the tokenised field-name
feature available only where the field is declared in the schema
(basically your proposal), but that would also require a flag on the
internally stored field_name token to indicate whether it's a schema
token or a schema-less full field name. It could be mitigated by giving
an option for full field_name storage (I would imagine this would be a
rare use case).
The second issue (if deemed important enough to address) could be dealt
with by a separate implementation of something like IFieldNameDecoder
that uses an 8-bit first segment, asking the user to declare a
cluster/class as using it if they have a use case for that.
>
> persistent class Employee {
> String name;
> String surname;
> int age;
> }
>
> My idea is to assign an id (short or integer) to the property and use
> that id instead of name. This would reduce dramatically the record
> sizes and the memory consumed. We've to figure out a way where:
> - schema-full -> best performance
> - schema-mixed -> uses schema fields when declared, then go schema-free
> - schema-free -> close to now, all the field names are stored in the
> record
>
> Then we're thinking about the best way to store field and values.
>
> Lvc@
>
>
>
>
> On 17 February 2014 12:46, Steve <[email protected]> wrote:
>
> Thanks Andrey,
>
> I'm still convinced that my idea is too simple and too obvious so
> I must be missing something.
>
> If I am I'd love someone to tell me what I've missed so I can
> understand Orient better. That was the main reason for putting
> the question.
>
>
> On 17/02/14 21:31, Andrey Lomakin wrote:
>> Hi Steve )).
>> It seems good idea, I will put your comment inside issue.
>>
>>
>> On Sun, Feb 16, 2014 at 5:53 AM, Steve <[email protected]> wrote:
>>
>> This is probably going to be a stupid question because the
>> solution seems so obvious I must have missed something
>> fundamental.
>>
>> I found OrientDB when I gave up on MongoDB due to the issue of
>> storing field names in every document (for a lot of my data
>> the field names are larger than the data itself). I just
>> came across issue #1890
>> <https://github.com/orientechnologies/orientdb/issues/1890>
>> and happy to see that Orient considers this a priority but I
>> don't quite understand the need for such a complex approach.
>>
>> Why not simply maintain an internal index of field names and
>> store the index? It wouldn't really matter if you had
>> different classes with the same field name since the name is
>> all you are interested in. To further compact things you
>> could use a format like google protobufs 'varint' type
>>
>> <https://developers.google.com/protocol-buffers/docs/encoding#varints>.
>> If you altered the varint format so the first byte 'grouping'
>> was 16 bits rather than 8 then you'd have 32k field names
>> available before needing to expand (which would cover an
>> awful lot of use cases).
>>
>> The lookup would be as trivial as an array lookup and any
>> overhead would be more than offset by the benefits of being
>> able to cache many more records in memory due to the space
>> savings. Another potential advantage would be that you only
>> ever use one instance of each field name String and vastly
>> improve any map lookups that are done internally. If the
>> current format writes the actual field name as a string then
>> every time a field is read it's reading a new string so for
>> every field * every record where a map lookup is required it
>> must compute hashcode and run a manual char by char equals().
>> 3 traversals of the string saved on the first lookup (1 for
>> hashcode and 1 for both strings) and 2 for subsequent lookups.
>>
>> On the client side I suppose there is the issue of whether
>> the client should keep the entire lookup table in memory. It
>> could be passed portions of it as needed and use something
>> like a Trove map for lookups. Not quite as fast as an array
>> lookup but again I would imagine the savings in memory,
>> bandwidth etc would more than offset the cost.
>>
>> I must be missing something?
>> --
>>
>> ---
>> You received this message because you are subscribed to the
>> Google Groups "OrientDB" group.
>> To unsubscribe from this group and stop receiving emails from
>> it, send an email to
>> [email protected]
>> <mailto:orient-database%[email protected]>.
>> For more options, visit https://groups.google.com/groups/opt_out.
>>
>>
>>
>>
>> --
>> Best regards,
>> Andrey Lomakin.
>>
>> Orient Technologies
>> the Company behind OrientDB
>>
>> --
>>
>> ---
>> You received this message because you are subscribed to the
>> Google Groups "OrientDB" group.
>> To unsubscribe from this group and stop receiving emails from it,
>> send an email to [email protected]
>> <mailto:[email protected]>.
>> For more options, visit https://groups.google.com/groups/opt_out.
>
> --
>
> ---
> You received this message because you are subscribed to the Google
> Groups "OrientDB" group.
> To unsubscribe from this group and stop receiving emails from it,
> send an email to [email protected]
> <mailto:orient-database%[email protected]>.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>
> --
>
> ---
> You received this message because you are subscribed to the Google
> Groups "OrientDB" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
Attachment: orient-schema-demo.tar.gz (application/gzip)
