:-)
On 19 February 2014 13:13, Steve Coughlan <[email protected]> wrote:

> Flattery will get you everywhere.. lol :)
>
> On Feb 19, 2014 10:11 PM, "Luca Garulli" <[email protected]> wrote:
>
>> Hi Steve,
>> your previous email shows me your skill in this area, so I'm confident
>> you could give us a big contribution towards a faster and more efficient
>> release 2.0 ;-)
>>
>> Lvc@
>>
>> On 19 February 2014 12:53, Steve <[email protected]> wrote:
>>
>>> Hi Luca,
>>>
>>> I'll give it a go with the real ODB code. The reason I didn't is
>>> because I'm actually quite new to ODB, even as an end user, but your
>>> instructions will set me in the right direction. Most of my experience
>>> with data serialization formats has been with Bitcoin, which was mostly
>>> for network protocol use cases rather than big-data storage. But that
>>> was also a high-performance scenario, so I guess there are a lot of
>>> parallels.
>>>
>>> On 19/02/14 21:33, Luca Garulli wrote:
>>>
>>> Hi Steve,
>>> sorry for the delay.
>>>
>>> I like your ideas; I think this is the right direction. varint8 and
>>> varint16 could be a good way to save space, but we should consider
>>> whether this slows down some use cases, like partial field loading.
>>>
>>> About the POC you created: I think it would be much more useful if you
>>> played with real documents. It's easy, and you could push it to a
>>> separate branch to let us and other developers contribute & test. WDYT?
>>>
>>> Follow these steps:
>>>
>>> (1) create your serializer
>>>
>>> This is the skeleton of the class to implement:
>>>
>>> public class BinaryDocumentSerializer implements ORecordSerializer {
>>>   public static final String NAME = "binarydoc";
>>>
>>>   // UN-MARSHALLING
>>>   public ORecordInternal<?> fromStream(final byte[] iSource) {
>>>     return null; // TODO: decode the record
>>>   }
>>>
>>>   // PARTIAL UN-MARSHALLING
>>>   public ORecordInternal<?> fromStream(final byte[] iSource,
>>>       final ORecordInternal<?> iRecord, String[] iFields) {
>>>     return null; // TODO: decode only the requested fields
>>>   }
>>>
>>>   // MARSHALLING
>>>   public byte[] toStream(final ORecordInternal<?> iSource,
>>>       boolean iOnlyDelta) {
>>>     return null; // TODO: encode the record
>>>   }
>>> }
>>>
>>> (2) register your implementation
>>>
>>> ORecordSerializerFactory.instance().register(BinaryDocumentSerializer.NAME,
>>>     new BinaryDocumentSerializer());
>>>
>>> (3) create a new ODocument subclass
>>>
>>> Then create a new class that extends ODocument but uses your
>>> implementation:
>>>
>>> public class BinaryDocument extends ODocument {
>>>   protected void setup() {
>>>     super.setup();
>>>     _recordFormat = ORecordSerializerFactory.instance()
>>>         .getFormat(BinaryDocumentSerializer.NAME);
>>>   }
>>> }
>>>
>>> (4) Try it!
>>>
>>> And now try to create a BinaryDocument, set fields and call .save().
>>> The method BinaryDocumentSerializer.toStream() will be called.
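>>>
>>> For example, a rough, untested sketch (the plocal URL is just an
>>> example):
>>>
>>> import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
>>>
>>> ODatabaseDocumentTx db =
>>>     new ODatabaseDocumentTx("plocal:/tmp/bindoc-test").create();
>>> try {
>>>   BinaryDocument doc = new BinaryDocument();
>>>   doc.field("name", "Jay");  // set a couple of fields
>>>   doc.field("age", 33);
>>>   doc.save();                // invokes BinaryDocumentSerializer.toStream()
>>> } finally {
>>>   db.close();
>>> }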
>>>
>>> Lvc@
>>>
>>> On 18 February 2014 06:08, Steve <[email protected]> wrote:
>>>
>>>>> The point is: why should I store the field name when I've declared
>>>>> that a class has such names?
>>>>
>>>> Precisely. But I don't think you need to limit it to the declarative
>>>> case, i.e. schema-full. By using a numbered field_id you cover the
>>>> schema-full, schema-mixed and schema-free cases with a single
>>>> solution. There are two issues here: performance and storage space.
>>>> Arguably, improving storage space also improves performance in a
>>>> big-data context because it allows caches to retain more logical units
>>>> in memory.
>>>>
>>>> I've been having a good think about this and I think I've come up with
>>>> a viable plan that solves a few problems. It requires schema
>>>> versioning.
>>>>
>>>> I was hesitant to make this suggestion as it introduces more
>>>> complexity in order to improve compactness and avoid unnecessary
>>>> reading of metadata. However, I see from your original proposal that
>>>> the problem exists there as well:
>>>>
>>>> Cons:
>>>> - Every time the schema changes, a full scan and update of records is
>>>>   needed
>>>>
>>>> The proposal is that record metadata is made of 3 parts plus a
>>>> meta-header (which in most cases would be 2-3 bytes): fixed-length
>>>> schema-declared fields, variable-length schema-declared fields, and
>>>> schema-less fields. The problem with a single schema per class, as you
>>>> point out, is that if you change the schema you have to update every
>>>> record. If you insert a field before the last field you would likely
>>>> have to rewrite every record from scratch.
>>>>
>>>> First, a couple of definitions:
>>>>
>>>> varint8: a standard varint built from any number of 1-byte segments.
>>>> The first bit of each segment is set to 1 if there is a subsequent
>>>> segment. A number is constructed by concatenating the last 7 bits of
>>>> each byte. This allows for the following value ranges:
>>>> 1 byte : 127
>>>> 2 bytes: 16k
>>>> 3 bytes: 2m
>>>> 4 bytes: 268m
>>>>
>>>> varint16: same as varint8, but the first segment is 16 bits and all
>>>> subsequent segments are 8 bits:
>>>> 2 bytes: 32k
>>>> 3 bytes: 4m
>>>> 4 bytes: 536m
>>>>
>>>> nameId: an int (or long) index into a field-name array. This index
>>>> could be one per JVM or one per class. Getting the field name using
>>>> the nameId is a single array lookup. It is stored on disk as a
>>>> varint16, allowing 32k names before we need to use a 3rd byte for name
>>>> storage.
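>>>>
>>>> In code the two encodings would look roughly like this (a sketch only;
>>>> it assumes the low 7 bits are written first, and the names are made
>>>> up, not taken from the attached demo):
>>>>
>>>> import java.io.ByteArrayInputStream;
>>>> import java.io.ByteArrayOutputStream;
>>>>
>>>> public class Varints {
>>>>   // varint8: 1-byte segments, high bit = "another segment follows".
>>>>   public static void writeVarint8(ByteArrayOutputStream out, int value) {
>>>>     while ((value & ~0x7F) != 0) {
>>>>       out.write((value & 0x7F) | 0x80); // continuation bit set
>>>>       value >>>= 7;
>>>>     }
>>>>     out.write(value); // final segment, continuation bit clear
>>>>   }
>>>>
>>>>   public static int readVarint8(ByteArrayInputStream in) {
>>>>     int value = 0, shift = 0, b;
>>>>     do {
>>>>       b = in.read();
>>>>       value |= (b & 0x7F) << shift;
>>>>       shift += 7;
>>>>     } while ((b & 0x80) != 0);
>>>>     return value;
>>>>   }
>>>>
>>>>   // varint16: a 16-bit first segment (15 value bits + continuation
>>>>   // bit), then 8-bit segments as above: 2 bytes = 32k, 3 = 4m, 4 = 536m.
>>>>   public static int readVarint16(ByteArrayInputStream in) {
>>>>     int first = (in.read() << 8) | in.read();
>>>>     int value = first & 0x7FFF;
>>>>     if ((first & 0x8000) == 0)
>>>>       return value;
>>>>     int shift = 15, b;
>>>>     do {
>>>>       b = in.read();
>>>>       value |= (b & 0x7F) << shift;
>>>>       shift += 7;
>>>>     } while ((b & 0x80) != 0);
>>>>     return value;
>>>>   }
>>>> }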
>>>>
>>>> I propose a record header that looks like this:
>>>>
>>>> version:varint8|header_length:varint8|variable_length_declared_field_headers|undeclared_field_headers
>>>>
>>>> Version is the schema version and would in most cases be only 1 byte;
>>>> you would need 128 schema changes to make it 2 bytes. This proposal
>>>> would require a cleanup tool that could scan all records and reset
>>>> them to the most recent schema version (at which point the version is
>>>> reset to 0), but it would not be necessary on every schema change: the
>>>> user could choose if and when to run it. The only time you would need
>>>> to do a full scan would be if you were introducing some sort of
>>>> constraint and needed to validate that existing records don't violate
>>>> it.
>>>>
>>>> When a new schema is generated, the user-defined order of fields is
>>>> stored in each field's schema entry. Internally the fields are
>>>> rearranged so that all fixed-length fields come first. Because the
>>>> order and length of those fields is known from the schema, there is no
>>>> need to store offset/length in the record header.
>>>>
>>>> Variable-length declared fields need only a length and an offset; the
>>>> rest of the field metadata is determined by the schema.
>>>>
>>>> Finally, undeclared (schema-less) fields require additional header
>>>> data:
>>>>
>>>> nameId:varint16|dataType:byte?|offset:varint8|length:varint8
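>>>>
>>>> Reading one of those entries back would look something like this
>>>> (sketch only, reusing the varint readers above; the helper name is
>>>> made up, and it assumes the stream holds just the schema-less section
>>>> of the header):
>>>>
>>>> // Scan the schema-less header entries for a given nameId and return
>>>> // { dataType, offset, length }, or null if the field isn't present.
>>>> static int[] findUndeclaredField(ByteArrayInputStream header,
>>>>                                  int wantedNameId) {
>>>>   while (header.available() > 0) {
>>>>     int nameId = Varints.readVarint16(header);
>>>>     int dataType = header.read();
>>>>     int offset = Varints.readVarint8(header);
>>>>     int length = Varints.readVarint8(header);
>>>>     if (nameId == wantedNameId)
>>>>       return new int[] { dataType, offset, length };
>>>>   }
>>>>   return null;
>>>> }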
>>>>
>>>> I've attached a very rough partial implementation to try and
>>>> demonstrate the concept. It won't run because a number of low-level
>>>> functions aren't implemented, but if you start at the Record class you
>>>> should be able to follow the code through from the read(int nameId)
>>>> method. It demonstrates how you would read a schema/fixed,
>>>> schema/variable and non-schema field from the record using random
>>>> access.
>>>>
>>>> I think I've made one significant mistake in the demo code: I've used
>>>> varints to store offset/length for schema-variable-length fields. This
>>>> means you cannot find the header for one of those fields without
>>>> scanning that entire section of the header. The same is true for the
>>>> schema-less section; however, in that case it doesn't matter: since we
>>>> don't know from the schema what fields are there (or their order), we
>>>> have no option but to scan that part of the header to find the field
>>>> metadata we are looking for.
>>>>
>>>> The advantage, though, of storing length as a varint is that in
>>>> perhaps a majority of cases field length is going to be no more than
>>>> 127 bytes, which means you can store it in a single byte rather than 4
>>>> or 8 for an int or long.
>>>>
>>>> We have a couple of potential tradeoffs to consider here (only
>>>> relevant to the schema-declared variable-length fields). By doing a
>>>> full scan of the header we can use varints with impunity and gain
>>>> storage benefits from it. We can also dispense with storing the offset
>>>> field altogether, as it can be calculated during the header scan (see
>>>> the sketch at the end of this mail), potentially reducing the header
>>>> entry for each field from 8 bytes (an int each for offset and length)
>>>> to as little as 1. We also remove a potential constraint on maximum
>>>> field length. On the other hand, if we use fixed-length fields (int or
>>>> long) to store offset/length, we gain random access within the header.
>>>>
>>>> I can see two edge cases where this sort of scheme would run into
>>>> difficulties or potentially create a storage penalty:
>>>>
>>>> 1) a dataset that has a vast number of different fields, perhaps where
>>>> the user is for some reason using the field name as a kind of
>>>> metadata, which would increase the in-memory field_name table; and
>>>>
>>>> 2) where a user has adopted the (rather hideous) mongoDB solution of
>>>> abbreviating field names and taken it to the extreme of a
>>>> single-character field name. In this case my proposed 16-bit minimum
>>>> nameIndex size would be 8 bits over what could be achieved.
>>>>
>>>> The first issue could be dealt with by making the tokenised field-name
>>>> feature available only where the field is declared in the schema
>>>> (basically your proposal). That would also require a flag on the
>>>> internally stored field_name token to indicate whether it's a schema
>>>> token or a schema-less full field name. It could be mitigated by
>>>> giving an option for full field_name storage (I would imagine this
>>>> would be a rare use case).
>>>>
>>>> The second issue (if deemed important enough to address) could be
>>>> dealt with by a separate implementation of something like
>>>> IFieldNameDecoder that uses an 8-bit first segment, and asking the
>>>> user to declare a cluster/class as using it if they have a use case
>>>> for it.
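>>>>
>>>> PS: the scan-and-accumulate variant for the schema-declared
>>>> variable-length section would look roughly like this (a sketch; the
>>>> names are made up, and it assumes the schema supplies the field count
>>>> and order for the record's version, with dataStart pointing just past
>>>> the fixed-length data):
>>>>
>>>> // Header stores lengths only; offsets are accumulated while scanning.
>>>> // Returns { offset, length } for the wanted field index, or null.
>>>> static int[] findDeclaredVariableField(ByteArrayInputStream header,
>>>>     int dataStart, int fieldCount, int wantedIndex) {
>>>>   int offset = dataStart;
>>>>   for (int i = 0; i < fieldCount; i++) {
>>>>     int length = Varints.readVarint8(header); // 1 byte when < 128
>>>>     if (i == wantedIndex)
>>>>       return new int[] { offset, length };
>>>>     offset += length;
>>>>   }
>>>>   return null;
>>>> }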
