Hi,

Just wanted to chime in and provide additional encouragement. This would matter a great deal to us.
Regards,
-Stefan

On Wednesday, 19 February 2014 12:37:06 UTC, Lvc@ wrote:

:-)

On 19 February 2014 13:13, Steve Coughlan <[email protected]> wrote:

Flattery will get you everywhere.. lol :)

On Feb 19, 2014 10:11 PM, "Luca Garulli" <[email protected]> wrote:

Hi Steve,
your previous email shows me your skill on this, so I'm confident you could give us a big contribution towards a faster and more efficient release 2.0 ;-)

Lvc@

On 19 February 2014 12:53, Steve <[email protected]> wrote:

Hi Luca,

I'll give it a go with the real ODB code. The reason I didn't is that I'm actually quite new to ODB, even as an end user, but your instructions will set me in the right direction. Most of my experience with data serialization formats has been with Bitcoin, which was mostly for network-protocol use cases rather than big-data storage. But that was also a high-performance scenario, so I guess there are a lot of parallels.

On 19/02/14 21:33, Luca Garulli wrote:

Hi Steve,
sorry for such a delay.

I like your ideas; I think this is the right direction. varint8 and varint16 could be a good way to save space, but we should consider when this slows down some use cases, like partial field loading.

About the POC you created, I think it would be much more useful if you played with real documents. It's easy, and you could push it to a separate branch to let us and other developers contribute and test. WDYT?
Follow these steps:

(1) Create your serializer

This is the skeleton of the class to implement:

    public class BinaryDocumentSerializer implements ORecordSerializer {
      public static final String NAME = "binarydoc";

      // UN-MARSHALLING
      public ORecordInternal<?> fromStream(final byte[] iSource) {
        return null; // TODO: implement
      }

      // PARTIAL UN-MARSHALLING
      public ORecordInternal<?> fromStream(final byte[] iSource,
          final ORecordInternal<?> iRecord, String[] iFields) {
        return null; // TODO: implement
      }

      // MARSHALLING
      public byte[] toStream(final ORecordInternal<?> iSource, boolean iOnlyDelta) {
        return null; // TODO: implement
      }
    }

(2) Register your implementation

    ORecordSerializerFactory.instance().register(BinaryDocumentSerializer.NAME,
        new BinaryDocumentSerializer());

(3) Create a new ODocument subclass

Then create a new class that extends ODocument but uses your implementation:

    public class BinaryDocument extends ODocument {
      protected void setup() {
        super.setup();
        _recordFormat = ORecordSerializerFactory.instance()
            .getFormat(BinaryDocumentSerializer.NAME);
      }
    }

(4) Try it!

Now create a BinaryDocument, set some fields and call .save(). The method BinaryDocumentSerializer.toStream() will be called.

Lvc@

On 18 February 2014 06:08, Steve <[email protected]> wrote:

> The point is: why should I store the field name when I've declared that a class has such names?

Precisely. But I don't think you need to limit it to the declarative case, i.e. schema-full. By using a numbered field_id you cover the schema-full, schema-mixed and schema-free cases with a single solution. There are two issues here: performance and storage space.
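For illustration, the numbered field_id could be as simple as an interning table; here's a rough standalone sketch with made-up names (FieldNameRegistry, idFor, nameFor), not real ODB code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a per-class (or per-JVM) field-name registry backing the
// numbered field_id idea. All names here are hypothetical, not ODB API.
public class FieldNameRegistry {
  private final Map<String, Integer> idByName = new HashMap<String, Integer>();
  private final List<String> nameById = new ArrayList<String>();

  // Returns the existing id for a field name, or assigns the next free one.
  // Works the same whether or not the field was declared in the schema,
  // which is what lets one mechanism cover schema-full/mixed/free.
  public int idFor(final String name) {
    final Integer id = idByName.get(name);
    if (id != null)
      return id;
    final int next = nameById.size();
    idByName.put(name, next);
    nameById.add(name);
    return next;
  }

  // Single list lookup to go back from the stored id to the name.
  public String nameFor(final int id) {
    return nameById.get(id);
  }
}
```

Records would then store only the small integer id instead of the repeated name string.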
Arguably, improving storage space also improves performance in a big-data context, because it allows caches to retain more logical units in memory.

I've been having a good think about this and I think I've come up with a viable plan that solves a few problems. It requires schema versioning.

I was hesitant to make this suggestion, as it introduces more complexity in order to improve compactness and reduce unnecessary reading of metadata. However, I see from your original proposal that the problem exists there as well:

Cons:
- Every time the schema changes, a full scan and update of records is needed

The proposal is that record metadata is made of 3 parts plus a meta-header (which in most cases would be 2-3 bytes): fixed-length schema-declared fields, variable-length schema-declared fields and schema-less fields. The problem, as you point out, with a single schema per class is that if you change the schema you have to update every record. If you insert a field before the last field you would likely have to rewrite every record from scratch.

First, a couple of definitions:

varint8: a standard varint built from any number of 1-byte segments. The first bit of each segment is set to 1 if there is a subsequent segment. A number is constructed by concatenating the last 7 bits of each byte. This allows for the following value ranges:
1 byte : 127
2 bytes: 16k
3 bytes: 2m
4 bytes: 268m

varint16: same as varint8, but the first segment is 16 bits and all subsequent segments are 8 bits:
2 bytes: 32k
3 bytes: 4m
4 bytes: 536m

nameId: an int (or long) index into a field-name array. This index could be one per JVM or one per class. Getting the field name from the nameId is a single array lookup.
This is stored on disk as a varint16, allowing 32k names before we need a 3rd byte for name storage.

I propose a record header that looks like this:

version:varint8|header_length:varint8|variable_length_declared_field_headers|undeclared_field_headers

version is the schema version and would in most cases be only 1 byte; you would need 128 schema changes to make it 2 bytes. This proposal would require a cleanup tool that could scan all records and reset them to the most recent schema version (at which point version is reset to 0), but it wouldn't be necessary on every schema change. The user could choose if and when to run it. The only time you would need a full scan would be if you are introducing some sort of constraint and needed to validate that existing records don't violate it.

When a new schema is generated, the user-defined order of fields is stored in each field's schema entry. Internally the fields are rearranged so that all fixed-length fields come first. Because the order and length of these fields is known by the schema, there is no need to store offset/length in the record header.

Variable-length declared fields need only a length and offset; the rest of the field metadata is determined by the schema.

Finally, undeclared (schema-less) fields require additional header data:

nameId:varint16|dataType:byte?|offset:varint8|length:varint8

I've attached a very rough partial implementation to try to demonstrate the concept. It won't run, because a number of low-level functions aren't implemented, but if you start at the Record class you should be able to follow the code through from the read(int nameId) method.
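As an aside, the varint8/varint16 encodings above might look something like this in plain Java; a rough standalone sketch, not the attached code, and the least-significant-segment-first byte order is an assumption of the sketch:

```java
import java.io.ByteArrayOutputStream;

// Standalone sketch of the varint8/varint16 encodings described above.
// Segment order (least-significant 7 bits first) is an assumption.
public class VarintSketch {

  // varint8: 7 payload bits per byte, high bit = "another segment follows".
  public static byte[] writeVarint8(long value) {
    final ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((value & ~0x7FL) != 0) {
      out.write((int) ((value & 0x7F) | 0x80)); // continuation bit set
      value >>>= 7;
    }
    out.write((int) (value & 0x7F)); // final segment, continuation bit clear
    return out.toByteArray();
  }

  public static long readVarint8(final byte[] buf, int offset) {
    long result = 0;
    int shift = 0;
    while (true) {
      final byte b = buf[offset++];
      result |= (long) (b & 0x7F) << shift;
      if ((b & 0x80) == 0)
        return result;
      shift += 7;
    }
  }

  // varint16: first segment is 16 bits (15 payload bits + continuation bit),
  // subsequent segments are varint8-style bytes.
  public static byte[] writeVarint16(long value) {
    final ByteArrayOutputStream out = new ByteArrayOutputStream();
    int first = (int) (value & 0x7FFF);
    value >>>= 15;
    if (value != 0)
      first |= 0x8000;          // continuation bit in the 16-bit segment
    out.write(first >>> 8);
    out.write(first & 0xFF);
    while (value != 0) {
      final int chunk = (int) (value & 0x7F);
      value >>>= 7;
      out.write(value != 0 ? chunk | 0x80 : chunk);
    }
    return out.toByteArray();
  }

  public static long readVarint16(final byte[] buf, final int offset) {
    final int first = ((buf[offset] & 0xFF) << 8) | (buf[offset + 1] & 0xFF);
    long result = first & 0x7FFF;
    if ((first & 0x8000) == 0)
      return result;
    int shift = 15;
    int i = offset + 2;
    while (true) {
      final byte b = buf[i++];
      result |= (long) (b & 0x7F) << shift;
      if ((b & 0x80) == 0)
        return result;
      shift += 7;
    }
  }
}
```

With this layout a length under 128 costs 1 byte as a varint8, and a nameId under 32k always stays at 2 bytes as a varint16, matching the ranges listed above.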
It demonstrates how you would read a schema/fixed, schema/variable and non-schema field from the record using random access.

I think I've made one significant mistake in the demo code: I've used varints to store offset/length for schema-variable-length fields. This means you cannot find the header for one of those fields without scanning that entire section of the header. The same is true for schema-less fields; however, in that case it doesn't matter, since we don't know what fields are there (or their order) from the schema, so we have no option but to scan that part of the header to find the field metadata we are looking for.

The advantage of storing length as a varint, though, is that in perhaps a majority of cases the field length is going to be less than 127 bytes, which means you can store it in a single byte rather than 4 or 8 for an int or long.

We have a couple of potential tradeoffs to consider here (only relevant to the schema-declared variable-length fields). By doing a full scan of the header we can use varints with impunity and gain storage benefits from it. We can also dispense with storing the offset field altogether, as it can be calculated during the header scan, potentially reducing the header entry for each field from 8 bytes (if you use int) to as little as 1. We also remove a potential constraint on maximum field length. On the other hand, if we use fixed-length fields (like int or long) to store offset/length, we gain random access within the header.

I can see two edge cases where this sort of scheme would run into difficulties or potentially create a storage penalty: 1) a dataset that has a vast number of different fields, perhaps where the user is for some reason using the field name as a kind of metadata, which would increase the in-memory field_name table; and 2) where a user has adopted the (rather hideous) MongoDB solution of abbreviating field names and taken it to the extreme of single-character field names. In this case my proposed 16-bit minimum nameId size would be 8 bits over what could be achieved.

The first issue could be dealt with by making the tokenised field-name feature available only where the field is declared in the schema (basically your proposal), but that would also require a flag on the internally stored field_name token to indicate whether it's a schema token or a schema-less full field name. It could be mitigated by giving an option for full field_name storage (I would imagine this would be a rare use case).

The second issue (if deemed important enough to address) could also be dealt with by a separate implementation of something like IFieldNameDecoder that uses an 8-bit segment, and asking the user to declare a cluster/class as using that if they have a use case for it.
--
You received this message because you are subscribed to the Google Groups "OrientDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.
