Not many details there, sorry; these "shapes" just caught my attention.  

On Friday, 21 March 2014 21:55:28 UTC, [email protected] wrote:
>
> Hi,
>
> I came across this and found it interesting, perhaps you will too.
>
> https://www.arangodb.org/2012/07/11/infographic-comparing-space-usage-mongodb-couchdb-arangodb
>
> Regards,
>   -Stefán
>
> On Tuesday, 18 March 2014 06:58:08 UTC, Steve Coughlan wrote:
>>
>> Sure... I'll clean it up a bit over the next few days and document a bit 
>> more clearly.  Then I'll push it.  No sense trying to figure out the whole 
>> API by myself when there's people who know it inside out.  Good thinking ;)
>>
>> On 18/03/14 15:28, Luca Garulli wrote:
>>  
>> Hi Steve, 
>> we're still closing 1.7, so we have a few months for 2.0. What I'd like 
>> is a new branch where everybody can look, contribute and compare approaches.
>>
>>  Could you push it on a branch in GitHub? WDYT?
>>
>>  Lvc@
>>
>>  
>>
>> On 18 March 2014 06:21, Steve <[email protected]> wrote:
>>
>>  Hi Luca,
>>
>> Apologies.  I was able to make a PoC OK, but I ran into lots of 
>> difficulties trying to integrate it into the Orient code (mainly to do with 
>> working out how not to break binary compatibility with the existing binary 
>> protocol).  Then work got insanely busy for a while and I haven't got back 
>> to it yet.  What sort of timeline are you looking at for v2.0?  I know you are 
>> keen to get something like this into that version, and obviously if you use 
>> my work your devs will need plenty of time to review and tweak it.  If you 
>> can tell me how long you have, I can give you an idea of whether I think I 
>> can realistically deliver or not.  I would like to do this, but I don't want 
>> to leave you waiting for me if you have resources that could do it sooner.
>>
>> regards,
>>
>> Steve 
>>
>>
>>
>> On 18/03/14 15:09, Luca Garulli wrote:
>>   
>>  Hi Steve, 
>> have you had the chance to play with this? Any updates?
>>
>>  Lvc@
>>
>>  
>>
>> On 21 February 2014 19:01, Luca Garulli <[email protected]> wrote:
>>
>>   On 20 February 2014 13:24, Steve <[email protected]> wrote:
>>
>>  Hi Andrey,
>>
>> I forked orient-core today and spent most of the day playing around with 
>> the source trying to work out how to change over my pseudo schema, 
>> property, type classes into OSchema, OProperty, OType.  
>> ORecordSerializerDocument2Binary was very useful for understanding things.  
>> Is it actually in use?  I can't find any references to it.
>>  
>>
>>  AFAIK it's not used. It was just a prototype.
>>   
>>
>> Could you explain *"We have many third party drivers for binary 
>> protocol"* a bit more?  Are there any examples?
>>  
>>
>>  All the binary drivers handle the current serialization directly.  The 
>> content is sent in binary form to the client, which has to unmarshal it, 
>> so all the binary drivers implement this format.
>>
>>  At the beginning we could marshal the content in the old format when we 
>> send it to clients, based on the client protocol version.
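Such a version-based fallback could be sketched like this (the version threshold, class name, and format names are illustrative assumptions, not OrientDB's actual constants):

```java
// Hypothetical sketch of per-client format negotiation: clients on an
// older protocol version get the legacy serialization, newer ones get
// the binary format. The threshold value 26 is an assumption.
final class FormatDispatch {
  static final int FIRST_BINARY_VERSION = 26; // illustrative, not real

  // Returns the serializer name to use for a given client connection.
  static String chooseFormat(int clientProtocolVersion) {
    return clientProtocolVersion >= FIRST_BINARY_VERSION
        ? "binarydoc"   // the new binary serializer
        : "csv";        // legacy format for old drivers
  }
}
```

The point is that the server, not the driver, decides at connection time, so existing third-party drivers keep working unchanged.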
>>   
>>
>> I also have a question about ORID and whether it can be considered fixed 
>> length.  It contains OClusterPosition which has two implementations.  One 
>> is 8 bytes long and the other is 24 bytes long.  For the purposes of 
>> serialization we can't consider the ORID to be fixed length unless we 
>> guarantee that every instance of ORID within a DB is only one of these 
>> implementations.  Is this the case?
>>  
>>
>>  Consider it fixed length; the longer implementation is not yet used.
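Assuming only the 8-byte cluster position is in play, an ORID could then be serialized as a fixed 10 bytes. A hypothetical sketch (names and layout are my assumptions, not the OrientDB API):

```java
import java.nio.ByteBuffer;

// Illustrative fixed-length RID encoding: a 2-byte cluster id followed
// by an 8-byte cluster position, 10 bytes total.
final class RidCodec {
  static final int RID_SIZE = Short.BYTES + Long.BYTES; // 10 bytes

  static byte[] encode(short clusterId, long clusterPosition) {
    return ByteBuffer.allocate(RID_SIZE)
        .putShort(clusterId)
        .putLong(clusterPosition)
        .array();
  }

  static short decodeClusterId(byte[] rid) {
    return ByteBuffer.wrap(rid).getShort(0);
  }

  static long decodePosition(byte[] rid) {
    return ByteBuffer.wrap(rid).getLong(Short.BYTES);
  }
}
```

Treating the RID as fixed length keeps the fixed-size header area addressable by offset, which matters for the partial-field loading discussed below in the thread.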
>>   
>>
>> At the moment I'm also wrestling with what to do about null fixed-length 
>> fields and whether to reserve space inside a record.  Whilst headers are 
>> ordered as schema_fixed_length, schema_variable_length, schema_less fields, 
>> there's no reason the data needs to follow the same order, though by default 
>> it probably would.  Consider an object schema like this:
>> class SomeClass {
>>     update_time: DateTime //fixed length
>>     short_string: String
>>     massive_string: String
>> }
>>
>> If we first write the record and update_time is null, we'd have something 
>> like this:
>> update_time:0 bytes|short_string: 10 bytes|massive_string:100kbytes
>>
>> Then when we update it to add update_time, we have a few options:
>> 1/ When originally writing the object, reserve space even though the value 
>> is null (wasted space).
>> 2/ Search for a hole, e.g. if short_string has been set to null we could 
>> steal its space.
>> 3/ Write the update_time field after massive_string (if there is space 
>> before the beginning of the next record).  Potentially we are then writing 
>> into a different disk block, so for future reads where we aren't interested 
>> in massive_string we still have to load that block into memory.
>> 4/ Rewrite the entire record.
>>
>> I suppose it is worth considering whether there's a benefit to reserving 
>> partial holes, i.e. if we have 10 * 4-byte nullable fixed-length fields 
>> (all null on the initial write), should we take a guess and reserve say 10 
>> out of the 40 possible bytes for future updates?  But I'm probably getting 
>> ahead of myself.  I'll work on a simple implementation first before trying 
>> to be too clever ;)
>>
>>
>>  Good question.
>>
>>  I think that reserving space for fixed-length fields has the advantage 
>> of keeping the fixed-size area as is, and fixed-length fields are usually 
>> small, 8 bytes each at most.
>>
>>  By the way, datetimes are now stored as longs, so a -1 could probably 
>> mean NULL. We should figure out how to represent NULL for each type.
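An alternative to per-type sentinel values like -1 would be a small null bitmap in the record header, one bit per fixed-length field; the space for the field stays reserved (option 1 above), so setting it later is an in-place write. A minimal sketch, with all names my own:

```java
import java.util.BitSet;

// Illustrative per-record null bitmap for fixed-length fields: a set bit
// means the field is NULL. The field's slot stays reserved either way,
// so nulling or filling a field never moves other data.
final class NullBitmap {
  private final BitSet bits;

  NullBitmap(int fieldCount) { bits = new BitSet(fieldCount); }

  void setNull(int fieldIndex)    { bits.set(fieldIndex); }
  void setPresent(int fieldIndex) { bits.clear(fieldIndex); }
  boolean isNull(int fieldIndex)  { return bits.get(fieldIndex); }

  // Header cost: one bit per fixed-length field, rounded up to bytes.
  static int headerBytes(int fieldCount) { return (fieldCount + 7) / 8; }
}
```

The appeal is that it works uniformly for every type, so -1 (or any other in-band value) stays usable as real data.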
>>
>>  Lvc@
>>   
>>
>>  
>> On 20/02/14 20:12, Andrey Lomakin wrote:
>>  
>> Hi Steve, 
>> Good that you are going to help us.
>> A few additional points:
>> 1.  We already have binary serialization support; you can see it here: 
>> com.orientechnologies.common.serialization.types.OBinarySerializer, so 
>> obviously we should not have several versions of the same thing. Also I 
>> think it will be interesting for you to look at this issue and discussion 
>> here: 
>> https://github.com/orientechnologies/orientdb/issues/681#issuecomment-28466948.
>>  We discussed serialization of a single record (sorry, I had no time to 
>> analyze it deeply because of a lot of events), but in the case of an SQL 
>> query you have to process millions of them. 
>> 2.  We are working on binary compatibility mechanics too (I mean 
>> compatibility between storage formats); without it, current users will not 
>> be able to adopt new features, especially binary serialization.
>>  3.  We have many third-party drivers for the binary protocol (which parse 
>> serialized records on the client's side), so we have to think about how 
>> not to break the functionality of these drivers.
>>
>>   
>>
>> On Wed, Feb 19, 2014 at 1:53 PM, Steve <[email protected]> wrote:
>>
>>  Hi Luca,
>>
>> I'll give it a go with the real ODB code.  The reason I didn't is that 
>> I'm actually quite new to ODB, even as an end user, but your instructions 
>> will set me in the right direction.  Most of my experience with data 
>> serialization formats has been with Bitcoin, which was mostly for network 
>> protocol use cases rather than big-data storage.  But that was also a 
>> high-performance scenario, so I guess there are a lot of parallels. 
>>
>>
>> On 19/02/14 21:33, Luca Garulli wrote:
>>  
>>  Hi Steve,
>>  sorry for such delay.
>>
>>  I like your ideas, I think this is the right direction. varint8 and 
>> varint16 could be a good way to save space, but we should consider whether 
>> this slows down some use cases, like partial field loading.
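For concreteness, here is a sketch of the standard 7-bits-per-byte varint scheme (the same idea used by Protocol Buffers); the class name is mine, and whether OrientDB would use exactly this layout is an open question:

```java
import java.io.ByteArrayOutputStream;

// LEB128-style varint: each byte carries 7 payload bits; the high bit
// flags that another byte follows. Small values cost 1 byte instead of 8.
final class VarIntCodec {
  static byte[] encode(long value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((value & ~0x7FL) != 0) {
      out.write((int) ((value & 0x7F) | 0x80)); // more bytes follow
      value >>>= 7;
    }
    out.write((int) value); // final byte, high bit clear
    return out.toByteArray();
  }

  static long decode(byte[] bytes) {
    long value = 0;
    int shift = 0;
    for (byte b : bytes) {
      value |= (long) (b & 0x7F) << shift;
      shift += 7;
      if ((b & 0x80) == 0) break; // high bit clear: last byte
    }
    return value;
  }
}
```

The slowdown concern is real: variable-length values mean field offsets can no longer be computed without scanning, which is exactly what hurts partial field loading.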
>>
>>  About the PoC you created, I think it would be much more useful if you 
>> played with real documents. It's easy, and you could push it to a separate 
>> branch to let us and other developers contribute & test. WDYT?
>>
>>  Follow these steps:
>>
>>   (1) create your serializer
>>
>>  This is the skeleton of the class to implement:
>>
>>  public class BinaryDocumentSerializer implements ORecordSerializer {
>>    public static final String NAME = "binarydoc";
>>
>>    // UN-MARSHALLING
>>    public ORecordInternal<?> fromStream(final byte[] iSource) {
>>    }
>>
>>    // PARTIAL UN-MARSHALLING
>>    public ORecordInternal<?> fromStream(final byte[] iSource,
>>        final ORecordInternal<?> iRecord, String[] iFields) {
>>    }
>>
>>    // MARSHALLING
>>    public byte[] toStream(final ORecordInternal<?> iSource,
>>        boolean iOnlyDelta) {
>>    }
>>  }
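To make the skeleton less abstract, here is a self-contained toy of what a toStream/fromStream pair might do, restricted to string fields only; it is not the OrientDB API, just the shape of the work (a count, then length-prefixed name/value pairs):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy round-trip codec: int field count, then for each field a
// length-prefixed UTF-8 name and value. A real serializer would handle
// every OType plus the header layout discussed earlier in the thread.
final class ToyDocCodec {
  static byte[] toStream(Map<String, String> fields) {
    ByteBuffer buf = ByteBuffer.allocate(4096);
    buf.putInt(fields.size());
    for (Map.Entry<String, String> e : fields.entrySet()) {
      putString(buf, e.getKey());
      putString(buf, e.getValue());
    }
    byte[] out = new byte[buf.position()];
    buf.flip();
    buf.get(out);
    return out;
  }

  static Map<String, String> fromStream(byte[] source) {
    ByteBuffer buf = ByteBuffer.wrap(source);
    int count = buf.getInt();
    Map<String, String> fields = new LinkedHashMap<>();
    for (int i = 0; i < count; i++) {
      fields.put(getString(buf), getString(buf));
    }
    return fields;
  }

  private static void putString(ByteBuffer buf, String s) {
    byte[] b = s.getBytes(StandardCharsets.UTF_8);
    buf.putInt(b.length);
    buf.put(b);
  }

  private static String getString(ByteBuffer buf) {
    byte[] b = new byte[buf.getInt()];
    buf.get(b);
    return new String(b, StandardCharsets.UTF_8);
  }
}
```

The real implementation would instead walk the ODocument's fields and dispatch on OType, but the framing concerns (counts, lengths, offsets) are the same.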
>>  
>>  (2) register your implementation
>>
>>  ORecordSerializerFactory.instance().register(BinaryDocumentSerializer.NAME, 
>> new BinaryDocumentSerializer());
>>
>>  (3) create a new ODocument subclass
>>  
>>  Then create a new class that extends ODocument but uses your 
>> implementation:
>>
>>  public class BinaryDocument extends ODocument {
>>   protected void setup() {
>>     super.setup();
>>     _recordFormat = 
>> ORecordSerializerFactory.instance().getFormat(BinaryDocumentSerializer.NAME);
>>   }
>>  }
>>
>>  (4) Try it!
>>  
>>  And now try to create a BinaryDocument, set fields and call .save(). 
>> The method BinaryDocumentSerializer.toStream() will be called. 
>>
>> ...
>
>

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"OrientDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.
