Re: [orientdb] Schema driven serialization #1890

stefan Tue, 15 Apr 2014 08:17:10 -0700

Hi,

Have you guys discussed a timeline for including this in the distribution?


Regards,
  -Stefan

On Monday, 7 April 2014 12:16:20 UTC, [email protected] wrote:
>
>
> Thank you Steve! 
>
> I look forward to using this or pitching in at later stages.
>
> Regards,
>   -Stefan
>
> On Sunday, 6 April 2014 10:22:17 UTC, Steve Coughlan wrote:
>>
>> I've spent the last few days playing with this and I've just pushed the 
>> results so far to 
>> https://github.com/shadders/orientdb/tree/binary-serialization/binary
>>
>> It needs a lot of work to get it integrated into ODB but it's a start and I 
>> wanted to get it up somewhere where the developers can look at it so I can 
>> start asking the questions I need to ask to get it to play with Orient-core.
>>
>> Currently I haven't touched orient-core's code so it's all in a separate 
>> project under the 'binary' directory.  I have tried to align the classes 
>> with Orient's class structure though so I can gradually integrate it.  
>>
>> Tomorrow I will do a proper writeup of where it's at, how I've specified 
>> the format, what questions I need to ask and what barriers I've come across 
>> with the orient internal API.  For now the OBinarySerializer will only work 
>> serializing a document back and forth to an array.  db.save(document) 
>> throws up a few problems which I need to ask questions about.  It also only 
>> handles primitive type OTypes but that is not really a big deal as the 
>> format is quite agnostic to how an individual field is serialized so it's 
>> possibly just a matter of adapting existing serializers or building new 
>> ones for a few OTypes (which doesn't look too hard).
>>
>> I will start with one question though.  My OBinaryDocument class inherits 
>> from ODocument and most constructors match and call super(sameParams) but 
>> for some reason when I save, the document doesn't generate an ORID.  Problem 
>> is ODocument.clusterIds is null, but I can't find how they are set.  Any 
>> hints?
>>
>> For now the record format is documented reasonably well in the class 
>> javadoc for ORecordHeader.
>>
>>
>> On 22/03/14 09:00, [email protected] wrote:
>>  
>>  
>>  More here: 
>> https://www.arangodb.org/2012/07/08/collection-disk-usage-arangodb
>>
>> On Friday, 21 March 2014 21:55:28 UTC, [email protected] wrote: 
>>
>> Hi, 
>>
>>  I came across this and found it interesting, perhaps you will too.
>>
>> https://www.arangodb.org/2012/07/11/infographic-comparing-space-usage-mongodb-couchdb-arangodb
>>
>>  Regards,
>>   -Stefán
>>
>> On Tuesday, 18 March 2014 06:58:08 UTC, Steve Coughlan wrote: 
>>
>> Sure... I'll clean it up a bit over the next few days and document a bit 
>> more clearly.  Then I'll push it.  No sense trying to figure out the whole 
>> API by myself when there's people who know it inside out.  Good thinking ;)
>>
>> On 18/03/14 15:28, Luca Garulli wrote:
>>  
>> Hi Steve, 
>> we're still closing 1.7, so we have a few months for 2.0. What I'd like 
>> is a new branch where everybody can look, contribute and compare approaches.
>>
>>  Could you push it on a branch in GitHub? WDYT?
>>
>>  Lvc@
>>
>>  
>>
>> On 18 March 2014 06:21, Steve <[email protected]> wrote:
>>
>>  Hi Luca,
>>
>> Apologies.  I was able to make a PoC ok but I ran into lots of 
>> difficulties trying to integrate it into orient code (mainly to do with 
>> working out how not to break binary compatibility with the existing binary 
>> protocol).  Then work went insanely busy for a while and I haven't got back 
>> to it yet.  What sort of timelines are you looking at for v2.0?  I know you are 
>> keen to get something like this into that version and obviously if you use 
>> my work your devs will need plenty of time to review and tweak it.  If you 
>> can tell me how long you have then I can give you an idea whether I think I 
>> can realistically deliver or not.  I would like to do this but I don't want 
>> to leave you waiting for me if you have resources that could do it sooner.
>>
>> regards,
>>
>> Steve 
>>
>>
>>
>> On 18/03/14 15:09, Luca Garulli wrote:
>>   
>>  Hi Steve, 
>> have you had the chance to play with this? Any updates?
>>
>>  Lvc@
>>
>>  
>>
>> On 21 February 2014 19:01, Luca Garulli <[email protected]> wrote:
>>
>>   On 20 February 2014 13:24, Steve <[email protected]> wrote:
>>
>>  Hi Andrey,
>>
>> I forked orient-core today and spent most of the day playing around with 
>> the source trying to work out how to change over my pseudo schema, 
>> property, type classes into OSchema, OProperty, OType.  
>> ORecordSerializerDocument2Binary was very useful for understanding things.  
>> Is it actually in use?  I can't find any references to it.
>>  
>>
>>  AFAIK it's not used. It was just a prototype.
>>   
>>
>> Could you explain *"We have many third party drivers for binary 
>> protocol"* a bit more?  Are there any examples?
>>  
>>
>>  All the binary drivers directly manage the current serialization. The 
>> content is sent in binary form to the client, which has to unmarshal it, so 
>> all the binary drivers have implemented it.
>>
>>  At the beginning we could marshall the content in old form when we send 
>> it to the clients, based on client protocol version.
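A minimal sketch of choosing the record format by client protocol version, as suggested above. Everything here is an assumption made for illustration: the class and interface names are hypothetical, and the version threshold is an invented value, not an OrientDB constant.

```java
import java.util.Map;

// Hypothetical sketch, not OrientDB API: picks a record format per client.
final class ProtocolAwareMarshaller {
    // Stand-in for whatever turns a record's fields into bytes.
    interface RecordFormat {
        byte[] marshal(Map<String, Object> fields);
    }

    // Assumed value: first protocol version that understands the new format.
    static final int NEW_FORMAT_MIN_VERSION = 20;

    private final RecordFormat legacyFormat;
    private final RecordFormat newFormat;

    ProtocolAwareMarshaller(RecordFormat legacyFormat, RecordFormat newFormat) {
        this.legacyFormat = legacyFormat;
        this.newFormat = newFormat;
    }

    // Older clients keep receiving the old form; newer ones get the new one.
    byte[] marshalFor(int clientProtocolVersion, Map<String, Object> fields) {
        RecordFormat format = clientProtocolVersion >= NEW_FORMAT_MIN_VERSION
                ? newFormat
                : legacyFormat;
        return format.marshal(fields);
    }
}
```

With this shape, existing third-party drivers would not need to change until they advertise a newer protocol version.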
>>   
>>
>> I also have a question about ORID and whether it can be considered fixed 
>> length.  It contains OClusterPosition which has two implementations.  One 
>> is 8 bytes long and the other is 24 bytes long.  For the purposes of 
>> serialization we can't consider the ORID to be fixed length unless we 
>> guarantee that every instance of ORID within a DB is only one of these 
>> implementations.  Is this the case?
>>  
>>
>>  Consider it as fixed length, the longer is not yet used.
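A minimal sketch of what treating the ORID as fixed length could look like. It assumes a 2-byte cluster id followed by the 8-byte cluster position mentioned above (10 bytes in total); the class and method names are hypothetical and not taken from orient-core.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of OrientDB: encodes a record id as a
// fixed 10-byte value (assumed 2-byte cluster id + 8-byte cluster position).
final class FixedRidCodec {
    static final int RID_SIZE = Short.BYTES + Long.BYTES; // 10 bytes

    static byte[] encode(short clusterId, long clusterPosition) {
        return ByteBuffer.allocate(RID_SIZE)
                .putShort(clusterId)
                .putLong(clusterPosition)
                .array();
    }

    // Reads the cluster id from the first 2 bytes.
    static short clusterId(byte[] rid) {
        return ByteBuffer.wrap(rid).getShort(0);
    }

    // Reads the cluster position from the remaining 8 bytes.
    static long clusterPosition(byte[] rid) {
        return ByteBuffer.wrap(rid).getLong(Short.BYTES);
    }
}
```

Because every encoded id has the same size, a link field can live in the fixed-length part of a record without a length prefix.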
>>   
>>
>> At the moment I'm also wrestling with what to do about null fixed length 
>> fields and whether to reserve space inside a record.  Whilst headers are 
>> ordered by schema_fixed_length, schema_variable_length, schema_less fields 
>> there's no reason data needs to follow the same order.  But by default it 
>> probably would.  Consider an object schema like this:
>> class SomeClass {
>>     update_time: DateTime //fixed length
>>     short_string: String
>>     massive_string: String
>> }
>>
>> If we first write the record and update_time is null we'd have something 
>> like this
>> update_time:0 bytes|short_string: 10 bytes|massive_string:100kbytes
>>
>> Then we update it to add update_time we have a few options.
>> 1/ When originally writing the object reserve space even though the value 
>> is null (wasted space)
>> 2/ Search for a hole.  e.g. if short_string has been set to null we could 
>> steal its space.
>> 3/ Write the update_time field after massive_string (if there is space 
>> before the beginning of the next record).  Potentially we are writing into a
>> different disk block, so for future reads when we aren't interested in 
>> massive_string we still have to load that block into memory.
>> 4/ Rewrite the entire record.
>>
>> I suppose it is worth considering whether there's a benefit to reserving 
>> partial holes.  i.e. if we have 10 * 4 byte nullable fixed length fields 
>> (all null on initial write) should we take a guess and reserve say 10 out 
>> of the 40 possible bytes for future updates?  But I'm probably getting 
>> ahead of myself.  I'll work on a simple implementation first before trying 
>> to be too clever ;)
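Option 1 above (reserving space for a null fixed-length field) can be sketched as follows. The layout is a hypothetical one chosen for illustration, not the format in the binary-serialization branch: a 1-byte null flag, an 8-byte update_time slot that is always present, then a length-prefixed string.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical layout, for illustration only:
// [1-byte null flag][8-byte update_time slot][4-byte length][short_string bytes]
final class ReservedSlotRecord {
    static final int FLAG_OFFSET = 0;
    static final int TIME_OFFSET = 1;
    static final int VARIABLE_OFFSET = TIME_OFFSET + Long.BYTES;

    // The 8-byte slot is always written, even when updateTime is null.
    static byte[] write(Long updateTime, String shortString) {
        byte[] text = shortString.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(VARIABLE_OFFSET + Integer.BYTES + text.length);
        buf.put((byte) (updateTime == null ? 1 : 0));
        buf.putLong(updateTime == null ? 0L : updateTime);
        buf.putInt(text.length);
        buf.put(text);
        return buf.array();
    }

    // Setting the field later touches only the reserved bytes: the record
    // keeps its length and the variable-length data does not move.
    static void setUpdateTime(byte[] record, long updateTime) {
        ByteBuffer buf = ByteBuffer.wrap(record);
        buf.put(FLAG_OFFSET, (byte) 0);
        buf.putLong(TIME_OFFSET, updateTime);
    }

    static Long readUpdateTime(byte[] record) {
        ByteBuffer buf = ByteBuffer.wrap(record);
        return buf.get(FLAG_OFFSET) == 1 ? null : Long.valueOf(buf.getLong(TIME_OFFSET));
    }

    static String readShortString(byte[] record) {
        int length = ByteBuffer.wrap(record).getInt(VARIABLE_OFFSET);
        return new String(record, VARIABLE_OFFSET + Integer.BYTES, length, StandardCharsets.UTF_8);
    }
}
```

The cost is 8 bytes per record while the field is null; the benefit is that a later update never has to search for a hole, append past the large string, or rewrite the record.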
>>
>>
>>  Good question.
>>
>>  I think that reserving space for fixed length fields has the advantage 
>> to keep the fixed size area as is and fixed length fields are usually 
>> small, maximum 8 bytes each.
>>
>>  By the way, datetimes are now stored as long, so probably a -1 could 
>> mean NULL. We should figure out how to represent NULL on each type.
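A minimal sketch of the -1 idea for a datetime stored as a long. The class name is hypothetical; only the choice of -1 as the NULL marker comes from the message above.

```java
import java.util.Date;

// Hypothetical sketch: a nullable datetime kept in a fixed 8-byte slot,
// with -1 standing in for NULL.
final class NullableDateTimeSlot {
    static final long NULL_SENTINEL = -1L;

    // Maps null to the sentinel, anything else to epoch milliseconds.
    static long encode(Date value) {
        return value == null ? NULL_SENTINEL : value.getTime();
    }

    // Maps the sentinel back to null, anything else to a Date.
    static Date decode(long stored) {
        return stored == NULL_SENTINEL ? null : new Date(stored);
    }
}
```

One caveat with any sentinel: -1 is itself a valid timestamp (one millisecond before the epoch), so that instant could no longer be stored. A separate null flag or bitmap avoids the collision at the cost of extra space.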
>>
>>  Lvc@
>>   
>>
>>  
>> On 20/02/14 20:12, Andrey Lomakin wrote:
>>  
>> Hi Steve, 
>> Good that you are going to help us.
>> A few additional points:
>> 1.  We already have binary serialization support; you can see it here 
>> com.orientechnologies.common.serialization.types.OBinarySerializer so 
>> obviously we should not have several versions of the same thing. Also I think it 
>> will be interesting for you to look at this issue and discussion here 
>> https://github.com/orientechnologies/orientdb/issues/681#issuecomment-28466948.
>>  We discussed serialization of single record (sorry had no time to analyze 
>> it deeply because a lot of events) but in case of SQL query you have to 
>> process millions of them. 
>> 2.  We are working on binary compatibility mechanics too (I mean 
>> compatibility between storage formats), without it current users will not 
>> be able to accomplish new features especially binary serialization.
>>  3.  We have many third party drivers for binary protocol (which pass 
>> serialized records on client's side) so we have to think how to not break 
>> functionality of these drivers.
>>
>>   
>>
>> On Wed, Feb 19, 2014 at 1:53 PM, Steve <[email protected]> wrote:
>>
>>  Hi Luca,
>>
>> I'll give it a go with the real ODB code.  The reason I didn't is because 
>> I'm actually quite new to ODB even as an end user but your instructions 
>> will set me in the right direction.  Most of my experience with data 
>> serialization formats has been with Bitcoin which was mostly for network 
>> protocol use cases rather than big-data storage.  But that was also a high 
>> performance scenario so I guess there are a lot of parallels. 
>>
>>
>> On 19/02/14 
>>
>> ...
>
>
