Re: [orientdb] Schema driven serialization #1890

stefan Fri, 21 Mar 2014 14:56:14 -0700

Hi,

I came across this and found it interesting, perhaps you will too.
https://www.arangodb.org/2012/07/11/infographic-comparing-space-usage-mongodb-couchdb-arangodb


Regards,
  -Stefán

On Tuesday, 18 March 2014 06:58:08 UTC, Steve Coughlan wrote:
>
> Sure... I'll clean it up a bit over the next few days and document a bit 
> more clearly.  Then I'll push it.  No sense trying to figure out the whole 
> API by myself when there's people who know it inside out.  Good thinking ;)
>
> On 18/03/14 15:28, Luca Garulli wrote:
>  
> Hi Steve, 
> we're still closing 1.7, so we have a few months for 2.0. What I'd like is 
> a new branch where everybody can look, contribute and compare approaches.
>
>  Could you push it on a branch in GitHub? WDYT?
>
>  Lvc@
>
>  
>
> On 18 March 2014 06:21, Steve <[email protected]> wrote:
>
>  Hi Luca,
>
> Apologies.  I was able to make a PoC OK, but I ran into lots of 
> difficulties trying to integrate it into the Orient code (mainly to do with 
> working out how not to break binary compatibility with the existing binary 
> protocol).  Then work went insanely busy for a while and I haven't got back 
> to it yet.  What sort of timelines are you looking at for v2.0?  I know you are 
> keen to get something like this into that version and obviously if you use 
> my work your devs will need plenty of time to review and tweak it.  If you 
> can tell me how long you have then I can give you an idea whether I think I 
> can realistically deliver or not.  I would like to do this but I don't want 
> to leave you waiting for me if you have resources that could do it sooner.
>
> regards,
>
> Steve 
>
>
>
> On 18/03/14 15:09, Luca Garulli wrote:
>   
>  Hi Steve, 
> have you had the chance to play with this? Any updates?
>
>  Lvc@
>
>  
>
> On 21 February 2014 19:01, Luca Garulli <[email protected]> wrote:
>
>   On 20 February 2014 13:24, Steve <[email protected]> wrote:
>
>  Hi Andrey,
>
> I forked orient-core today and spent most of the day playing around with 
> the source trying to work out how to change over my pseudo schema, 
> property, type classes into OSchema, OProperty, OType.  
> ORecordSerializerDocument2Binary was very useful for understanding things.  
> Is it actually in use?  I can't find any references to it.
>  
>
>  AFAIK it's not used. It was just a prototype.
>   
>
> Could you explain *"We have many third party drivers for binary protocol"* 
> a bit more?  Are there any examples?
>  
>
>  All the binary drivers directly manage the current serialization. The 
> content is sent in binary form to the client, which has to unmarshall it, so 
> all the binary drivers have implemented it.
>
>  At the beginning we could marshall the content in the old form when we send 
> it to the clients, based on the client protocol version.
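One way to picture the version-based fallback described above is a small chooser that picks the wire format per connection. This is only a sketch: the enum, the class name and the version threshold are all hypothetical, invented for illustration, and the real protocol version numbers are not given in this thread.

```java
/** Hypothetical sketch, not OrientDB code: pick a wire format from the client's protocol version. */
final class WireFormatChooser {

    enum WireFormat { LEGACY, SCHEMA_BINARY }

    /** Made-up threshold: the first protocol version assumed to understand the new format. */
    static final int FIRST_VERSION_WITH_NEW_FORMAT = 100;

    /** Older clients keep receiving the old form; newer ones get the new binary form. */
    static WireFormat formatFor(int clientProtocolVersion) {
        return clientProtocolVersion >= FIRST_VERSION_WITH_NEW_FORMAT
                ? WireFormat.SCHEMA_BINARY
                : WireFormat.LEGACY;
    }
}
```

With a check like this the storage format can change while existing third party drivers keep seeing the serialization they already parse.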
>   
>
> I also have a question about ORID and whether it can be considered fixed 
> length.  It contains OClusterPosition which has two implementations.  One 
> is 8 bytes long and the other is 24 bytes long.  For the purposes of 
> serialization we can't consider the ORID to be fixed length unless we 
> guarantee that every instance of ORID within a DB is only one of these 
> implementations.  Is this the case?
>  
>
>  Consider it as fixed length, the longer one is not yet used.
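To make the benefit of a fixed-length ORID concrete, here is a minimal sketch. It assumes a 2-byte cluster id followed by the 8-byte cluster position mentioned above; the 2-byte id and the class and method names are assumptions for illustration, not OrientDB's actual API.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: a RID stored as a fixed 10-byte value (assumed 2-byte cluster id + 8-byte position). */
final class FixedRidSketch {

    static final int RID_SIZE = Short.BYTES + Long.BYTES; // 10 bytes under the assumption above

    /** Packs the cluster id and cluster position into exactly RID_SIZE bytes. */
    static byte[] encode(short clusterId, long clusterPosition) {
        return ByteBuffer.allocate(RID_SIZE).putShort(clusterId).putLong(clusterPosition).array();
    }

    /** Reads a RID back from buf at offset; returns {clusterId, clusterPosition}. */
    static long[] decode(byte[] buf, int offset) {
        ByteBuffer in = ByteBuffer.wrap(buf, offset, RID_SIZE);
        return new long[] { in.getShort(), in.getLong() };
    }

    /** Because every RID has the same size, the i-th RID in a packed list sits at i * RID_SIZE. */
    static int offsetOf(int index) {
        return index * RID_SIZE;
    }
}
```

If the 24-byte cluster position were ever mixed in, offsetOf() would no longer work and each entry would need a length or type marker.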
>   
>
> At the moment I'm also wrestling with what to do about null fixed length 
> fields and whether to reserve space inside a record.  Whilst headers are 
> ordered by schema_fixed_length, schema_variable_length, schema_less fields 
> there's no reason data needs to follow the same order.  But by default it 
> probably would.  Consider an object schema like this:
> class SomeClass {
>     update_time: DateTime //fixed length
>     short_string: String
>     massive_string: String
> }
>
> If we first write the record and update_time is null we'd have something 
> like this
> update_time:0 bytes|short_string: 10 bytes|massive_string:100kbytes
>
> Then, when we update it to add update_time, we have a few options:
> 1/ When originally writing the object, reserve space even though the value 
> is null (wasted space).
> 2/ Search for a hole, e.g. if short_string has been set to null we could 
> steal its space.
> 3/ Write the update_time field after massive_string (if there is space 
> before the beginning of the next record).  Potentially we are writing into a 
> different disk block, so for future reads, even when we aren't interested in 
> massive_string, we still have to load that block into memory.
> 4/ Rewrite the entire record.
>
> I suppose it is worth considering whether there's a benefit to reserving 
> partial holes.  i.e. if we have 10 * 4 byte nullable fixed length fields 
> (all null on initial write) should we take a guess and reserve say 10 out 
> of the 40 possible bytes for future updates?  But I'm probably getting 
> ahead of myself.  I'll work on a simple implementation first before trying 
> to be too clever ;)
>
>
>  Good question.
>
>  I think that reserving space for fixed length fields has the advantage 
> of keeping the fixed size area as is, and fixed length fields are usually 
> small, maximum 8 bytes each.
>
>  By the way, datetimes are now stored as long, so probably a -1 could 
> mean NULL. We should figure out how to represent NULL for each type.
>
>  Lvc@
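Option 1 combined with the -1 idea can be sketched as below: the 8-byte slot for update_time is always reserved, null is written as a sentinel, and a later update overwrites the slot in place. The class, method and constant names are hypothetical, and the sketch assumes -1 never occurs as a real timestamp.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: a record with one reserved fixed-length slot followed by variable-length data. */
final class ReservedSlotSketch {

    static final long NULL_DATETIME = -1L;       // sentinel from the thread; assumes -1 is never a real value
    static final int UPDATE_TIME_OFFSET = 0;     // the slot sits at the start of the record
    static final int FIXED_AREA_SIZE = Long.BYTES;

    /** Writes the record; the 8-byte slot is reserved even when updateTime is null. */
    static byte[] write(Long updateTime, byte[] variablePart) {
        ByteBuffer buf = ByteBuffer.allocate(FIXED_AREA_SIZE + variablePart.length);
        buf.putLong(updateTime == null ? NULL_DATETIME : updateTime.longValue());
        buf.put(variablePart);
        return buf.array();
    }

    /** Later update: overwrite the slot in place, so nothing after it has to move. */
    static void setUpdateTime(byte[] record, long updateTime) {
        ByteBuffer.wrap(record).putLong(UPDATE_TIME_OFFSET, updateTime);
    }

    /** Returns null when the slot still holds the sentinel. */
    static Long readUpdateTime(byte[] record) {
        long raw = ByteBuffer.wrap(record).getLong(UPDATE_TIME_OFFSET);
        return raw == NULL_DATETIME ? null : Long.valueOf(raw);
    }
}
```

The cost is 8 bytes per record while the field is null; the gain is that setting it later never touches short_string or massive_string.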
>   
>
>  
> On 20/02/14 20:12, Andrey Lomakin wrote:
>  
> Hi Steve, 
> Good that you are going to help us.
> Some additional information:
> 1.  We already have binary serialization support, you can see it here 
> com.orientechnologies.common.serialization.types.OBinarySerializer so 
> obviously we should not have several versions of the same thing. Also I think it 
> will be interesting for you to look at this issue and the discussion here 
> https://github.com/orientechnologies/orientdb/issues/681#issuecomment-28466948.
>  We discussed serialization of a single record (sorry, I had no time to analyze 
> it deeply because of a lot of events) but in the case of an SQL query you have to 
> process millions of them. 
> 2.  We are working on binary compatibility mechanics too (I mean 
> compatibility between storage formats); without it current users will not 
> be able to use new features, especially binary serialization.
>  3.  We have many third party drivers for binary protocol (which pass 
> serialized records on the client's side) so we have to think about how not to break 
> the functionality of these drivers.
>
>   
>
> On Wed, Feb 19, 2014 at 1:53 PM, Steve <[email protected]> wrote:
>
>  Hi Luca,
>
> I'll give it a go with the real ODB code.  The reason I didn't is because 
> I'm actually quite new to ODB even as an end user but your instructions 
> will set me in the right direction.  Most of my experience with data 
> serialization formats has been with Bitcoin which was mostly for network 
> protocol use cases rather than big-data storage.  But that was also a high 
> performance scenario so I guess there are a lot of parallels. 
>
>
> On 19/02/14 21:33, Luca Garulli wrote:
>  
>  Hi Steve,
>  sorry for such a delay.
>
>  I like your ideas, I think this is the right direction. varint8 and 
> varint16 could be a good way to save space, but we should consider when 
> this slows down some use cases, like partial field loading.
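The trade-off is easier to see with a concrete encoding. The sketch below uses a generic base-128 variable-length integer (7 data bits per byte, high bit set when more bytes follow); it is not necessarily the varint8/varint16 scheme from the proposal, and the class name is hypothetical.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper, not part of OrientDB: base-128 variable-length encoding of non-negative ints. */
final class VarIntSketch {

    /** Encodes value with 7 data bits per byte; the high bit marks "more bytes follow". */
    static byte[] encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("this sketch only supports non-negative values");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits plus continuation flag
            value >>>= 7;
        }
        out.write(value);                     // last byte, continuation flag clear
        return out.toByteArray();
    }

    /** Decodes a value written by encode(), starting at offset. */
    static int decode(byte[] buf, int offset) {
        int result = 0;
        int shift = 0;
        while (true) {
            int b = buf[offset++] & 0xFF;
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }
}
```

Small values take one byte instead of four, but the size of each entry is only known after decoding it, so a reader that wants one field cannot jump to a fixed offset and has to walk the preceding lengths first. That is the partial field loading cost mentioned above.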
>
>  About the POC you created, I think it would be much more useful if you 
> played with real documents. It's easy and you could push it to a separate 
> branch to let us and other developers contribute & test. WDYT?
>
>  Follow these steps:
>
>   (1) create your serializer
>
>  This is the skeleton of the class to implement:
>
>  public class BinaryDocumentSerializer implements ORecordSerializer {
>    public static final String NAME = "binarydoc";
>
>    // UN-MARSHALLING
>    public ORecordInternal<?> fromStream(final byte[] iSource) {
>    }
>
>    // PARTIAL UN-MARSHALLING
>    public ORecordInternal<?> fromStream(final byte[] iSource, final ORecordInternal<?> iRecord, String[] iFields) {
>    }
>
>    // MARSHALLING
>    public byte[] toStream(final ORecordInternal<?> iSource, boolean iOnlyDelta) {
>    }
>  }
>  
>  (2) register your implementation
>
>  ORecordSerializerFactory.instance().register(BinaryDocumentSerializer.NAME, 
> new BinaryDocumentSerializer());
>
>  (3) create a new ODocument subclass
>  
>  Then create a new class that extends ODocument but uses your 
> implementation:
>
>  public class BinaryDocument extends ODocument {
>    protected void setup() {
>      super.setup();
>      _recordFormat = ORecordSerializerFactory.instance().getFormat(BinaryDocumentSerializer.NAME);
>    }
>  }
>
>  (4) Try it!
>  
>  And now try to create a BinaryDocument, set fields and call .save(). The 
> method BinaryDocumentSerializer.toStream() will be called. 
>
> ...

