Hi,

Do you already have any numbers on the expected/estimated space savings?
Regards,
-Stefán

On Friday, August 8, 2014 9:23:14 AM UTC, [email protected] wrote:

> Great, thank you both!

On Thursday, 7 August 2014 10:43:21 UTC, Emanuele wrote:

> Hi,
> Yes, we have made good progress on this. The first step was to write a schemaless binary serialization, and that is done (the specs are here: <https://github.com/orientechnologies/orientdb/wiki/Record-Schemaless-Binary-Serialization>).
> The second step is to replace the field definitions in the record (needed by the schemaless format) with the ones declared in the schema.
> The second step is a work in progress now; you can check the status in this issue: #1890 <https://github.com/orientechnologies/orientdb/issues/1890>
>
> I will post here when it is done and the new serialization is enabled by default.

On Wednesday, 6 August 2014 21:56:54 UTC+1, Lvc@ wrote:

> Hi,
> Absolutely yes! Emanuele is in charge of this. We already have the first version working in 2.0-SNAPSHOT, but we're still working to reduce the space used.
>
> Emanuele can be more specific; I think the first public beta of this feature could be next week.
>
> Lvc@

On 6 August 2014 20:29, Stefán <[email protected]> wrote:

> Hi guys,
>
> Have you been able to make some progress on this?
>
> Anxiously awaiting :)
>
> Best regards,
> -Stefan

On Thursday, 15 May 2014 09:05:30 UTC, Steve Coughlan wrote:

> > maybe we could use UTF-8/16 as charset as super set of all charsets?
>
> Which raises the question... Is it safe to assume that UTF-8 IS a superset of all charsets? My lack of charset expertise is showing through here ;)

On 15/05/14 19:02, Luca Garulli wrote:

> On 15 May 2014 10:00, Steve <[email protected]> wrote:
>
> > Is there a way to access this programmatically (without having to do a db query every time)?
> You can get it by:
>
>     String charset = db.getStorage().getConfiguration().getCharset();
>
> > I found OBinarySerializer.bytesToString() and stringToBytes(), which appear to use a single-byte encoding for characters where possible. I think (but I can't say for certain) that this results in a charset-agnostic encoding of each char.
> >
> > The other option (the way I normally do this) is to use String.getBytes(charset), which we could do if there is a global DB charset setting. However, we would run into an issue: if the charset were ever changed, we might have to rewrite every string in the database.
>
> You're right. Maybe we could use UTF-8/16 as the charset, as a superset of all charsets?
>
> Lvc@

On 15/05/14 17:32, Luca Garulli wrote:

> Hi Steve,
> OrientDB already has a charset setting at the database level. To change it:
>
>     alter database charset utf-8
>
> Maybe we could treat char the way you did integer: save the bits if the content doesn't use 2 bytes.
>
> Lvc@

On 15 May 2014 04:17, Steve <[email protected]> wrote:

> I'm just adapting the existing binary field serializers to a modified interface and looking at the existing OStringSerializer. I notice it serializes char by char (i.e. 2 bytes per char). Given that under most charsets the vast majority of text is represented as a single byte per character, I wonder if we could handle this safely using String.getBytes(charset).
>
> The question is: is there a charset that is a superset of all charsets? I.e., can we guarantee that the serialize/deserialize process will never lose or alter data? I'm not really an expert on charsets, so I thought I'd throw this one out there for input.
>
> We could specify a charset per cluster or per DB in the way that MySQL does. It would be a pain for the user to have to specify charsets by default, but if the user is charset aware then we can neatly sidestep this issue.
>
> Any ideas on the best way to handle this?
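[Editor's note: the two questions above — whether UTF-8 can losslessly hold any Java String, and how much space it saves over the 2-bytes-per-char scheme — can be sanity-checked with plain JDK calls. This is a hedged sketch, not OrientDB code; note that UTF-8 representing every Unicode code point is what matters for a Java String, but it does not make UTF-8 byte-compatible with legacy charsets.]

```java
import java.nio.charset.StandardCharsets;

public class CharsetCheck {
    // Encode to UTF-8 and back; for any well-formed String this is lossless,
    // because UTF-8 can represent every Unicode code point.
    public static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // Space used by a char-by-char serializer (2 bytes per UTF-16 char)...
    public static int twoBytesPerChar(String s) { return s.length() * 2; }

    // ...versus encoding through the JDK.
    public static int utf8Bytes(String s) { return s.getBytes(StandardCharsets.UTF_8).length; }

    public static void main(String[] args) {
        String mixed = "Stefán Стефан 你好";   // Latin, Cyrillic, CJK
        if (!mixed.equals(roundTrip(mixed)))
            throw new AssertionError("lossy round trip");

        String ascii = "plain ascii field value";
        System.out.println(twoBytesPerChar(ascii)); // 46
        System.out.println(utf8Bytes(ascii));       // 23: half the space for ASCII-heavy data
    }
}
```

One remaining edge case: a String containing an unpaired surrogate is not well-formed UTF-16, and `getBytes` silently substitutes a replacement character rather than preserving it, so a serializer would want to reject or escape such values.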
> It would be a shame to double the storage size of every string in the DB if it's not necessary.

On 15/05/14 01:22, Luca Garulli wrote:

> Hi Steve,
> I guessed you were super busy, no problem. The binary protocol will be the first thing Emanuele works on, starting at the end of May. Very soon he'll contact you for information about the last version you pushed. He'll help you integrate your implementation into OrientDB so that all the test cases pass (thousands of them).
>
> Thanks,
> Lvc@

On 14 May 2014 13:26, Steve <[email protected]> wrote:

> If I read his last email on the subject correctly, he already has.
>
> Again, sorry to Luca for not responding; I missed the email when he sent it.

On 14/05/14 21:19, [email protected] wrote:

> Hi,
>
> This is good news; now let's hope Luca can find resources for this soon.
>
> Regards,
> -Stefán

On Wednesday, 14 May 2014 11:10:55 UTC, Steve Coughlan wrote:

> Hi Stefan,
>
> Progress has been slow, as I ran into the usual cycle: got bogged down in issues, became obsessed, ended up spending far more time than I expected, caught flak from my employer for neglecting my work, panicked to catch up, and never got back to it ;)
>
> However, I did push an update a couple of days ago. Although many of the extras have not been addressed, I'm now able to persist a binary record inside OrientDB and retrieve it after a restart (proving that it's deserialized from disk, not from cache). This implies also being able to persist the drastically altered schema structure.
>
> Since I had made the field-level serializer pluggable, I've been using jackson-json as the serialization mechanism for easy debugging. Now I need to adjust the existing ODB binary serializers. They all embed the data length in the serialized data, which we don't need to do since we store it in headers. I've also adjusted the interface slightly.
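[Editor's note: a minimal sketch of the "lengths live in the header" idea Steve describes — the header stores the field count and per-field lengths, so the payload bytes carry no embedded length prefix. All names here are hypothetical, and the single-byte counts/lengths (< 128) are a toy simplification; a real format would use varints or fixed-width integers.]

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class HeaderRecord {
    // Serialize fields as: [count][len_0 .. len_n-1][payload bytes].
    // Toy sketch: count and each length must fit in a single byte (< 128).
    public static byte[] write(byte[][] fields) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(fields.length);                    // header: field count
        for (byte[] f : fields) out.write(f.length); // header: per-field length
        for (byte[] f : fields) out.writeBytes(f);   // payload: raw bytes, no prefix
        return out.toByteArray();
    }

    public static byte[][] read(byte[] record) {
        int count = record[0];
        byte[][] fields = new byte[count][];
        int pos = 1 + count;                         // payload starts after header
        for (int i = 0; i < count; i++) {
            int len = record[1 + i];
            fields[i] = Arrays.copyOfRange(record, pos, pos + len);
            pos += len;
        }
        return fields;
    }
}
```

The point of the layout is that a reader can locate any field from the header alone, and field serializers stay length-free.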
> So I just need to massage the existing binary serializers a little to fit the new interface, and we will be back to full binary serialization.
>
> So... some progress, nowhere near as much as I'd hoped, but now that it actually works inside ODB (before, we could only serialize/deserialize to byte arrays using dummy schema objects) I believe it's at a point where we can get other ODB developers involved to review/test/contribute.
>
> I've just noticed a post Luca made a while back, which I'd missed, saying he'd employed someone who'll be focused on this, so I hope we can work together on the rest of the integration. Honestly, integration has been the hardest part. I've learned an awful lot about the internals of ODB the hard way (apologies for the blunt comment, but the documentation is awful and it's very hard to distinguish internal from public API), and I've also learned that I've probably only touched a tiny fraction of it.

On 14/05/14 19:40, [email protected] wrote:

> Hi,
>
> Has something newsworthy happened on this? :)
>
> Best regards,
> -Stefán

On Friday, 18 April 2014 13:57:07 UTC, Lvc@ wrote:

> > Slightly different issue, I think. I wasn't clear: I was actually talking about versioning of individual class schemas rather than a global schema version. This is the part that allows modifying the schema while (in some cases) avoiding a scan/rewrite of all records in the class. Although this is a nice feature to have, it's really quite a separate problem from binary serialization, so I decided to treat them as separate issues, since trying to deal with both at once was really bogging me down. Looking at your issue, though, I'd note that my subclasses of OClassImpl and OPropertyImpl are actually immutable once constructed, so this might help with schema-wide immutability.
>
> Good, this would simplify that issue.
>
> > Also realised that per-record compression will be rather easy to do...
> > But that's in the extras bucket, so I'll leave it as a bonus prize once the core functions are sorted and stable.
>
> We already have per-record compression; what do you mean?
>
> > I wasn't aware of this. Perhaps this occurs in the raw database layer of the code? I haven't come across any compression code. If you already have per-record compression, does this negate any potential value in per-field compression? I.e. if (string.length > 1000) compressString()
>
> We compress at the storage level, but always, not with a threshold. This yields no compression benefit for small records, so compression at marshalling time would be preferable: drivers could send compressed records to improve network I/O.
>
> Lvc@
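[Editor's note: the threshold idea Steve gestures at above (`if (string.length > 1000) compressString()`) could be sketched with the JDK's deflate support as follows. The class name and the 1000-byte cutoff are illustrative only, and a real format would need a flag bit per value recording whether it was compressed, so the reader knows whether to inflate.]

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class FieldCompression {
    static final int THRESHOLD = 1000; // bytes; smaller values are stored as-is

    // Compress only when the value is large enough for deflate to pay off;
    // small values skip compression entirely, matching Luca's point that
    // always-on compression yields no benefit for small records.
    public static byte[] maybeCompress(byte[] value) {
        if (value.length <= THRESHOLD) return value;
        Deflater deflater = new Deflater();
        deflater.setInput(value);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[512];
        while (!deflater.finished())
            out.write(buf, 0, deflater.deflate(buf));
        deflater.end();
        return out.toByteArray();
    }
}
```

Doing this at marshalling time, as Luca suggests, would also let drivers ship the already-compressed bytes over the network.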
