Hi,

Yes, we have made good progress on this. The first step was to write a schemaless binary serialization, and that is done (the specs are here: <https://github.com/orientechnologies/orientdb/wiki/Record-Schemaless-Binary-Serialization>). The second step is to replace the field definitions in the record (needed by the schemaless format) with the ones declared in the schema. That step is a work in progress now; you can check the status in issue #1890: <https://github.com/orientechnologies/orientdb/issues/1890>
I will post here when it is done and the new serialization is enabled by default.

On Wednesday, 6 August 2014 21:56:54 UTC+1, Lvc@ wrote:

> Hi,
> Absolutely yes! Emanuele is in charge of this. We already have the first version working in 2.0-SNAPSHOT, but we're still working to improve the space used.
>
> Emanuele can be more specific; I think the first public beta of this feature could be next week.
>
> Lvc@

On 6 August 2014 20:29, Stefán <[email protected]> wrote:

> Hi guys,
>
> Have you been able to make some progress on this?
>
> Anxiously awaiting :)
>
> Best regards,
> -Stefan

On Thursday, 15 May 2014 09:05:30 UTC, Steve Coughlan wrote:

> > maybe we could use UTF-8/16 as charset as super set of all charsets?
>
> Which raises the question... is it safe to assume that UTF-8 IS a superset of all charsets? My lack of charset expertise showing through here ;)

On 15/05/14 19:02, Luca Garulli wrote:

> On 15 May 2014 10:00, Steve <[email protected]> wrote:
>
> > Is there a way to access this programmatically (without having to do a db query every time)?
>
> You can get it by:
>
>     String charset = db.getStorage().getConfiguration().getCharset()
>
> > I found OBinarySerializer.bytesToString() and stringToBytes(), which appear to use a single-byte encoding for characters where possible. I think (but I can't say for certain) that this will result in a charset-agnostic encoding of each char.
> >
> > The other option (the way I normally do this) is to use String.getBytes(charset), which we could do if there is a global DB charset setting. However, we would run into an issue: if the charset were changed, we might have to rewrite every string in the database.
>
> You're right. Maybe we could use UTF-8/16 as the charset, as a superset of all charsets?
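[As an aside on the charset question above: UTF-8 can encode every Unicode code point, so a UTF-8 round trip through `String.getBytes` is lossless, whereas a legacy single-byte charset silently replaces unmappable characters. A minimal, self-contained Java sketch (not OrientDB code) illustrating the difference:]

```java
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    public static void main(String[] args) {
        String original = "naïve – 日本語";

        // UTF-8 covers all of Unicode: any String survives the round trip.
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String backFromUtf8 = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(original.equals(backFromUtf8));   // true

        // A single-byte charset cannot represent most of Unicode:
        // unmappable characters are silently replaced by '?'.
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);
        String backFromLatin1 = new String(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(original.equals(backFromLatin1)); // false
    }
}
```

[This is why, if a per-database charset can change after data is written, strings stored with the old charset may need rewriting, as Steve notes above.]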
> Lvc@

On 15/05/14 17:32, Luca Garulli wrote:

> Hi Steve,
> OrientDB already has a charset setting at the database level. To change it:
>
>     alter database charset utf-8
>
> Maybe we could treat char like you did with integer: save the bytes if the content doesn't use 2 bytes.
>
> Lvc@

On 15 May 2014 04:17, Steve <[email protected]> wrote:

> I'm just adapting the existing binary field serializers to a modified interface and looking at the existing OStringSerializer. I notice it serializes char by char (i.e. 2 bytes per char). Given that under most charsets the vast majority of text is represented as a single byte per character, I wonder if we could handle this safely using String.getBytes(charset).
>
> The question is: is there a charset that is a superset of all charsets? That is, can we guarantee that the process of serialize/deserialize will never lose or alter data? I'm not really an expert on charsets, so I thought I'd throw this one out there for input.
>
> We could specify a charset per cluster or per DB in the way that MySQL does. It would be a pain for the user to have to specify charsets by default, but if the user is charset-aware then we can neatly sidestep this issue.
>
> Any ideas on the best way to handle this? It would be a shame to double the storage size of every string in the DB if it's not necessary.

On 15/05/14 01:22, Luca Garulli wrote:

> Hi Steve,
> I guessed you were super busy, no problem about it. The Binary Protocol will be the first thing Emanuele will work on, starting from the end of May. Very soon he'll contact you for some information about the last version you pushed. He'll help you integrate your implementation inside OrientDB to let all the test cases pass (thousands).
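[Luca's "save the bytes if the content doesn't use 2 bytes" idea can be sketched as a variable-length char encoding: ASCII chars take one byte, everything else takes an escape byte plus the full two-byte char. This is purely a hypothetical illustration of the idea, not OrientDB's actual wire format; the class and method names are made up:]

```java
import java.io.ByteArrayOutputStream;

public class CompactCharCodec {
    // Chars below 0x80 take one byte (high bit clear); all others take
    // three bytes: an 0xFF escape, then the char big-endian (high, low).
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.write(c);
            } else {
                out.write(0xFF);
                out.write((c >> 8) & 0xFF);
                out.write(c & 0xFF);
            }
        }
        return out.toByteArray();
    }

    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; ) {
            int b = data[i++] & 0xFF;
            if (b == 0xFF) {
                int hi = data[i++] & 0xFF;
                int lo = data[i++] & 0xFF;
                sb.append((char) ((hi << 8) | lo));
            } else {
                sb.append((char) b);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "mostly ASCII with a few ü and 語";
        byte[] encoded = encode(s);
        System.out.println(s.equals(decode(encoded)));       // true: lossless
        System.out.println(encoded.length < 2 * s.length()); // true: beats 2 bytes/char
    }
}
```

[For mostly-ASCII text this roughly halves the size compared with a flat 2-bytes-per-char encoding; in practice, standard UTF-8 achieves the same effect without a custom scheme.]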
> Thanks,
> Lvc@

On 14 May 2014 13:26, Steve <[email protected]> wrote:

> If I read his last email on the subject correctly, he already has.
>
> Again, sorry to Luca for not responding; I missed the email when he sent it.

On 14/05/14 21:19, [email protected] wrote:

> Hi,
>
> This is good news. Now let's hope Luca can find resources for this soon.
>
> Regards,
> -Stefán

On Wednesday, 14 May 2014 11:10:55 UTC, Steve Coughlan wrote:

> Hi Stefan,
>
> Progress has been slow, as I ran into the usual issue: got bogged down in problems, became obsessed, ended up spending far more time than I expected, got grief from my employer for neglecting my work, panicked to catch up, never got back to it ;)
>
> However, I did push an update a couple of days ago. Although many of the extras have not been addressed, I'm now able to persist a binary record inside OrientDB and retrieve it after a restart (proving that it's deserialized from disk, not from cache), which implies also being able to persist the drastically altered schema structure.
>
> Since I had made the field-level serializer pluggable, I've been using jackson-json as the serialization mechanism for easy debugging. Now I need to adjust the existing ODB binary serializers. They all embed the data length in the serialized data, which we don't need to do since we store it in headers, and I've adjusted the interface slightly. So I just need to massage the existing binary serializers a little to fit the new interface and we will be back to full binary serialization.
>
> So... some progress, nowhere near as much as I'd hoped, but now that it actually works inside ODB (before, we could only serialize/deserialize to byte arrays using dummy schema objects) I believe it's at a point where we can get other ODB developers involved to review/test/contribute.
> I've just noticed a post Luca made a while back, which I'd missed, saying he'd employed someone who'll be focussed on this, so I hope we can work together on the rest of the integration. Honestly, integration has been the hardest part. I've learned an awful lot about the internals of ODB the hard way (apologies for the blunt comment, but the documentation is awful and it's very hard to distinguish what is internal vs. public API) and also learned that I've probably only touched a tiny fraction of it.

On 14/05/14 19:40, [email protected] wrote:

> Hi,
>
> Has something newsworthy happened on this? :)
>
> Best regards,
> -Stefán

On Friday, 18 April 2014 13:57:07 UTC, Lvc@ wrote:

> > Slightly different issue, I think. I wasn't clear: I was actually talking about versioning of individual class schemas rather than a global schema version. This is the part that allows modifying the schema while (in some cases) avoiding having to scan/rewrite all records in the class. Although this is a nice feature to have, it's really quite a separate problem from binary serialization, so I decided to treat them as separate issues, since trying to deal with both at once was really bogging me down. Looking at your issue, though, I'd note that my subclasses of OClassImpl and OPropertyImpl are actually immutable once constructed, so this might help the schema-wide immutability.
>
> Good, this would simplify that issue.
>
> > Also realised that per-record compression will be rather easy to do... but that's in the extras bucket, so I'll leave that as a bonus prize once the core functions are sorted and stable.
>
> We already have per-record compression; what do you mean?
>
> > I wasn't aware of this. Perhaps this occurs in the raw database layer of the code? I haven't come across any compression code.
> > If you already have per-record compression, does this negate any potential value in per-field compression? I.e., if (string.length > 1000) compressString()
>
> We compress at the storage level, but always, not with a threshold. This brings no compression benefit in the case of small records, so compression at marshalling time would be preferable: drivers could send compressed records to improve network I/O.
>
> Lvc@

--
You received this message because you are subscribed to the Google Groups "OrientDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.
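[Steve's `compressString()` above is hypothetical. The threshold idea he describes can be sketched with the JDK's built-in Deflater/Inflater: small fields are stored raw (deflate overhead would only inflate them), large fields are compressed, and a one-byte flag records which path was taken. A minimal sketch under those assumptions, not OrientDB's actual compression code:]

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ThresholdFieldCompressor {
    static final int THRESHOLD = 1000; // only compress fields larger than this

    // Returns the payload prefixed with a 1-byte flag: 0 = raw, 1 = deflated.
    static byte[] pack(String value) {
        byte[] raw = value.getBytes(StandardCharsets.UTF_8);
        if (raw.length <= THRESHOLD) {
            byte[] out = new byte[raw.length + 1];
            out[0] = 0;
            System.arraycopy(raw, 0, out, 1, raw.length);
            return out;
        }
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        buf.write(1);
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            buf.write(chunk, 0, deflater.deflate(chunk));
        }
        deflater.end();
        return buf.toByteArray();
    }

    static String unpack(byte[] packed) {
        if (packed[0] == 0) {
            return new String(packed, 1, packed.length - 1, StandardCharsets.UTF_8);
        }
        Inflater inflater = new Inflater();
        inflater.setInput(packed, 1, packed.length - 1);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            while (!inflater.finished()) {
                buf.write(chunk, 0, inflater.inflate(chunk));
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt field payload", e);
        }
        inflater.end();
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 500; i++) sb.append("lorem ipsum ");
        String big = sb.toString();                       // 6000 chars, highly repetitive
        byte[] packed = pack(big);
        System.out.println(packed.length < big.length()); // true: deflate pays off here
        System.out.println(big.equals(unpack(packed)));   // true: lossless round trip
    }
}
```

[Note the trade-off Luca points out: compressing at marshalling time, as here, means drivers could ship the compressed form over the wire, whereas storage-level compression only saves disk.]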
