Re: [orientdb] Schema Driven Binary Serialization - Strings

Luca Garulli Wed, 06 Aug 2014 13:57:01 -0700

Hi,
Absolutely yes! Emanuele is in charge of this. We already have the first
version working in 2.0-SNAPSHOT, but we're still working to reduce the
space it uses.


Emanuele can be more specific, I think the first public beta of this
feature could be next week.

Lvc@



On 6 August 2014 20:29, Stefán <[email protected]> wrote:

> Hi guys,
>
> Have you been able to make some progress on this?
>
> Anxiously awaiting :)
>
> Best regards,
>   -Stefan
>
>
> On Thursday, 15 May 2014 09:05:30 UTC, Steve Coughlan wrote:
>
>>  >maybe we could use UTF-8/16 as the charset, as a superset of all charsets?
>>
>> Which raises the question... Is it safe to assume that UTF-8 IS a
>> superset of all charsets?  My lack of charset expertise showing through
>> here ;)
>>
>>
>> On 15/05/14 19:02, Luca Garulli wrote:
>>
>>   On 15 May 2014 10:00, Steve <[email protected]> wrote:
>>
>>  Is there a way to access this programmatically (without having to do a db
>> query every time)?
>>
>>
>>  you can get it by:
>>
>>  String charset = db.getStorage().getConfiguration().getCharset();
>>
>>  I found OBinarySerializer.bytesToString() and stringToBytes() which
>> appear to use single-byte encoding for characters where it's possible.  I
>> think (but I can't say for certain) that this will result in a
>> charset-agnostic encoding of each char.
>>
>> The other option (the way I normally do this) is to use
>> String.getBytes(charset).  We could do that if there is a global DB
>> charset setting, but we would run into an issue: if the charset were
>> changed, we might have to rewrite every string in the database?
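
A minimal sketch of the concern raised above, using only the JDK: bytes written under one charset and read back under another do not yield the original string, which is why changing a database-wide charset could force stored strings to be re-encoded. The class name, method name and sample text are illustrative assumptions, not OrientDB code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    // Encode with the charset in force at write time, decode with the one in force at read time.
    static String reencode(String s, Charset writeCharset, Charset readCharset) {
        byte[] stored = s.getBytes(writeCharset);
        return new String(stored, readCharset);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café"
        // Same charset on both sides: the string survives.
        System.out.println(original.equals(
                reencode(original, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1))); // true
        // Charset changed between write and read: the 0xE9 byte is not valid UTF-8 on its own.
        System.out.println(original.equals(
                reencode(original, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8))); // false
    }
}
```

Fixing the on-disk encoding independently of the configurable charset would avoid this, at the cost of ignoring the setting for storage purposes.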
>>
>>
>>  You're right, maybe we could use UTF-8/16 as the charset, as a superset
>> of all charsets?
>>
>>  Lvc@
>>
>>
>>
>>
>>
>> On 15/05/14 17:32, Luca Garulli wrote:
>>
>>  Hi Steve,
>> OrientDB already has a charset setting at database level, to change it:
>>
>>  alter database charset utf-8
>>
>>  Maybe we could treat char like you did with integer: save the bits if
>> the content doesn't use 2 bytes.
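
One way to read the suggestion above is a variable-length encoding of each 16-bit char, in the style of a varint: 7 payload bits per byte plus a continuation bit. The sketch below is an assumption about what such a scheme could look like, not the encoding OrientDB adopted; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

class VarCharCodec {
    // Writes each UTF-16 code unit as 1 to 3 bytes, low-order 7 bits first.
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i);
            while (c >= 0x80) {
                out.write((c & 0x7F) | 0x80); // more bytes follow for this char
                c >>>= 7;
            }
            out.write(c); // last byte of this char, high bit clear
        }
        return out.toByteArray();
    }

    // Reverses encode(): accumulates 7-bit groups until a byte without the continuation bit.
    static String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                shift += 7;
            } else {
                sb.append((char) value);
                value = 0;
                shift = 0;
            }
        }
        return sb.toString();
    }
}
```

ASCII text costs one byte per char under this scheme, and because it works on raw UTF-16 code units it round-trips any Java String. UTF-8 achieves a similar saving with a standard format, which is the alternative discussed elsewhere in this thread.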
>>
>>  Lvc@
>>
>>  On 15 May 2014 04:17, Steve <[email protected]> wrote:
>>
>>  I'm just adapting the existing binary field serializers to a modified
>> interface and looking at the existing OStringSerializer.  I notice it
>> serializes char by char (i.e. 2 bytes per char).  Given that under most
>> charsets the vast majority of text is represented as a single byte per
>> character, I wonder if we could handle this safely using
>> String.getBytes(charset).
>>
>> The question is, is there a charset that is a superset of all charsets.
>> i.e. can we guarantee that the process of serialize/deserialize will never
>> lose or alter data.  I'm not really an expert on charsets so I thought I'd
>> throw this one out there for input.
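
The question above can be checked empirically with the JDK alone. The sketch below (class and method names are illustrative) tests whether a string survives an encode/decode round trip under a given charset. UTF-8 can represent every Unicode code point, so well-formed text round-trips; a Java String holding an unpaired surrogate is the one case that does not, because String.getBytes replaces malformed input.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripCheck {
    // True if encoding then decoding with the same charset returns the original string.
    static boolean roundTrips(String s, Charset charset) {
        return s.equals(new String(s.getBytes(charset), charset));
    }

    public static void main(String[] args) {
        String text = "na\u00efve \u4e2d\u6587 \ud83d\ude00"; // Latin, CJK and an emoji
        System.out.println(roundTrips(text, StandardCharsets.UTF_8));      // true
        System.out.println(roundTrips(text, StandardCharsets.ISO_8859_1)); // false: CJK is unmappable
        // An unpaired surrogate is not valid Unicode text; getBytes substitutes '?'.
        System.out.println(roundTrips("\ud83d", StandardCharsets.UTF_8));  // false
    }
}
```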
>>
>> We could specify a charset per cluster or per DB in the way that mysql
>> does.  It would be a pain for the user to have to specify charsets by
>> default.  But if the user is charset aware then we can neatly sidestep this
>> issue.
>>
>> Any ideas on the best way to handle this?  It would be a shame to double
>> the storage size of every string in the DB if it's not necessary.
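
To put numbers on the storage concern, the sketch below compares a fixed 2 bytes per char with the UTF-8 byte length of the same string. The class and method names are illustrative and nothing here is OrientDB code.

```java
import java.nio.charset.StandardCharsets;

class StringSizeComparison {
    // Size when every char is written as 2 bytes, as described for the existing serializer.
    static int fixedWidthSize(String s) {
        return s.length() * 2;
    }

    // Size when the string is written as UTF-8.
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(fixedWidthSize("OrientDB") + " vs " + utf8Size("OrientDB"));         // 16 vs 8
        System.out.println(fixedWidthSize("\u4e2d\u6587") + " vs " + utf8Size("\u4e2d\u6587")); // 4 vs 6
    }
}
```

The second line shows the trade-off: UTF-8 halves ASCII text but takes 3 bytes for most CJK characters, so it is not smaller for every dataset.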
>>
>> On 15/05/14 01:22, Luca Garulli wrote:
>>
>> Hi Steve,
>> I guessed you were super busy, no problem about it. Binary Protocol will
>> be the first thing Emanuele will work on starting from the end of May. Very
>> soon he'll contact you to have some information about the last version you
>> pushed. He'll help you to integrate your implementation inside OrientDB so
>> that all the test cases pass (thousands).
>>
>>  Thanks,
>> Lvc@
>>
>>
>>
>> On 14 May 2014 13:26, Steve <[email protected]> wrote:
>>
>>  If I read his last email on the subject correctly he already has.
>>
>> Again sorry to Luca for not responding, I missed the email when he sent
>> it.
>>
>>
>>
>> On 14/05/14 21:19, [email protected] wrote:
>>
>> Hi,
>>
>>  This is good news, now let's hope Luca can find resources for this soon.
>>
>>  Regards,
>>  -Stefán
>>
>> On Wednesday, 14 May 2014 11:10:55 UTC, Steve Coughlan wrote:
>>
>>  Hi Stefan,
>>
>> Progress has been slow, as I ran into the usual pattern: got bogged
>> down in issues, became obsessed, ended up spending far more time than I
>> expected, got shit from my employer for neglecting my work, panicked
>> to catch up, never got back to it ;)
>>
>> However I did push an update a couple of days ago.  Although many of the
>> extras have not been addressed, I'm now able to persist a binary record
>> inside OrientDB and retrieve it after a restart (proving that it's
>> deserialized from disk not from cache).  Which implies also being able to
>> persist the drastically altered schema structure.
>>
>> Since I had made the field-level serializer pluggable I've been using
>> jackson-json as the serialization mechanism for easy debugging.  Now I need
>> to adjust the existing ODB binary serializers.  They all embed data-length
>> in the serialized data, which we don't need to do since we store it in
>> headers.  And I've adjusted the interface slightly.  So I just need to
>> massage the existing binary serializers a little to fit the new interface
>> and we will be back to full binary serialization.
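
The point about data length can be sketched as follows: if the record header already carries each field's length, the field serializers can write raw bytes with no length prefix of their own. The layout below (field count, then one int length per field, then the field bytes back to back) is a hypothetical illustration, not the format used in the actual implementation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class HeaderLengthRecord {
    // Hypothetical layout: [int count][int length x count][field bytes, concatenated].
    static byte[] write(String... fields) {
        byte[][] data = new byte[fields.length][];
        int total = 4 + 4 * fields.length;
        for (int i = 0; i < fields.length; i++) {
            data[i] = fields[i].getBytes(StandardCharsets.UTF_8); // raw bytes, no embedded length
            total += data[i].length;
        }
        ByteBuffer buf = ByteBuffer.allocate(total);
        buf.putInt(fields.length);
        for (byte[] d : data) {
            buf.putInt(d.length); // lengths live in the header only
        }
        for (byte[] d : data) {
            buf.put(d);
        }
        return buf.array();
    }

    // Reads the header first, then slices each field using the header's lengths.
    static String[] read(byte[] record) {
        ByteBuffer buf = ByteBuffer.wrap(record);
        int count = buf.getInt();
        int[] lengths = new int[count];
        for (int i = 0; i < count; i++) {
            lengths[i] = buf.getInt();
        }
        String[] fields = new String[count];
        for (int i = 0; i < count; i++) {
            byte[] d = new byte[lengths[i]];
            buf.get(d);
            fields[i] = new String(d, StandardCharsets.UTF_8);
        }
        return fields;
    }
}
```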
>>
>> So... some progress, nowhere near as much as I'd hoped but now that it
>> actually works inside ODB (before we could only serialize/deserialize to
>> byte arrays using dummy schema objects) I believe it's at a point where we
>> can get other ODB developers involved to review/test/contribute.
>>
>> I've just noticed a post Luca made a while back, which I missed, saying
>> he'd employed someone who'll be focussed on this, so I hope we can work
>> together on the rest of the integration.  Honestly integration has been the
>> hardest part.  I've learned an awful lot about the internals of ODB the hard
>> way (apologies for the blunt comment but the documentation is awful and it's
>> very hard to distinguish what is internal/public API) and also learned I've
>> probably only touched a tiny fraction of it.
>>
>>
>> On 14/05/14 19:40, [email protected] wrote:
>>
>> Hi,
>>
>>  Has something newsworthy happened on this?  :)
>>
>>  Best regards,
>>   -Stefán
>>
>>
>> On Friday, 18 April 2014 13:57:07 UTC, Lvc@ wrote:
>>
>>
>>  Slightly different issue I think.  I wasn't clear: I was actually talking
>> about versioning of individual class schemas rather than a global schema
>> version.  This is the part that allows modifying the schema and (in some
>> cases) avoids having to scan/rewrite all records in the class.  Although
>> this is a nice feature to have, it's really quite a separate problem from
>> binary serialization, so I decided to treat them as separate issues since
>> trying to deal with both at once was really bogging me down.   Looking at
>> your issue though I'd note that my subclasses of OClassImpl and
>> OPropertyImpl are actually immutable once constructed so this might help
>> the schema-wide immutability.
>>
>>
>>  Good, this would simplify that issue.
>>
>>
>>     Also realised that per record compression will be rather easy to
>> do... But that's in the extras bucket so will leave that as a bonus prize
>> once the core functions are sorted and stable.
>>
>>
>>  We already have per record compression, what do you mean?
>>
>>
>>  I wasn't aware of this.  Perhaps this occurs in the Raw database layer
>> of the code?  I haven't come across any compression code.  If you already
>> have per record compression does this negate any potential value to per
>> field compression?  i.e. if (string.length > 1000) compressString()
>>
>>
>>  We compress at storage level, but always, not with a threshold. This
>> brings no compression benefit for small records, so compression
>> at marshalling time would be preferable: drivers could send compressed
>> records to improve network I/O.
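
A minimal sketch of threshold-based compression at marshalling time, using the JDK's Deflater and Inflater. The 1000-byte threshold echoes the figure floated earlier in the thread; the one-byte flag and the class name are assumptions made for illustration, not OrientDB's wire format.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class ThresholdCompression {
    static final int THRESHOLD = 1000; // hypothetical cut-off, in bytes

    // Output is a flag byte (0 = stored as-is, 1 = deflated) followed by the payload.
    static byte[] pack(byte[] raw) {
        if (raw.length <= THRESHOLD) {
            byte[] out = new byte[raw.length + 1];
            System.arraycopy(raw, 0, out, 1, raw.length); // out[0] stays 0
            return out;
        }
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        bos.write(1);
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            bos.write(chunk, 0, deflater.deflate(chunk));
        }
        deflater.end();
        return bos.toByteArray();
    }

    // Reverses pack(); rejects payloads that are truncated or not valid deflate data.
    static byte[] unpack(byte[] packed) {
        if (packed[0] == 0) {
            return Arrays.copyOfRange(packed, 1, packed.length);
        }
        Inflater inflater = new Inflater();
        inflater.setInput(packed, 1, packed.length - 1);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or invalid compressed payload");
                }
                bos.write(chunk, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("invalid compressed payload", e);
        } finally {
            inflater.end();
        }
        return bos.toByteArray();
    }
}
```

Small records pay one extra byte and skip the compressor entirely, which addresses the "no benefit for small records" point; a driver could apply the same pack step before sending records over the network.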
>>
>>  Lvc@
>>
>>
>>
>>
>> ...
>
>  --
>
> ---
> You received this message because you are subscribed to the Google Groups
> "OrientDB" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>
