On 15 May 2014 10:00, Steve <[email protected]> wrote:

> Is there a way to access this programmatically (without having to do a db
> query every time)?
you can get it by:

    String charset = db.getStorage().getConfiguration().getCharset();

> I found OBinarySerializer.bytesToString() and stringToBytes(), which
> appear to use a single-byte encoding for characters where possible. I
> think (but I can't say for certain) that this will result in a
> charset-agnostic encoding of each char.
>
> The other option (the way I normally do this) is to use
> String.getBytes(charset), which we could do if there is a global DB
> charset setting. However, we would run into an issue where, if the
> charset were changed, we might have to rewrite every string in the
> database?

You're right. Maybe we could use UTF-8/16 as the charset, as a superset of
all charsets?

Lvc@

> On 15/05/14 17:32, Luca Garulli wrote:
>
> Hi Steve,
> OrientDB already has a charset setting at database level. To change it:
>
>     alter database charset utf-8
>
> Maybe we could treat char like you did with integer: save the bits if
> the content doesn't use 2 bytes.
>
> Lvc@
>
> On 15 May 2014 04:17, Steve <[email protected]> wrote:
>
>> I'm just adapting the existing binary field serializers to a modified
>> interface and looking at the existing OStringSerializer. I notice it
>> serializes char by char (i.e. 2 bytes per char). Given that under most
>> charsets the vast majority of text is represented as a single byte per
>> char, I wonder if we could handle this safely using
>> String.getBytes(charset).
>>
>> The question is: is there a charset that is a superset of all charsets?
>> I.e., can we guarantee that the process of serialize/deserialize will
>> never lose or alter data? I'm not really an expert on charsets, so I
>> thought I'd throw this one out there for input.
>>
>> We could specify a charset per cluster or per DB in the way that MySQL
>> does. It would be a pain for users to have to specify charsets by
>> default, but if the user is charset-aware then we can neatly sidestep
>> this issue.
>>
>> Any ideas on the best way to handle this?
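The String.getBytes(charset) idea under discussion can be sketched as below. This is a minimal standalone example, not OrientDB code: `stringToBytes`/`bytesToString` are hypothetical helpers assuming the database-level charset is UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    // Hypothetical helpers, assuming the database-level charset is UTF-8.
    // This replaces OStringSerializer's 2-bytes-per-char scheme with a
    // variable-width encoding.
    static byte[] stringToBytes(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    static String bytesToString(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 is a "superset" in the sense that it can encode every
        // Unicode code point, so well-formed strings round-trip losslessly.
        String text = "héllo \u4f60\u597d";
        System.out.println(bytesToString(stringToBytes(text)).equals(text)); // true

        // For mostly-ASCII text this roughly halves the storage cost:
        String ascii = "plain ascii text";
        System.out.println(stringToBytes(ascii).length); // 16 bytes in UTF-8
        System.out.println(ascii.length() * 2);          // 32 bytes at 2 bytes/char
    }
}
```

One caveat on the "never lose or alter data" question: a Java String containing an unpaired surrogate is not well-formed Unicode, and getBytes with UTF-8 silently replaces it with '?', so the lossless guarantee holds only for valid Unicode strings.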
>> It would be a shame to double the storage size of every string in the DB
>> if it's not necessary.
>>
>> On 15/05/14 01:22, Luca Garulli wrote:
>>
>> Hi Steve,
>> I guessed you were super busy, no problem about it. The binary protocol
>> will be the first thing Emanuele will work on, starting from the end of
>> May. Very soon he'll contact you for some information about the last
>> version you pushed. He'll help you integrate your implementation inside
>> OrientDB so that all the test cases pass (thousands).
>>
>> Thanks,
>> Lvc@
>>
>> On 14 May 2014 13:26, Steve <[email protected]> wrote:
>>
>>> If I read his last email on the subject correctly, he already has.
>>>
>>> Again, sorry to Luca for not responding; I missed the email when he
>>> sent it.
>>>
>>> On 14/05/14 21:19, [email protected] wrote:
>>>
>>> Hi,
>>>
>>> This is good news. Now let's hope Luca can find resources for this soon.
>>>
>>> Regards,
>>> -Stefán
>>>
>>> On Wednesday, 14 May 2014 11:10:55 UTC, Steve Coughlan wrote:
>>>>
>>>> Hi Stefan,
>>>>
>>>> Progress has been slow, as I ran into the usual issue: got bogged down
>>>> in problems, became obsessed, ended up spending far more time than I
>>>> expected, got grief from my employer for neglecting my work, panicked
>>>> to catch up, and never got back to it ;)
>>>>
>>>> However, I did push an update a couple of days ago. Although many of
>>>> the extras have not been addressed, I'm now able to persist a binary
>>>> record inside OrientDB and retrieve it after a restart (proving that
>>>> it's deserialized from disk, not from cache). Which implies also being
>>>> able to persist the drastically altered schema structure.
>>>>
>>>> Since I had made the field-level serializer pluggable, I've been using
>>>> Jackson JSON as the serialization mechanism for easy debugging. Now I
>>>> need to adjust the existing ODB binary serializers.
>>>> They all embed the data length in the serialized data, which we don't
>>>> need to do since we store it in headers. And I've adjusted the
>>>> interface slightly. So I just need to massage the existing binary
>>>> serializers a little to fit the new interface and we will be back to
>>>> full binary serialization.
>>>>
>>>> So... some progress, nowhere near as much as I'd hoped, but now that
>>>> it actually works inside ODB (before, we could only
>>>> serialize/deserialize to byte arrays using dummy schema objects) I
>>>> believe it's at a point where we can get other ODB developers involved
>>>> to review/test/contribute.
>>>>
>>>> I've just noticed a post Luca made a while back, which I missed, that
>>>> he'd employed someone who'll be focused on this, so I hope we can work
>>>> together on the rest of the integration. Honestly, integration has
>>>> been the hardest part. I've learned an awful lot about the internals
>>>> of ODB the hard way (apologies for the blunt comment, but the
>>>> documentation is awful and it's very hard to distinguish what is
>>>> internal vs. public API) and also learned I've probably only touched a
>>>> tiny fraction of it.
>>>>
>>>> On 14/05/14 19:40, [email protected] wrote:
>>>>
>>>> Hi,
>>>>
>>>> Has something newsworthy happened on this? :)
>>>>
>>>> Best regards,
>>>> -Stefán
>>>>
>>>> On Friday, 18 April 2014 13:57:07 UTC, Lvc@ wrote:
>>>>>
>>>>>> Slightly different issue, I think. I wasn't clear: I was actually
>>>>>> talking about versioning of individual class schemas rather than a
>>>>>> global schema version. This is the part that allows you to modify
>>>>>> the schema and (in some cases) avoid having to scan/rewrite all
>>>>>> records in the class. Although this is a nice feature to have, it's
>>>>>> really quite a separate problem from binary serialization, so I
>>>>>> decided to treat them as separate issues, since trying to deal with
>>>>>> both at once was really bogging me down.
>>>>>> Looking at your issue, though, I'd note that my subclasses of
>>>>>> OClassImpl and OPropertyImpl are actually immutable once
>>>>>> constructed, so this might help with schema-wide immutability.
>>>>>
>>>>> Good, this would simplify that issue.
>>>>>
>>>>>>> Also realised that per-record compression will be rather easy to
>>>>>>> do... But that's in the extras bucket, so I'll leave that as a
>>>>>>> bonus prize once the core functions are sorted and stable.
>>>>>>
>>>>>> We already have per-record compression, what do you mean?
>>>>>>
>>>>>> I wasn't aware of this. Perhaps this occurs in the raw database
>>>>>> layer of the code? I haven't come across any compression code. If
>>>>>> you already have per-record compression, does this negate any
>>>>>> potential value of per-field compression? I.e.:
>>>>>> if (string.length > 1000) compressString()
>>>>>
>>>>> We compress at storage level, but always, not with a threshold. This
>>>>> brings no compression benefit in the case of small records, so
>>>>> compression at marshalling time would be preferable: drivers could
>>>>> send compressed records to improve network I/O.
>>>>>
>>>>> Lvc@
>>>>
>>>> --
>>>>
>>>> ---
>>>> You received this message because you are subscribed to the Google
>>>> Groups "OrientDB" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> For more options, visit https://groups.google.com/d/optout.
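The threshold idea mentioned above (`if (string.length > 1000) compressString()`) could look something like this at marshalling time. This is a sketch only: `FieldCompression`, the threshold value, and the helper names are made up for illustration, using java.util.zip rather than any OrientDB compression API. A real format would also need a flag (e.g. in the record header) to mark whether a field was compressed; that bookkeeping is omitted here.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FieldCompression {
    // Hypothetical threshold: small fields are stored raw, since
    // compressing them costs CPU for little or no size benefit.
    static final int THRESHOLD = 1000;

    static byte[] maybeCompress(byte[] raw) {
        if (raw.length <= THRESHOLD) {
            return raw; // below threshold: store as-is
        }
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] small = "short field".getBytes();
        System.out.println(maybeCompress(small) == small); // true: stored raw

        byte[] large = "x".repeat(2000).getBytes();
        byte[] packed = maybeCompress(large);
        System.out.println(packed.length < large.length);  // true: compressed
        System.out.println(new String(decompress(packed)).equals("x".repeat(2000))); // true
    }
}
```

Doing this at marshalling time matches the point above: drivers could ship already-compressed records over the wire, rather than compressing only at the storage layer.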
