Is there a way to access this programmatically (without having to do a db
query every time)?

I found OBinarySerializer.bytesToString() and stringToBytes(), which
appear to use single-byte encoding for characters where possible.
I think (but I can't say for certain) that this results in a charset-
agnostic encoding of each char.

The other option (the way I normally do this) is to use
String.getBytes(charset), which we could do if there is a global DB
charset setting.  However, we'd run into an issue: if the charset were
ever changed, we might have to rewrite every string in the database.
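To illustrate the trade-off, here's a minimal sketch (the class name and sample string are mine, not from OrientDB) showing that a UTF-8 round trip via String.getBytes() is lossless for well-formed text and usually smaller than 2 bytes per char, while a legacy single-byte charset silently corrupts unmappable characters:

```java
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    public static void main(String[] args) {
        // Mostly-ASCII text with an accented char and two CJK chars.
        String s = "hello w\u00f6rld \u4e16\u754c";

        // UTF-8 is variable-width: ASCII stays 1 byte per char, so this
        // is normally smaller than a 2-bytes-per-char serialization.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(s.equals(back));               // true: lossless
        System.out.println(utf8.length + " < " + s.length() * 2);

        // A single-byte charset is NOT a safe superset: unmappable chars
        // are silently replaced with '?', so the round trip is lossy.
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        String lossy = new String(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(s.equals(lossy));              // false: CJK lost
    }
}
```

Note that even UTF-8 isn't a perfect superset of Java strings: an unpaired surrogate char is replaced rather than preserved by getBytes(), which is one argument for keeping a conservative per-char encoding.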


On 15/05/14 17:32, Luca Garulli wrote:
> Hi Steve,
> OrientDB already has a charset setting at the database level; to change it:
>
> alter database charset utf-8
>
> Maybe we could treat char like you did with integer: save the bits if
> the content doesn't use 2 bytes.
>
> Lvc@
>
> On 15 May 2014 04:17, Steve <[email protected]> wrote:
>
>     I'm just adapting the existing binary field serializers to a
>     modified interface and looking at the existing OStringSerializer. 
>     I notice it serializes char by char (i.e. 2 bytes per char). 
>     Given that under most charsets the vast majority of text is
>     represented as a single byte, I wonder if we could handle this
>     safely using String.getBytes(charset).
>
>     The question is: is there a charset that is a superset of all
>     charsets?  I.e. can we guarantee that the process of
>     serialize/deserialize will never lose or alter data?  I'm not
>     really an expert on charsets, so I thought I'd throw this one out
>     there for input.
>
>     We could specify a charset per cluster or per DB in the way that
>     mysql does.  It would be a pain for the user to have to specify
>     charsets by default.  But if the user is charset-aware then we can
>     neatly sidestep this issue.
>
>     Any ideas on the best way to handle this?  It would be a shame to
>     double the storage size of every string in the DB if it's not
>     necessary.
>
>     On 15/05/14 01:22, Luca Garulli wrote:
>>     Hi Steve,
>>     I guessed you were super busy, no problem. Binary
>>     Protocol will be the first thing Emanuele will work on, starting
>>     from the end of May. Very soon he'll contact you to get some
>>     information about the last version you pushed. He'll help you
>>     integrate your implementation inside OrientDB to make all the
>>     test cases pass (thousands of them).
>>
>>     Thanks,
>>     Lvc@
>>
>>
>>
>>     On 14 May 2014 13:26, Steve <[email protected]> wrote:
>>
>>         If I read his last email on the subject correctly he already has.
>>
>>         Again sorry to Luca for not responding, I missed the email
>>         when he sent it.
>>
>>
>>
>>         On 14/05/14 21:19, [email protected] wrote:
>>>         Hi,
>>>
>>>         This is good news; now let's hope Luca can find resources
>>>         for this soon.
>>>
>>>         Regards,
>>>          -Stefán
>>>
>>>         On Wednesday, 14 May 2014 11:10:55 UTC, Steve Coughlan wrote:
>>>
>>>             Hi Stefan,
>>>
>>>             Progress has been slow, as I ran into the usual
>>>             pattern: got bogged down in issues, became obsessed,
>>>             ended up spending far more time than I expected, got in
>>>             the shit with my employer for neglecting my work,
>>>             panicked to catch up, never got back to it ;)
>>>
>>>             However I did push an update a couple of days ago.
>>>             Although many of the extras have not been addressed, I'm
>>>             now able to persist a binary record inside OrientDB and
>>>             retrieve it after a restart (proving that it's
>>>             deserialized from disk, not from cache).  Which implies
>>>             also being able to persist the drastically altered
>>>             schema structure.
>>>
>>>             Since I had made the field-level serializer pluggable,
>>>             I've been using Jackson JSON as the serialization
>>>             mechanism for easy debugging.  Now I need to adjust the existing
>>>             ODB binary serializers.  They all embed data-length in
>>>             the serialized data, which we don't need to do since we
>>>             store it in headers.  And I've adjusted the interface
>>>             slightly.  So I just need to massage the existing binary
>>>             serializers a little to fit the new interface and we
>>>             will be back to full binary serialization.
>>>
>>>             So... some progress, nowhere near as much as I'd hoped,
>>>             but now that it actually works inside ODB (before we
>>>             could only serialize/deserialize to byte arrays using
>>>             dummy schema objects) I believe it's at a point where we
>>>             can get other ODB developers involved to
>>>             review/test/contribute.
>>>
>>>             I've just noticed a post Luca made a while back, which I
>>>             missed, saying he'd employed someone who'll be focussed
>>>             on this, so I hope we can work together on the rest of
>>>             the integration.  Honestly, integration has been the
>>>             hardest part.  I've learned an awful lot about the
>>>             internals of ODB the hard way (apologies for the blunt
>>>             comment, but the documentation is awful and it's very
>>>             hard to distinguish what is internal vs. public API) and
>>>             also learned I've probably only touched a tiny fraction
>>>             of it.
>>>
>>>
>>>             On 14/05/14 19:40, [email protected] wrote:
>>>>             Hi,
>>>>
>>>>             Has something newsworthy happened on this?  :)
>>>>
>>>>             Best regards,
>>>>               -Stefán
>>>>
>>>>
>>>>             On Friday, 18 April 2014 13:57:07 UTC, Lvc@ wrote:
>>>>
>>>>
>>>>                     Slightly different issue I think.  I wasn't
>>>>                     clear: I was actually talking about versioning
>>>>                     of individual class schemas rather than a
>>>>                     global schema version.  This is the part that
>>>>                     allows modifying the schema while (in some
>>>>                     cases) avoiding having to scan/rewrite all
>>>>                     records in the class.  Although this is a nice
>>>>                     feature to have, it's really quite a separate
>>>>                     problem from binary serialization, so I decided
>>>>                     to treat them as separate issues since trying
>>>>                     to deal with both at once was really bogging me
>>>>                     down.  Looking at your issue though, I'd note
>>>>                     that my subclasses of OClassImpl and
>>>>                     OPropertyImpl are actually immutable once
>>>>                     constructed, so this might help with
>>>>                     schema-wide immutability.
>>>>
>>>>
>>>>                 Good, this would simplify that issue.
>>>>                  
>>>>
>>>>>                         Also realised that per record compression
>>>>>                         will be rather easy to do... But that's in
>>>>>                         the extras bucket so will leave that as a
>>>>>                         bonus prize once the core functions are
>>>>>                         sorted and stable.
>>>>>
>>>>>
>>>>>                     We already have per record compression, what
>>>>>                     do you mean?
>>>>
>>>>                     I wasn't aware of this.  Perhaps this occurs in
>>>>                     the Raw database layer of the code?  I haven't
>>>>                     come across any compression code.  If you
>>>>                     already have per-record compression, does this
>>>>                     negate any potential value of per-field
>>>>                     compression?  i.e. if (string.length > 1000)
>>>>                     compressString()
>>>>
>>>>
>>>>                 We compress at the storage level, but always, not
>>>>                 with a threshold. This brings no compression
>>>>                 benefit for small records, so compression at
>>>>                 marshalling time would be preferable: drivers could
>>>>                 send compressed records to improve network I/O.
>>>>
>>>>                 Lvc@
>>>>
>>>>                  
>>>>
>>>>             -- 
>>>>
>>>>             ---
>>>>             You received this message because you are subscribed to
>>>>             the Google Groups "OrientDB" group.
>>>>             To unsubscribe from this group and stop receiving
>>>>             emails from it, send an email to
>>>>             [email protected].
>>>>             For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
>>
>
>
>

