I'm just adapting the existing binary field serializers to a modified interface and looking at the existing OStringSerializer. I notice it serializes char by char (i.e. 2 bytes per char). Given that under most charsets the vast majority of text is represented as a single byte per character, I wonder if we could handle this safely using String.getBytes(charset).
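A minimal sketch of what I mean, assuming we picked UTF-8 (which can encode every Unicode code point, so encode/decode round-trips losslessly; the one caveat is that a Java String containing unpaired surrogates isn't valid Unicode, and getBytes would substitute '?'). The class and method names here are purely illustrative, not the actual serializer API:

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {

    // Hypothetical sketch: serialize a String as UTF-8 bytes instead of
    // writing one fixed 2-byte char at a time. UTF-8 covers all of
    // Unicode, so decode(encode(s)) returns an equal String, and
    // ASCII-heavy text costs 1 byte per character instead of 2.
    static byte[] serialize(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String deserialize(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ascii = "plain ascii text";
        String mixed = "na\u00efve \u2013 \u65e5\u672c\u8a9e \uD83D\uDE00"; // accents, CJK, an emoji

        // Round-trip must be lossless for both.
        if (!deserialize(serialize(ascii)).equals(ascii)) throw new AssertionError();
        if (!deserialize(serialize(mixed)).equals(mixed)) throw new AssertionError();

        // Pure ASCII needs exactly 1 byte per char, half of char-by-char storage.
        if (serialize(ascii).length != ascii.length()) throw new AssertionError();
        System.out.println("ascii: " + serialize(ascii).length
                + " bytes vs " + (ascii.length() * 2) + " char-by-char");
    }
}
```

For ASCII-heavy data that halves string storage relative to the current 2-bytes-per-char scheme, at the cost of a slightly more expensive encode step.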
The question is: is there a charset that is a superset of all charsets? i.e. can we guarantee that the process of serialize/deserialize will never lose or alter data? I'm not really an expert on charsets, so I thought I'd throw this one out there for input.

We could specify a charset per cluster or per DB in the way that MySQL does. It would be a pain for the user to have to specify charsets by default, but if the user is charset-aware then we can neatly sidestep this issue. Any ideas on the best way to handle this? It would be a shame to double the storage size of every string in the DB if it's not necessary.

On 15/05/14 01:22, Luca Garulli wrote:
> Hi Steve,
> I guessed you were super busy, no problem about it. Binary Protocol
> will be the first thing Emanuele will work on starting from the end of
> May. Very soon he'll contact you to have some information about the last
> version you pushed. He'll help you to integrate your implementation
> inside OrientDB to let all the test cases pass (thousands).
>
> Thanks,
> Lvc@
>
> On 14 May 2014 13:26, Steve <[email protected]> wrote:
>
>     If I read his last email on the subject correctly, he already has.
>     Again, sorry to Luca for not responding; I missed the email when he
>     sent it.
>
>     On 14/05/14 21:19, [email protected] wrote:
>>     Hi,
>>
>>     This is good news; now let's hope Luca can find resources for this
>>     soon.
>>
>>     Regards,
>>     -Stefán
>>
>>     On Wednesday, 14 May 2014 11:10:55 UTC, Steve Coughlan wrote:
>>
>>         Hi Stefan,
>>
>>         Progress has been slow: I ran into the usual issues, got bogged
>>         down, became obsessed, ended up spending far more time than I
>>         expected, got grief from my employer for neglecting my work,
>>         panicked to catch up, and never got back to it ;)
>>
>>         However, I did push an update a couple of days ago.
>>         Although many of the extras have not been addressed, I'm now
>>         able to persist a binary record inside OrientDB and retrieve it
>>         after a restart (proving that it's deserialized from disk, not
>>         from cache), which implies also being able to persist the
>>         drastically altered schema structure.
>>
>>         Since I had made the field-level serializer pluggable, I've been
>>         using Jackson JSON as the serialization mechanism for easy
>>         debugging. Now I need to adjust the existing ODB binary
>>         serializers. They all embed the data length in the serialized
>>         data, which we don't need to do since we store it in the
>>         headers, and I've adjusted the interface slightly. So I just
>>         need to massage the existing binary serializers a little to fit
>>         the new interface and we will be back to full binary
>>         serialization.
>>
>>         So... some progress, nowhere near as much as I'd hoped, but now
>>         that it actually works inside ODB (before, we could only
>>         serialize/deserialize to byte arrays using dummy schema objects)
>>         I believe it's at a point where we can get other ODB developers
>>         involved to review/test/contribute.
>>
>>         I've just noticed a post Luca made a while back, which I missed,
>>         saying he'd employed someone who'll be focused on this, so I
>>         hope we can work together on the rest of the integration.
>>         Honestly, integration has been the hardest part. I've learned an
>>         awful lot about the internals of ODB the hard way (apologies for
>>         the blunt comment, but the documentation is awful and it's very
>>         hard to distinguish what is internal vs. public API) and also
>>         learned I've probably only touched a tiny fraction of it.
>>
>>         On 14/05/14 19:40, [email protected] wrote:
>>>         Hi,
>>>
>>>         Has something newsworthy happened on this? :)
>>>
>>>         Best regards,
>>>         -Stefán
>>>
>>>         On Friday, 18 April 2014 13:57:07 UTC, Lvc@ wrote:
>>>
>>>             Slightly different issue, I think.
>>>             I wasn't clear: I was actually talking about versioning of
>>>             individual class schemas rather than a global schema
>>>             version. This is the part that allows modifying the schema
>>>             and (in some cases) avoiding having to scan/rewrite all
>>>             records in the class. Although this is a nice feature to
>>>             have, it's really quite a separate problem from binary
>>>             serialization, so I decided to treat them as separate
>>>             issues, since trying to deal with both at once was really
>>>             bogging me down. Looking at your issue, though, I'd note
>>>             that my subclasses of OClassImpl and OPropertyImpl are
>>>             actually immutable once constructed, so this might help
>>>             the schema-wide immutability.
>>>
>>>         Good, this would simplify that issue.
>>>
>>>>             Also realised that per-record compression will be rather
>>>>             easy to do... but that's in the extras bucket, so I'll
>>>>             leave that as a bonus prize once the core functions are
>>>>             sorted and stable.
>>>>
>>>>         We already have per-record compression; what do you mean?
>>>
>>>             I wasn't aware of this. Perhaps this occurs in the Raw
>>>             database layer of the code? I haven't come across any
>>>             compression code. If you already have per-record
>>>             compression, does this negate any potential value of
>>>             per-field compression? i.e.
>>>             if (string.length > 1000) compressString()
>>>
>>>         We compress at storage level, but always, without a threshold.
>>>         This yields no compression benefit for small records, so
>>>         compression at marshalling time would be preferable: drivers
>>>         could send compressed records to improve network I/O.
>>>
>>>         Lvc@
>>>
>>> --
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "OrientDB" group.
>>> To unsubscribe from this group and stop receiving emails from it,
>>> send an email to [email protected].
>>> For more options, visit https://groups.google.com/d/optout.
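To make the threshold idea above concrete: a rough sketch of how a per-field encoder could apply deflate only when the value is large enough for the overhead to pay off. The flag byte, the 1000-byte cutoff, and all names here are hypothetical illustrations, not OrientDB's actual compression API:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FieldCompression {

    // Illustrative cutoff: below this, deflate overhead tends to
    // outweigh any savings, so the field is stored as-is.
    static final int THRESHOLD = 1000;

    // Prefix a flag byte so the decoder knows which path was taken:
    // 0 = stored raw, 1 = deflate-compressed.
    static byte[] encode(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (raw.length <= THRESHOLD) {
            out.write(0);
            out.write(raw, 0, raw.length);
        } else {
            out.write(1);
            Deflater deflater = new Deflater();
            deflater.setInput(raw);
            deflater.finish();
            byte[] buf = new byte[8192];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            deflater.end();
        }
        return out.toByteArray();
    }

    static byte[] decode(byte[] stored) throws Exception {
        if (stored[0] == 0) {
            return Arrays.copyOfRange(stored, 1, stored.length);
        }
        Inflater inflater = new Inflater();
        inflater.setInput(stored, 1, stored.length - 1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] big = "x".repeat(5000).getBytes();
        byte[] stored = encode(big);
        if (!Arrays.equals(decode(stored), big)) throw new AssertionError();
        System.out.println(big.length + " bytes -> " + stored.length + " stored");
    }
}
```

The flag byte also means the scheme stays self-describing per field, which matters for Luca's point about drivers: a client could compress at marshalling time and the server would decode it the same way.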
