Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

Vladimir Ozerov Fri, 28 Jul 2017 04:15:25 -0700

As Pavel mentioned, the marshaller should not be tied to a cache. A
BinaryObject should be self-describing, i.e. it should contain all the
information necessary for unmarshalling. This is an absolute requirement.
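
To make the requirement concrete, here is a minimal sketch of a string that
carries its own encoding. The type codes, the encoding ID table and the class
name are made up for illustration; they are not the values or classes used by
GridBinaryMarshaller.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Ignite code: all constants below are assumptions.
class EncodedStringSketch {
    static final byte STRING = 1;         // assumed code: plain UTF-8 string
    static final byte STRING_ENCODED = 2; // assumed code: string + encoding byte

    // Assumed encoding ID table; the array index is the ID written to the stream.
    static final Charset[] ENCODINGS = {
        StandardCharsets.UTF_8,
        StandardCharsets.ISO_8859_1,
        StandardCharsets.UTF_16LE
    };

    // Layout: type code, [encoding ID only for STRING_ENCODED], int length, payload.
    static byte[] write(String s, int encodingId) {
        byte[] payload = s.getBytes(ENCODINGS[encodingId]);
        boolean utf8 = encodingId == 0;
        ByteBuffer buf = ByteBuffer.allocate((utf8 ? 1 : 2) + 4 + payload.length);
        if (utf8) {
            buf.put(STRING); // existing format, no extra byte
        } else {
            buf.put(STRING_ENCODED);
            buf.put((byte) encodingId);
        }
        buf.putInt(payload.length);
        buf.put(payload);
        return buf.array();
    }

    // Reads the string back using only what is stored in the bytes themselves.
    static String read(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        byte type = buf.get();
        Charset cs;
        if (type == STRING) {
            cs = StandardCharsets.UTF_8;
        } else if (type == STRING_ENCODED) {
            cs = ENCODINGS[buf.get()];
        } else {
            throw new IllegalArgumentException("Unknown type code: " + type);
        }
        byte[] payload = new byte[buf.getInt()];
        buf.get(payload);
        return new String(payload, cs);
    }

    public static void main(String[] args) {
        byte[] data = write("caf\u00e9", 1); // ISO-8859-1
        System.out.println(read(data));      // no cache or metadata lookup needed
    }
}
```

The reader never consults a cache or type metadata, which is the point of the
requirement.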


We will have one extra byte in serialized form, meaning that advantage
of custom encoding will become evident for all strings with length >= 1,
which is perfectly fine. I do not quite understand what we are arguing
about.
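
The arithmetic is easy to check. The sketch below only counts payload bytes
plus the one extra encoding byte; the class and method names are made up, and
it assumes the JDK in use ships the windows-1251 charset (it is not one of the
charsets every Java platform is required to support).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: compares payload sizes only, ignoring type code and length.
class EncodingOverhead {
    // Bytes needed for the string in plain UTF-8.
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed in a custom charset, plus one byte naming the encoding.
    static int encodedSize(String s, Charset cs) {
        return 1 + s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251"); // assumed to be available
        // Cyrillic samples of 1, 2 and 6 characters, written as Unicode escapes
        // so the source file does not depend on the compiler's default encoding.
        String[] samples = {
            "\u041f",
            "\u041f\u0440",
            "\u041f\u0440\u0438\u0432\u0435\u0442"
        };
        for (String s : samples) {
            System.out.println(s.length() + " chars: UTF-8 = " + utf8Size(s)
                + " bytes, windows-1251 + 1 = " + encodedSize(s, cp1251) + " bytes");
        }
    }
}
```

For Cyrillic text the two forms tie at one character, and the custom encoding
is smaller from two characters on, so the extra byte pays for itself almost
immediately.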

As for configuration, we can do it as follows:

1) Add a global encoding, UTF-8 by default.
2) Add a per-cache encoding.
3) Add an encoding to the JDBC and ODBC driver properties.

This should be enough.
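
A rough sketch of how the first two levels could be resolved, assuming the
per-cache value simply overrides the global one. EncodingSettings and its
methods are hypothetical names, not existing Ignite configuration API, and
the driver properties are left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: not an existing Ignite class.
class EncodingSettings {
    // 1) Global encoding, UTF-8 unless changed.
    private Charset globalEncoding = StandardCharsets.UTF_8;

    // 2) Optional per-cache overrides, keyed by cache name.
    private final Map<String, Charset> perCache = new HashMap<>();

    void setGlobalEncoding(Charset cs) {
        globalEncoding = cs;
    }

    void setCacheEncoding(String cacheName, Charset cs) {
        perCache.put(cacheName, cs);
    }

    // Picks the encoding to WRITE with; the reader does not need this,
    // because the chosen encoding is stored alongside the string.
    Charset resolve(String cacheName) {
        return perCache.getOrDefault(cacheName, globalEncoding);
    }

    public static void main(String[] args) {
        EncodingSettings settings = new EncodingSettings();
        settings.setCacheEncoding("people", StandardCharsets.UTF_16LE);
        System.out.println(settings.resolve("people")); // per-cache override
        System.out.println(settings.resolve("orders")); // falls back to global
    }
}
```

The configuration only decides which encoding is used when writing, so it
does not tie unmarshalling to a cache.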

On Fri, Jul 28, 2017 at 11:45, Pavel Tupitsyn <ptupit...@apache.org> wrote:

> Val, of course other options should be available, such as
> BinaryTypeConfiguration,
> and maybe field-level and class-level annotations.
>
> On Thu, Jul 27, 2017 at 9:07 PM, Valentin Kulichenko <
> valentin.kuliche...@gmail.com> wrote:
>
> > Pavel,
> >
> > This forces user to implement Binarylizable for whole type in case they
> > want to change encoding for one-two fields, right? I really don't like
> it,
> > why not add default encoding to BinaryTypeConfiguration?
> >
> > -Val
> >
> > On Thu, Jul 27, 2017 at 7:54 AM, Pavel Tupitsyn <ptupit...@apache.org>
> > wrote:
> >
> > > > 1 byte for every field just for this
> > > GridBinaryMarshaller.STRING data type remains untouched.
> > > We add GridBinaryMarshaller.STRING_ENCODED, which has additional byte
> > for
> > > encoding type.
> > >
> > > This means no overhead for existing code.
> > > I think the most common use case is English, which uses 1 byte per char
> > in
> > > UTF-8.
> > > This is already as fast and compact as possible, and we don't want to
> > > introduce any lookup overhead here.
> > >
> > > And when user knows that their data will be more compact in some
> specific
> > > encoding,
> > > they use some BinaryWriter.writeString overload, which writes a
> different
> > > type code.
> > >
> > > Yes, it also writes an extra byte, but you save a byte per char of the
> > > actual string
> > > (for example, when using Windows-1251 for Russian text), so this does
> not
> > > matter.
> > >
> > > On Thu, Jul 27, 2017 at 5:35 PM, Dmitriy Setrakyan <
> > dsetrak...@apache.org>
> > > wrote:
> > >
> > > > Pavel, what would be the size overhead? Are we adding 1 byte for
> every
> > > > field just for this? If you would like to have this info in the
> binary
> > > > object directly, can we in this case have some bitmap of
> > > field-to-encoding?
> > > >
> > > > D.
> > > >
> > > > On Thu, Jul 27, 2017 at 9:22 AM, Pavel Tupitsyn <
> ptupit...@apache.org>
> > > > wrote:
> > > >
> > > > > > I'm not sure I understand how this "per field" configuration is
> > supposed
> > > > to
> > > > > be implemented.
> > > > > * Marshaller is not tied to a cache. It serializes all kinds of
> > things,
> > > > > like compute job parameters and results.
> > > > > * Raw mode does not involve field names.
> > > > >
> > > > > Also it seems like a complicated and expensive solution - looking
> up
> > > > string
> > > > > format somewhere in the metadata will be slow.
> > > > >
> > > > > "encoded string" data type suggestion from Vladimir looks better to
> > me
> > > > from
> > > > > performance and implementation standpoint.
> > > > >
> > > > > Thanks,
> > > > > Pavel
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Jul 27, 2017 at 5:10 PM, Dmitriy Setrakyan <
> > > > dsetrak...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > On Thu, Jul 27, 2017 at 9:04 AM, Igor Sapego <isap...@apache.org
> >
> > > > wrote:
> > > > > >
> > > > > > > Just a note from the platforms guy:
> > > > > > >
> > > > > > > Solution with table-level configuration is going to be
> > > significantly
> > > > > > > > harder to implement for platforms and ODBC than field-level
> one.
> > > > > > >
> > > > > >
> > > > > > Igor, it seems like you are advocating the per-cell
> configuration,
> > > not
> > > > > > per-field one. The per-field configuration can be defined at the
> > > > > > table/cache level.
> > > > > >
> > > > > > I see your point about C++ and .NET integrations however. Can't
> we
> > > > > provide
> > > > > > this info at node-join time or table-creation time? This way all
> > > nodes
> > > > > will
> > > > > > receive it and you will be able to grab it on different
> platforms.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Also, what about binary objects, which are not stored in cache,
> > > > > > > but being marshalled?
> > > > > > >
> > > > > >
> > > > > > I think the default system encoding should be used here. If we
> > don't
> > > > have
> > > > > > configuration for default encoding, we should add it.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Best Regards,
> > > > > > > Igor
> > > > > > >
> > > > > > > On Wed, Jul 26, 2017 at 7:22 PM, Dmitriy Setrakyan <
> > > > > > dsetrak...@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > On Wed, Jul 26, 2017 at 3:40 AM, Vyacheslav Daradur <
> > > > > > daradu...@gmail.com
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > Encoding must be set on per field basis. This will give
> us
> > as
> > > > > most
> > > > > > > > > flexible
> > > > > > > > > > solution at the cost of 1-byte overhead.
> > > > > > > > >
> > > > > > > > > > Vova, I agree that the encoding should be set on
> per-field
> > > > basis,
> > > > > > but
> > > > > > > > at
> > > > > > > > > > the table level, not at a cell level.
> > > > > > > > >
> > > > > > > > > Dmitriy, Vladimir,
> > > > > > > > > Let's use both approaches :-)
> > > > > > > > > We can add parameter to CacheConfiguration.
> > > > > > > > > > If parameter specifies to use cache level encoding then
> > > marshaller
> > > > > > will
> > > > > > > > use
> > > > > > > > > encoding in a cache,
> > > > > > > > > otherwise marshaller will use per-field encoding.
> > > > > > > > > Of course only if it doesn't complicate the solution.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > I think that it will complicate the solution and will
> > complicate
> > > > the
> > > > > > > > marshalling protocol. The advantage of specifying the
> encoding
> > at
> > > > > > > > table/cache level is that we don't need to add extra encoding
> > > bytes
> > > > > to
> > > > > > > the
> > > > > > > > marshalling protocol.
> > > > > > > >
> > > > > > > > I think Vova was suggesting encoding at the cell level, not
> at
> > > the
> > > > > > field
> > > > > > > > level, which seems to be redundant to me.
> > > > > > > >
> > > > > > > > Vova, do you agree?
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
