Re: Custom string encoding

Vladimir Ozerov Sat, 01 Jul 2017 23:53:38 -0700

Valya,

Personally, I vote against this feature. BinaryConfiguration has proven to be
inconvenient: it has to be configured before node start, it cannot be
changed at runtime, and it requires classes on the server. Moreover, if you
decide to change the encoding at some point, that would be impossible.


I think we should add this feature at the API level instead. If a string is
written in a non-UTF-8 encoding, we will write it in a different format:
[encoding_code][string]

BinaryWriter.writeString(String fieldName, String val);
BinaryWriter.writeString(String fieldName, String val, *String encoding*);

BinaryReader.readString(String fieldName);
BinaryReader.readString(String fieldName, *String encoding*);

BinaryObjectBuilder.writeString(String fieldName, String val, *String encoding*);
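
To illustrate the [encoding_code][string] layout, here is a minimal sketch in
plain Java. It is not Ignite code: the class name, the one-byte code values,
and the 4-byte length prefix are all assumptions made for the example, since
the thread does not define them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the proposed layout; not Ignite's actual binary format.
final class EncodedStringSketch {
    // Assumed one-byte encoding codes; the thread does not specify any values.
    static final byte CODE_UTF8 = 0;
    static final byte CODE_CP1251 = 1;

    // Maps an assumed encoding code to a JDK charset.
    static Charset charsetFor(byte code) {
        switch (code) {
            case CODE_UTF8:
                return StandardCharsets.UTF_8;
            case CODE_CP1251:
                return Charset.forName("windows-1251");
            default:
                throw new IllegalArgumentException("Unknown encoding code: " + code);
        }
    }

    // Writes [encoding_code: 1 byte][length: 4 bytes][string bytes].
    // The length prefix is an assumption added so the value can be read back.
    static byte[] write(String val, byte code) {
        byte[] body = val.getBytes(charsetFor(code));
        return ByteBuffer.allocate(1 + 4 + body.length)
            .put(code)
            .putInt(body.length)
            .put(body)
            .array();
    }

    // Reads the code first, then decodes the payload with the matching charset.
    static String read(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        byte code = buf.get();
        byte[] body = new byte[buf.getInt()];
        buf.get(body);
        return new String(body, charsetFor(code));
    }
}
```

Because the code travels with each value, a reader needs no node-level
configuration to decode it, and different fields can use different encodings.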

class MyClass {
    *@BinaryString(encoding = "Cp1251")*
    private String myCyrillicString;
}
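
As a rough sketch of how such an annotation could be declared and looked up
via reflection: the annotation definition and the BinaryStringLookup helper
below are hypothetical and exist only for this example; nothing like them is
confirmed to exist in Ignite.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical annotation mirroring the @BinaryString idea from this message.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface BinaryString {
    String encoding();
}

final class MyClass {
    @BinaryString(encoding = "Cp1251")
    private String myCyrillicString;

    // Not annotated, so the lookup below falls back to UTF-8.
    private String plainString;
}

// Hypothetical helper: resolves the charset a marshaller could use for a field.
final class BinaryStringLookup {
    static Charset charsetOf(Class<?> cls, String fieldName) {
        try {
            Field f = cls.getDeclaredField(fieldName);
            BinaryString ann = f.getAnnotation(BinaryString.class);
            return ann == null ? StandardCharsets.UTF_8 : Charset.forName(ann.encoding());
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("No such field: " + fieldName, e);
        }
    }
}
```

The annotation needs RUNTIME retention; otherwise it would not be visible to
reflection when the object is serialized.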

Vladimir.

On Sat, Jul 1, 2017 at 7:26 PM, Dmitriy Setrakyan <[email protected]>
wrote:

> On Sat, Jul 1, 2017 at 2:24 AM, Sergi Vladykin <[email protected]>
> wrote:
>
> > In SQL indexes we may store partial strings and assume them to be in
> > UTF-8; I don't think this can be abstracted away. But maybe this is not a
> > big deal if we still use UTF-8 in indexes.
> >
>
> Sergi, why does it matter if it is UTF8 or custom encoding? Why can't we
> use our own compact encoding in indexes?
>
>
> >
> > 2017-07-01 10:13 GMT+03:00 Dmitriy Setrakyan <[email protected]>:
> >
> > > Val, do you know how we compare strings in SQL queries? Will we be able
> > > to use this encoder?
> > >
> > > Additionally, I think that the encoder is a bit too abstract. Why not go
> > > even further and allow users to create their own ASCII table for encoding?
> > >
> > > D.
> > >
> > > On Fri, Jun 30, 2017 at 6:49 PM, Valentin Kulichenko <
> > > [email protected]> wrote:
> > >
> > > > Andrey,
> > > >
> > > > Can you elaborate more on this? What is your concern?
> > > >
> > > > -Val
> > > >
> > > > On Fri, Jun 30, 2017 at 6:17 PM Andrey Mashenkov <
> > > > [email protected]>
> > > > wrote:
> > > >
> > > > > Val,
> > > > >
> > > > > > This looks like it makes sense.
> > > > > >
> > > > > > This will not affect the FullText index, as Lucene has its own
> > > > > > format for storing data.
> > > > > >
> > > > > > But would it be compatible with H2 indexing? I doubt it.
> > > > >
> > > > > > On July 1, 2017, at 2:27, "Valentin Kulichenko" <
> > > > > > [email protected]> wrote:
> > > > >
> > > > > > Folks,
> > > > > >
> > > > > > > Currently the binary marshaller always encodes strings in UTF-8.
> > > > > > > However, sometimes it can be useful to customize this. For
> > > > > > > example, if data contains a lot of Cyrillic, Chinese or other
> > > > > > > symbols, but not so many Latin symbols, memory is used very
> > > > > > > inefficiently. In this case it would be great to encode the most
> > > > > > > frequently used symbols in one byte instead of two or three.
> > > > > > >
> > > > > > > I propose to introduce a BinaryStringEncoder interface that will
> > > > > > > convert strings to byte arrays and back, and make it pluggable
> > > > > > > via BinaryConfiguration. This will allow users to plug in any
> > > > > > > encoding algorithms based on their requirements.
> > > > > >
> > > > > > Thoughts?
> > > > > >
> > > > > > https://issues.apache.org/jira/browse/IGNITE-5655
> > > > > >
> > > > > > -Val
> > > > > >
> > > > >
> > > >
> > >
> >
>
