>> But now multiple components
>> independently serialize strings for their needs and use default encoding
>> for this.
>> For example, DirectByteBufferStreamImplV2#writeString,
>> MetaStorage#writeRaw and so on.
We should fix all of them.
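
To illustrate what such a fix could look like, here is a minimal sketch. The class and method names are hypothetical and not taken from the Ignite code base; the only point is that every `getBytes()` / `new String(bytes)` call gets an explicit charset. The two "nodes" are simulated by decoding with a different charset than the one used for encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not an existing Ignite class.
class Utf8RoundTripSketch {
    /** Encodes with an explicit charset, so the result does not depend on file.encoding. */
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes with the same explicit charset. */
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A Cyrillic word, written with escapes so the source file encoding does not matter.
        String original = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // Simulates the current problem: node A encodes with one default charset,
        // node B decodes with another one.
        byte[] sentByNodeA = original.getBytes(StandardCharsets.UTF_8);
        String readByNodeB = new String(sentByNodeA, StandardCharsets.ISO_8859_1);
        System.out.println("mismatched charsets equal: " + original.equals(readByNodeB));

        // With an explicit charset on both sides the string survives.
        System.out.println("explicit UTF-8 equal: " + original.equals(decode(encode(original))));
    }
}
```

With this approach the JVM default encoding of a node stops mattering for these code paths.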

>> BinaryUtils#utf8BytesToStr
Let's use this everywhere.

As for me, I expect far more problems from enforcing a rule that makes a
node fail to join than from making all components use UTF-8.
Some weird cases (surrogate pairs) we can simply not consider at all
(I strongly believe that is OK).
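
To make the surrogate case concrete, here is a small sketch (the class and method names are hypothetical). As far as I understand, a correctly paired surrogate is not a problem: it is one code point and UTF-8 encodes it as four bytes. Only an unpaired surrogate cannot be encoded, and `String#getBytes(Charset)` silently replaces it with `?`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Ignite.
class SurrogateSketch {
    /** Returns true if the string survives a UTF-8 encode/decode cycle unchanged. */
    static boolean roundTrips(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return s.equals(new String(bytes, StandardCharsets.UTF_8));
    }

    /** Returns true if UTF-8 can encode the string without replacing anything. */
    static boolean isEncodable(String s) {
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String paired = "\uD83D\uDE00"; // U+1F600 as a valid high + low surrogate pair
        String lone = "\uD800";         // a high surrogate with no low surrogate after it

        System.out.println(roundTrips(paired)); // true
        System.out.println(roundTrips(lone));   // false: the encoder emits '?' instead
        System.out.println(isEncodable(lone));  // false
    }
}
```

So the only strings we would not consider are the ones that are not valid Unicode text in the first place.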

Mon, 13 Dec 2021 at 15:15, Nikolay Izhikov <nizhi...@apache.org>:

> > Does Java String support all Unicode characters and particularly does it
> > support more characters than UTF-8
>
> It’s not about Java, it’s about UTF-8 standard.
>
> Please, take a look at [1]
>
> > In November 2003, UTF-8 was restricted by RFC 3629 to match the
> > constraints of the UTF-16 character encoding: explicitly prohibiting code
> > points corresponding to the high and low surrogate characters removed more
> > than 3% of the three-byte sequences, and ending at U+10FFFF removed more
> > than 48% of the four-byte sequences and all five- and six-byte sequences.
>
> And [2]
>
> > The definition of UTF-8 prohibits encoding character numbers between
> > U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form
> > (as surrogate pairs) and do not directly represent characters.
>
> Actually, we already have some modes to support this restriction of UTF-8.
> Please, take a look at BinaryUtils#utf8BytesToStr [3]
>
>
> [1] https://en.wikipedia.org/wiki/UTF-8
> [2] https://datatracker.ietf.org/doc/html/rfc3629
> [3]
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387
>
> > On 13 Dec 2021, at 13:57, Ivan Pavlukhin <vololo...@gmail.com> wrote:
> >
> >> UTF-8 can’t encode all UNICODE characters.
> >
> > Nikolay, could you please elaborate? My understanding is that encoding
> > we speak about matters for conversion from byte arrays to strings.
> > Does Java String support all unicode characters and particularly does
> > it support more characters than UTF-8 (I am not saying here that java
> > String uses UTF-8)?
> >
> > 2021-12-13 12:56 GMT+03:00, Ivan Daschinsky <ivanda...@gmail.com>:
> >> UTF-8 is already the default encoding in our BinaryObject format. So...
> >> I am for unification.
> >>
> >> Mon, 13 Dec 2021 at 12:50, Nikolay Izhikov <nizhi...@apache.org>:
> >>
> >>> Hello, Ivan.
> >>>
> >>> UTF-8 can’t encode all UNICODE characters.
> >>>
> >>>> On 13 Dec 2021, at 12:49, Ivan Daschinsky <ivanda...@gmail.com> wrote:
> >>>>
> >>>> Hm, maybe a better option is to enforce that all strings are encoded
> >>>> in UTF-8?
> >>>> AFAIK a multi-OS cluster is quite a common case.
> >>>>
> >>>>
> >>>> Mon, 13 Dec 2021 at 11:36, Mikhail Petrov <pmgheap....@gmail.com>:
> >>>>
> >>>>> Igniters,
> >>>>>
> >>>>> Recently we faced the problem that many issues arise if the cluster
> >>>>> consists of nodes running in JVMs with different encodings.
> >>>>> The root cause of these issues is components that use
> >>>>> `String#getBytes()` and `new String(<byte array>)`, which rely on
> >>>>> the system default encoding. Thus, if a string is deserialized on a
> >>>>> node with a different encoding from the one that serialized it, the
> >>>>> deserialized string can differ from the original one.
> >>>>>
> >>>>> For example:
> >>>>>
> >>>>> Serialization/deserialization of strings in communication messages
> >>>>> may be broken for some strings on nodes running in JVMs with
> >>>>> different encodings, as DirectByteBufferStreamImplV2 uses
> >>>>> String#getBytes() to serialize strings - [1]
> >>>>>
> >>>>> Or the IgniteAuthenticationProcessor can compute different security
> >>>>> IDs for the user on different nodes in this case - [2]
> >>>>>
> >>>>> What do you think about solving this problem globally, by refusing
> >>>>> to let nodes join if they run on JVMs with different encodings?
> >>>>>
> >>>>> As a result, we will be sure that all cluster nodes have the same
> >>>>> encoding and all related problems will be solved.
> >>>>>
> >>>>> [1] - https://issues.apache.org/jira/browse/IGNITE-16106
> >>>>> [2] - https://issues.apache.org/jira/browse/IGNITE-16068
> >>>>>
> >>>>> --
> >>>>> Mikhail
> >>>>>
> >>>>>
> >>>>
> >>>> --
> >>>> Sincerely yours, Ivan Daschinskiy
> >>>
> >>>
> >>
> >> --
> >> Sincerely yours, Ivan Daschinskiy
> >>
> >
> >
> > --
> >
> > Best regards,
> > Ivan Pavlukhin
>
>

-- 
Sincerely yours, Ivan Daschinskiy
