Re: [DISCUSSION] Reject join of nodes with different character encodings

Mikhail Petrov Mon, 13 Dec 2021 04:14:42 -0800

Ivan Daschinsky,

better variant is to enforce all strings to be encoded in UTF-8

I agree that this is a possible way to go. But currently multiple
components independently serialize strings for their own needs and use
the default encoding for this, for example
DirectByteBufferStreamImplV2#writeString, MetaStorage#writeRaw and so
on. Even if we fix all these cases, we cannot guarantee that the problem
described above will not arise again.

Also, it seems easy for the user to specify the encoding for the Ignite
Java process manually, through the `file.encoding` system property.


Ivan Pavlukhin,

I guess Nikolay is talking about the problem with UTF-8 when a string
contains unpaired surrogate code units (surrogates are used for encoding
in UTF-16). In this case the string cannot be serialized to UTF-8
correctly, since unpaired surrogates are forbidden in UTF-8. However,
this problem was solved for the binary marshaller: see
`BinaryWriterExImpl#doWriteString` and `BinaryUtils#strToUtf8Bytes`.
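
To illustrate the unpaired surrogate problem, here is a minimal sketch
that uses only the standard JDK encoder (`String#getBytes(Charset)`), not
the Ignite methods named above. The class and method names are made up
for this example. The standard encoder replaces a lone surrogate with
`?`, so the round trip loses data:

```java
import java.nio.charset.StandardCharsets;

final class UnpairedSurrogateDemo {
    /** Encodes a string to UTF-8 with the standard JDK encoder and decodes it back. */
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A valid surrogate pair (U+1F600) survives the round trip.
        String paired = "\uD83D\uDE00";
        // A lone high surrogate between 'a' and 'b' is not valid Unicode text.
        String unpaired = "a" + "\uD83D" + "b";

        System.out.println(paired.equals(roundTrip(paired)));     // true
        System.out.println(unpaired.equals(roundTrip(unpaired))); // false
        System.out.println(roundTrip(unpaired));                  // a?b
    }
}
```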


On 13.12.2021 13:57, Ivan Pavlukhin wrote:

UTF-8 can’t encode all UNICODE characters.

Nikolay, could you please elaborate? My understanding is that encoding
we speak about matters for conversion from byte arrays to strings.
Does Java String support all unicode characters and particularly does
it support more characters than UTF-8 (I am not saying here that java
String uses UTF-8)?

2021-12-13 12:56 GMT+03:00, Ivan Daschinsky <ivanda...@gmail.com>:

UTF-8 is already a default encoding in our BinaryObject format. So.... I am
for unification.

Mon, 13 Dec 2021 at 12:50, Nikolay Izhikov <nizhi...@apache.org>:

Hello, Ivan.

UTF-8 can’t encode all UNICODE characters.

On 13 Dec 2021, at 12:49, Ivan Daschinsky <ivanda...@gmail.com> wrote:

Khm, maybe a better variant is to enforce all strings to be encoded in
UTF-8?
AFAIK a multi-OS cluster is quite a common case.


Mon, 13 Dec 2021 at 11:36, Mikhail Petrov <pmgheap....@gmail.com>:

Igniters,

Recently we faced the problem that if the cluster consists of nodes
running in the JVM with different encodings, many issues arise.
The root cause of the mentioned issues is components that use
`String#getBytes()` and `new String(<byte array>)`, which rely on the
system default encoding. Thus, if a string is deserialized on a node
with a different encoding from the one that serialized it, the
deserialized string can be different from the original one.
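
The effect can be reproduced without changing the JVM default by passing
the charsets explicitly. This is only a sketch: the class and the
`transfer` helper are hypothetical, and the two charsets stand in for the
defaults of a sending and a receiving node.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetMismatchDemo {
    /**
     * Simulates sending a string between two nodes: the sender encodes with its
     * charset (as String#getBytes() would with the default one) and the receiver
     * decodes with its own (as new String(byte[]) would).
     */
    static String transfer(String s, Charset sender, Charset receiver) {
        byte[] wire = s.getBytes(sender);
        return new String(wire, receiver);
    }

    public static void main(String[] args) {
        // "Privet" in Cyrillic, written with escapes so the source file encoding does not matter.
        String original = "\u041F\u0440\u0438\u0432\u0435\u0442";

        String same = transfer(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mixed = transfer(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(original.equals(same));  // true
        System.out.println(original.equals(mixed)); // false: each UTF-8 byte became one char
        System.out.println(mixed.length());         // 12 instead of 6
    }
}
```

Pure ASCII strings survive such a transfer, which is why the problem only
shows up for some strings.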

For example:

Serialization/deserialization of string in communication messages may be
broken for some strings on nodes running in a JVM with a different
encoding as DirectByteBufferStreamImplV2 uses String#getBytes() to
serialize strings - [1]

Or the IgniteAuthenticationProcessor can compute different security IDs
for the user on different nodes in this case - [2]

What do you think about solving this problem globally, by rejecting the
join of nodes that run on JVMs with different encodings?

As a result, we will be sure that all cluster nodes have the same
encoding and all related problems will be solved.
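
A rough sketch of what such a join check could look like. This is not
Ignite code: `EncodingJoinCheck` and `validate` are hypothetical names,
and how the joining node's encoding name reaches the cluster is left out.
Charset names are resolved through `Charset.forName` so that differences
in letter case or aliases are not reported as a mismatch.

```java
import java.nio.charset.Charset;
import java.util.Optional;

final class EncodingJoinCheck {
    /**
     * Compares the cluster's charset with the one reported by a joining node.
     *
     * @return an error message if the node should be rejected, empty otherwise.
     */
    static Optional<String> validate(String clusterCharsetName, String joiningCharsetName) {
        Charset cluster;
        Charset joining;
        try {
            cluster = Charset.forName(clusterCharsetName);
            joining = Charset.forName(joiningCharsetName);
        } catch (IllegalArgumentException e) {
            // Illegal name or a charset this JVM does not support: treat as a mismatch.
            return Optional.of("Cannot resolve charset: " + e.getMessage());
        }
        if (cluster.equals(joining)) {
            return Optional.empty();
        }
        return Optional.of("Joining node uses " + joining.name()
            + " but the cluster uses " + cluster.name());
    }

    public static void main(String[] args) {
        String local = Charset.defaultCharset().name();
        System.out.println(validate(local, local).isPresent());               // false
        System.out.println(validate("UTF-8", "ISO-8859-1").orElse("joined")); // prints the mismatch message
    }
}
```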

[1] - https://issues.apache.org/jira/browse/IGNITE-16106
[2] - https://issues.apache.org/jira/browse/IGNITE-16068

--
Mikhail

--
Sincerely yours, Ivan Daschinskiy

