What we tried to achieve is that several encodings could co-exist in a single cluster, or even in a single cache. This would be great from a UX perspective. However, from what Andrey wrote, I understand that this would be pretty hard to achieve, as we rely heavily on identical binary representations of the objects being compared. So while this could work for SQL with some adjustments, we would have severe problems with BinaryObject.equals().
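To make the problem concrete, here is a minimal standalone illustration (not Ignite code; the class name is made up) of why byte-wise comparison breaks: the same logical string marshalled under Cp1251 and under UTF-8 produces different byte sequences, so any lookup or equals() based on the binary form cannot match.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MixedEncodingMismatch {
    public static void main(String[] args) {
        String key = "ключ"; // a non-ASCII key ("key" in Russian)

        // Cp1251 encodes Cyrillic letters as 1 byte each, UTF-8 as 2 bytes each.
        byte[] cp1251Bytes = key.getBytes(Charset.forName("windows-1251")); // 4 bytes
        byte[] utf8Bytes = key.getBytes(StandardCharsets.UTF_8);            // 8 bytes

        // Byte-wise comparison of the two marshalled forms fails,
        // even though the strings are logically equal.
        System.out.println(Arrays.equals(cp1251Bytes, utf8Bytes)); // prints false
    }
}
```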
Let's think about how we can resolve this. I see two options:

1) Allow only a single encoding in the whole cluster. Easy to implement, but very bad from a usability perspective. This would especially affect clients - client nodes and, what is worse, drivers and thin clients! They would all have to care about which encoding to use. But maybe we can share this information during the handshake (as every client has a handshake).

2) Add a custom encoding flag/ID to the object header if a non-standard encoding appears anywhere inside the object (even in nested objects). This way we will be able to re-create the object if the expected and actual encodings don't match. For example, suppose we have two caches/tables with different encodings (not implemented in the current iteration, but we may decide to implement per-cache encodings in the future, as virtually any RDBMS supports it). Then I decide to move object A from cache 1 with UTF-8 encoding to cache 2 with Cp1251 encoding. In this case I will detect the encoding mismatch through the object header (or footer) and re-build the object transparently for the user.

The second option is preferable to me as a long-term solution, but it would require more effort.

Thoughts?

On Wed, Sep 6, 2017 at 3:33 AM, Dmitriy Setrakyan <[email protected]> wrote:
> Can we just detect the encoding at cache, or at least column level? This
> way if the encoding does not match, we throw an exception immediately.
>
> Will it work?
>
> D.
>
> On Tue, Sep 5, 2017 at 9:16 AM, Andrey Kuznetsov <[email protected]>
> wrote:
>
> > Hi Igniters!
> >
> > I met a couple of issues related to different binary string encoding
> > settings on different cluster nodes.
> >
> > Let cluster has two nodes. Node0 uses win-1251 to marshal strings with
> > BinaryMarshaller and Node1 uses default utf-8 encoding. Let's create
> > replicated cache and add some entry to Node0:
> >
> > node0.cache("myCache").put("k", "v");
> >
> > Then
> >
> > node1.cache("myCache").get("k")
> >
> > returns null.
> >
> > Let me describe the cause. First, string key comes to Node1 as binary
> > payload of DHT update request, it has win-1251 encoding. This
> > representation stays in offheap area of Node1. Then GetTask comes with the
> > same key, plain (Serializable) Java object; BinaryMarshaller encodes the
> > key using utf-8 (Node1 setting). Finally, B+Tree lookup fails for this
> > binary key due to different encodings.
> >
> > When the key is just a string then this can be fixed by decoding binary
> > strings entirely on B+Tree lookups. But when the key is an arbitrary object
> > with some strings inside this way is too expensive.
> >
> > The second issue relates to lossy string encodings. Mixed-encoding cluster
> > does not guarantee string data integrity when "lossless" node goes down for
> > a while.
> >
> > Any ideas on addressing these issues?
> >
> > --
> > Best regards,
> > Andrey Kuznetsov.
> >
>
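P.S. A rough sketch of what I mean by option 2, i.e. carrying an encoding ID in the object header and re-building the payload transparently on mismatch. All names here (ENCODING_UTF8, reencode, etc.) are purely illustrative, not existing Ignite APIs:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingHeaderSketch {
    // Illustrative one-byte encoding IDs that an object header/footer could carry.
    static final byte ENCODING_UTF8 = 0;
    static final byte ENCODING_CP1251 = 1;

    static Charset charsetOf(byte id) {
        return id == ENCODING_CP1251
            ? Charset.forName("windows-1251")
            : StandardCharsets.UTF_8;
    }

    /** Re-build the string payload if the stored encoding differs from the expected one. */
    static byte[] reencode(byte[] payload, byte actual, byte expected) {
        if (actual == expected)
            return payload; // Fast path: representations are already comparable byte-for-byte.

        // Mismatch detected via the header: decode with the actual charset,
        // re-encode with the expected one, transparently for the user.
        String s = new String(payload, charsetOf(actual));
        return s.getBytes(charsetOf(expected));
    }

    public static void main(String[] args) {
        byte[] stored = "значение".getBytes(Charset.forName("windows-1251"));
        byte[] rebuilt = reencode(stored, ENCODING_CP1251, ENCODING_UTF8);

        // The rebuilt payload now matches the target cache's encoding byte-for-byte.
        System.out.println(Arrays.equals(rebuilt, "значение".getBytes(StandardCharsets.UTF_8))); // prints true
    }
}
```

The fast path matters: objects that already carry the expected encoding ID would pay no extra cost, and re-encoding would only kick in on an actual mismatch.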
