LouisLou2 commented on issue #2096:
URL: https://github.com/apache/fury/issues/2096#issuecomment-2863805158

   Regarding the potential derivative problems that this issue might cause:
   
   Consider the case where a developer registers a class and might only provide 
the `typename` without a `namespace`, for example:
   ```java
   fury.register(RecordA.class, "recordA"); // namespace is empty, typename is 
"recordA"
   ```
   In this situation, when Fury deserializes the `MetaStringBytes` 
corresponding to the `namespace` (which is an empty string, thus `len=0`), a 
problem arises.
   
   Specifically, in the 
`MetaStringResolver.MetaStringReader.readMetaStringBytes` method:
   ```java
   public MetaStringBytes readMetaStringBytes(MemoryBuffer buffer) {
       int header = buffer.readVarUint32Small7();
       int len = header >>> 1;
       if ((header & 0b1) == 0) { // Assuming this condition holds, meaning 
this is the first time reading a MetaStringBytes of this len
         MetaStringBytes byteString =
             len > SMALL_STRING_THRESHOLD
                 ? readBigMetaStringBytes(buffer, len, buffer.readInt64())
                 : readSmallMetaStringBytes(buffer, len); // When len == 0, 
this branch is taken
         updateDynamicString(byteString);
         return byteString;
       } else {
         return dynamicReadStringIds[len - 1];
       }
     }
   ```
   If `(header & 0b1) == 0` (i.e., this is the first time encountering this 
`MetaStringBytes` of `len==0` in the current serialization/deserialization 
process) and `len == 0`, the code will call `readSmallMetaStringBytes(buffer, 
len)`.
   
   Now, let's look at the `readSmallMetaStringBytes` method:
   ```java
   private MetaStringBytes readSmallMetaStringBytes(MemoryBuffer buffer, int 
len) {
       long v1, v2 = 0;
       byte encoding = buffer.readByte(); // Reads the encoding type
       if (len <= 8) {
         v1 = buffer.readBytesAsInt64(len); // When len == 0, calls 
readBytesAsInt64(0)
       } else {
         v1 = buffer.readInt64();
         v2 = buffer.readBytesAsInt64(len - 8);
       }
       MetaStringBytes byteString = longLongMap.get(v1, v2); // Attempts to get 
from cache using v1, v2 as key
       if (byteString == null) {
         // If cache miss, create a new MetaStringBytes and put it into 
longLongMap
         byteString = createSmallMetaStringBytes(len, encoding, v1, v2);
       }
       return byteString;
     }
   ```
   When `len == 0`:
   1.  `v2` is initialized to 0.
   2.  The value of `v1` is obtained through `buffer.readBytesAsInt64(len)`, 
which becomes `buffer.readBytesAsInt64(0)`.
   
   The critical point is the step `v1 = buffer.readBytesAsInt64(0)`. Due to the 
behavior of `readBytesAsInt64(0)` as pointed out in the original issue (#2096) 
– where for `len=0`, its return value depends on the next 8 bytes in the buffer 
rather than being a deterministic value (e.g., 0) that is independent of the 
buffer's content – the value of `v1` becomes "unstable." For a zero-length 
string, `v1` should ideally be a fixed value (like 0) and not be affected by 
subsequent bytes in the buffer.
   
   This unstable `v1` value (while `v2` is 0) renders the caching mechanism of 
`longLongMap.get(v1, v2)` for `MetaStringBytes` of length 0 largely 
ineffective. Because each call to `readBytesAsInt64(0)` might yield a different 
`v1` (depending on the subsequent bytes in the buffer at that moment), 
`createSmallMetaStringBytes` is likely to be called frequently. This results in 
multiple different mappings from `(v1, v2)` to `MetaStringBytes` being created 
in `longLongMap` for the same logical empty string (`len=0`).
   
   The net result is that during each (de)serialization of an object with an 
empty `namespace`, a new `MetaStringBytes` instance is likely created for this 
empty `namespace` instead of reusing a cached object. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to