LouisLou2 commented on issue #2096:
URL: https://github.com/apache/fury/issues/2096#issuecomment-2863805158
Regarding the potential derivative problems that this issue might cause:
Consider the case where a developer registers a class and might only provide
the `typename` without a `namespace`, for example:
```java
fury.register(RecordA.class, "recordA"); // namespace is empty, typename is
"recordA"
```
In this situation, when Fury deserializes the `MetaStringBytes`
corresponding to the `namespace` (which is an empty string, thus `len=0`), a
problem arises.
Specifically, in the
`MetaStringResolver.MetaStringReader.readMetaStringBytes` method:
```java
public MetaStringBytes readMetaStringBytes(MemoryBuffer buffer) {
int header = buffer.readVarUint32Small7();
int len = header >>> 1;
if ((header & 0b1) == 0) { // Assuming this condition holds, meaning
this is the first time reading a MetaStringBytes of this len
MetaStringBytes byteString =
len > SMALL_STRING_THRESHOLD
? readBigMetaStringBytes(buffer, len, buffer.readInt64())
: readSmallMetaStringBytes(buffer, len); // When len == 0,
this branch is taken
updateDynamicString(byteString);
return byteString;
} else {
return dynamicReadStringIds[len - 1];
}
}
```
If `(header & 0b1) == 0` (i.e., this is the first time encountering this
`MetaStringBytes` of `len==0` in the current serialization/deserialization
process) and `len == 0`, the code will call `readSmallMetaStringBytes(buffer,
len)`.
Now, let's look at the `readSmallMetaStringBytes` method:
```java
private MetaStringBytes readSmallMetaStringBytes(MemoryBuffer buffer, int
len) {
long v1, v2 = 0;
byte encoding = buffer.readByte(); // Reads the encoding type
if (len <= 8) {
v1 = buffer.readBytesAsInt64(len); // When len == 0, calls
readBytesAsInt64(0)
} else {
v1 = buffer.readInt64();
v2 = buffer.readBytesAsInt64(len - 8);
}
MetaStringBytes byteString = longLongMap.get(v1, v2); // Attempts to get
from cache using v1, v2 as key
if (byteString == null) {
// If cache miss, create a new MetaStringBytes and put it into
longLongMap
byteString = createSmallMetaStringBytes(len, encoding, v1, v2);
}
return byteString;
}
```
When `len == 0`:
1. `v2` is initialized to 0.
2. The value of `v1` is obtained through `buffer.readBytesAsInt64(len)`,
which becomes `buffer.readBytesAsInt64(0)`.
The critical point is the step `v1 = buffer.readBytesAsInt64(0)`. Due to the
behavior of `readBytesAsInt64(0)` as pointed out in the original issue (#2096)
– where for `len=0`, its return value depends on the next 8 bytes in the buffer
rather than being a deterministic value (e.g., 0) that is independent of the
buffer's content – the value of `v1` becomes "unstable." For a zero-length
string, `v1` should ideally be a fixed value (like 0) and not be affected by
subsequent bytes in the buffer.
This unstable `v1` value (while `v2` is 0) renders the caching mechanism of
`longLongMap.get(v1, v2)` for `MetaStringBytes` of length 0 largely
ineffective. Because each call to `readBytesAsInt64(0)` might yield a different
`v1` (depending on the subsequent bytes in the buffer at that moment),
`createSmallMetaStringBytes` is likely to be called frequently. This results in
multiple different mappings from `(v1, v2)` to `MetaStringBytes` being created
in `longLongMap` for the same logical empty string (`len=0`).
The net result is that during each (de)serialization of an object with an
empty `namespace`, a new `MetaStringBytes` instance is likely created for this
empty `namespace` instead of reusing a cached object.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]