lzaeh commented on PR #2339:
URL: https://github.com/apache/fory/pull/2339#issuecomment-3038344099
## Major Modifications & Explanations
Although this PR passes all current tests, the implementation approach is
still debatable and some issues remain. Specifically, while it passes the
existing test cases, the current test coverage is insufficient. For instance,
in the case of pure array cross-language serialization, deserialization on the
Go side still fails. Further investigation and testing will be conducted to
explore and resolve this issue.
This PR mainly addresses the following:
- Deprecated the `allocate ID` mechanism
- Aligned the serialization and deserialization of named structs
- Fixed map serialization/deserialization issues
- Partially resolved issues related to array and slice serialization
------
### 1. Behavior of Named Struct Registration
Currently, when manually registering a named struct, the registration
function **does not** immediately perform actual type registration. Instead,
registration is deferred until the first time the struct is serialized. At
first, I mistakenly believed this to be an intentional lazy registration
mechanism.
However, consider this scenario: both Python and Go sides register the same
named struct, and a serialized struct instance is sent from Python to Go. If
the Go side only performs deserialization (without prior serialization), its
type system hasn't completed the registration, leading to deserialization
failure.
Therefore, just like in Python, performing a **complete type registration at
registration time** is essential. This PR adjusts the Go implementation
accordingly to ensure consistency with other languages.
------
### 2. Handling of Allocate ID
After communicating with Pandalee, we confirmed that the `allocate ID`
mechanism is problematic in cross-language scenarios.
Many local test cases—especially those involving arrays or struct
pointers—rely on the `allocate ID` mechanism to pass. But once tested across
languages, serious compatibility issues arise, which have now been resolved
during the type registration process.
During local named struct tests, I discovered that for the same struct, the
pointer type's typeID is dynamically allocated via `allocate ID`. This allows
nested structs (either value or pointer) to be recognized correctly locally.
At registration time, we can distinguish between value and pointer types
based on the sign of the typeID.
If we don’t differentiate further, we risk overwriting type information:
e.g., both `A` and `*A` (value and pointer types) may share the same
`namespaceBytes` and `typenameBytes`, and whichever is registered last will
overwrite the other. This results in deserialization errors (e.g., returning a
pointer when a value is expected).
To fix this, the `readTypeInfo` phase reconstructs the correct `typeInfo`
based on the sign of the typeID, ensuring accurate deserialization logic.
A related issue is discussed in Section 6: "Array vs Slice".
------
### 3. Unified Reference Tracking Logic
To align with the `need_to_write_ref` behavior, I added an explicit
`need_to_write_ref` method to every serializer.
While this approach may not be elegant, it ensures correct reference
handling for struct types. Future refinements may optimize this triggering
mechanism.
Currently, reference tracking for struct types is supported, but its
implementation still needs to be strengthened, corrected, and is open to
further discussion.
------
### 4. Problems in Map Serialization Logic
Following Pandalee’s suggestion, I temporarily unified the implementation
using `mapSerializer` to improve overall functionality.
The original `write` and `read` logic of `mapSerializer` was inconsistent
with other languages. I have aligned it based on Python’s implementation.
------
### 5. Maps Inside Named Structs vs. Maps Everywhere Else
Based on Python’s serialization logic, we observe the following:
#### For Map fields not inside a struct:
Python writes the full type info for each key/value, ensuring the
deserialization side (other languages) can reconstruct them correctly.
For example, when writing a chunk, Python writes typeID info for the leading
key-value pair. Since the map type is resolved via typeID, the initial
`mapSerializer` starts with `none` for both key and value serializers. After
each chunk, the kv serializers are reset to `none`, which simplifies
deserialization.
#### For Map fields within a named struct:
Python creates a concrete type like `map[int]string`. As key/value types are
known, **typeID info is not written**.
In this case, the deserialization side reconstructs the correct key/value
types using the `mapSerializer` that was registered with the named struct. As
long as the target language also registers the same struct, deserialization
works.
#### From Java’s perspective:
If a map is **not** inside a struct, and no type-specific `mapSerializer`
exists, how can the deserializer determine kv types?
Even if the typeID gives hints, the container `map` still needs to be
instantiated. A common fallback is to use `Map<Object, Object>` (i.e.,
`map[interface{}]interface{}` in Go), letting the user cast types afterward.
------
### Strategy in Go:
1. For **maps inside named structs**, directly instantiate concrete typed
maps.
2. For all **other maps**, fallback to `map[interface{}]interface{}` during
deserialization.
3. Introduced a temporary flag `mapInStruct` in `mapSerializer` to indicate
whether the current serializer originates from a struct field.
4. While `mapInStruct` is conceptually sound, the implementation still has
room for refinement due to my limited coding experience.
------
### 6. Array vs Slice
Originally, I wrote a lengthy explanation here, but after attempting—and
failing—to solve the problem at its root, I removed the original content. What
follows is a concise yet serious statement:
In fory-go, slices are handled adequately, but array types pose significant
challenges.
Concrete slice types are correctly received on the client side.
For array types, only concrete arrays within named structs are properly
received; in all other array scenarios, they’re treated as slices and must be
converted on the client side.
These details and issues are the trickiest I’ve encountered so far. I
haven’t completed a thorough analysis yet and will focus on this problem going
forward.
------
### 7. Support for Multi-Dimensional Slices
- During serialization: handled recursively as `[]interface{}`
- During deserialization: restored as `[]interface{}` and requires user-side
conversion to target type
------
### 8. Other Notes
- Where needed, I aligned Go’s serialization and deserialization logic with
Python’s to ensure cross-language compatibility.
- I’ve corrected the field ordering in named structs, but haven’t yet
addressed the time‐enum cases due to limited familiarity with the type system.
Moreover, our current alignment approach—designed to avoid disrupting the
existing code structure—may be suboptimal (it might be doing too much sorting).
We’ll look into optimizing it further going forward.
- Modifications to Some Tests:
- In cross-language complex struct tests, when aligning struct field
types, F11 should be an array.
- In local testing, I temporarily commented out the []int8 type because
the existing serializer resources appear to be incorrect
- In cross-language struct tests, I chose to test structs by value. If I
were to test with pointer types, then when Python serializes to Go—Python only
transmits values—Go would deserialize the bytes into a value type, unaware that
the test’s starting point was a pointer, and the resulting type mismatch would
cause the test to fail. Unifying tests around value types therefore makes the
most sense. Bidirectional testing of pointer types must be done with a language
system that natively supports pointers.
- Temporarily added a type conversion function, as discussed above
regarding map and array handling.
- Scenarios Currently Requiring Conversion Functions:
- Arrays not in named structs
- Maps not in named structs whose type is not map[interface{}]interface{}
- Multi-dimensional slices
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]