Re: [PR] perf(Go/python): Type system compatibility upgrade [fory]

via GitHub Sat, 05 Jul 2025 00:22:56 -0700


lzaeh commented on PR #2339:
URL: https://github.com/apache/fory/pull/2339#issuecomment-3038344099


   ## Major Modifications & Explanations
   
   Although this PR passes all current tests, the implementation approach is 
still debatable and some issues remain. Specifically, while it passes the 
existing test cases, the current test coverage is insufficient. For instance, 
in the case of pure array cross-language serialization, deserialization on the 
Go side still fails. Further investigation and testing will be conducted to 
explore and resolve this issue.
   
   This PR mainly addresses the following:
   
   - Deprecated the `allocate ID` mechanism  
   - Aligned the serialization and deserialization of named structs  
   - Fixed map serialization/deserialization issues  
   - Partially resolved issues related to array and slice serialization
   
   ------
   
   ### 1. Behavior of Named Struct Registration
   
   Currently, when manually registering a named struct, the registration 
function **does not** immediately perform actual type registration. Instead, 
registration is deferred until the first time the struct is serialized. At 
first, I mistakenly believed this to be an intentional lazy registration 
mechanism.
   
   However, consider this scenario: both Python and Go sides register the same 
named struct, and a serialized struct instance is sent from Python to Go. If 
the Go side only performs deserialization (without prior serialization), its 
type system hasn't completed the registration, leading to deserialization 
failure.
   
   Therefore, just like in Python, performing a **complete type registration at 
registration time** is essential. This PR adjusts the Go implementation 
accordingly to ensure consistency with other languages.
   
   ------
   
   ### 2. Handling of Allocate ID
   
   After communicating with Pandalee, we confirmed that the `allocate ID` 
mechanism is problematic in cross-language scenarios.
   
   Many local test cases—especially those involving arrays or struct 
pointers—rely on the `allocate ID` mechanism to pass. But once tested across 
languages, serious compatibility issues arise, which have now been resolved 
during the type registration process.
   
   During local named struct tests, I discovered that for the same struct, the 
pointer type's typeID is dynamically allocated via `allocate ID`. This allows 
nested structs (either value or pointer) to be recognized correctly locally.
   
   At registration time, we can distinguish between value and pointer types 
based on the sign of the typeID.
   
   If we don’t differentiate further, we risk overwriting type information: 
e.g., both `A` and `*A` (value and pointer types) may share the same 
`namespaceBytes` and `typenameBytes`, and whichever is registered last will 
overwrite the other. This results in deserialization errors (e.g., returning a 
pointer when a value is expected).
   
   To fix this, the `readTypeInfo` phase reconstructs the correct `typeInfo` 
based on the sign of the typeID, ensuring accurate deserialization logic.
   
   A related issue is discussed in Section 6: "Array vs Slice".
   
   ------
   
   ### 3. Unified Reference Tracking Logic
   
   To align with the `need_to_write_ref` behavior, I added an explicit 
`need_to_write_ref` method to every serializer.
   
   While this approach may not be elegant, it ensures correct reference 
handling for struct types. Future refinements may optimize this triggering 
mechanism.
   
   Currently, reference tracking for struct types is supported, but its 
implementation still needs to be strengthened, corrected, and is open to 
further discussion.
   
   ------
   
   ### 4. Problems in Map Serialization Logic
   
   Following Pandalee’s suggestion, I temporarily unified the implementation 
using `mapSerializer` to improve overall functionality.
   
   The original `write` and `read` logic of `mapSerializer` was inconsistent 
with other languages. I have aligned it based on Python’s implementation.
   
   ------
   
   ### 5. Maps Inside Named Structs vs. Maps Everywhere Else
   
   Based on Python’s serialization logic, we observe the following:
   
   #### For Map fields not inside a struct:
   
   Python writes the full type info for each key/value, ensuring the 
deserialization side (other languages) can reconstruct them correctly.
   
   For example, when writing a chunk, Python writes typeID info for the leading 
key-value pair. Since the map type is resolved via typeID, the initial 
`mapSerializer` starts with `none` for both key and value serializers. After 
each chunk, the kv serializers are reset to `none`, which simplifies 
deserialization.
   
   #### For Map fields within a named struct:
   
   Python creates a concrete type like `map[int]string`. As key/value types are 
known, **typeID info is not written**.
   
   In this case, the deserialization side reconstructs the correct key/value 
types using the `mapSerializer` that was registered with the named struct. As 
long as the target language also registers the same struct, deserialization 
works.
   
   #### From Java’s perspective:
   
   If a map is **not** inside a struct, and no type-specific `mapSerializer` 
exists, how can the deserializer determine kv types?
   
   Even if the typeID gives hints, the container `map` still needs to be 
instantiated. A common fallback is to use `Map<Object, Object>` (i.e., 
`map[interface{}]interface{}` in Go), letting the user cast types afterward.
   
   ------
   
   ### Strategy in Go:
   
   1. For **maps inside named structs**, directly instantiate concrete typed 
maps.  
   2. For all **other maps**, fallback to `map[interface{}]interface{}` during 
deserialization.  
   3. Introduced a temporary flag `mapInStruct` in `mapSerializer` to indicate 
whether the current serializer originates from a struct field.  
   4. While `mapInStruct` is conceptually sound, the implementation still has 
room for refinement due to my limited coding experience.
   
   ------
   
   ### 6. Array vs Slice
   
   Originally, I wrote a lengthy explanation here, but after attempting—and 
failing—to solve the problem at its root, I removed the original content. What 
follows is a concise yet serious statement:
   
   In fory-go, slices are handled adequately, but array types pose significant 
challenges.
   
   Concrete slice types are correctly received on the client side.
   
   For array types, only concrete arrays within named structs are properly 
received; in all other array scenarios, they’re treated as slices and must be 
converted on the client side.
   
   These details and issues are the trickiest I’ve encountered so far. I 
haven’t completed a thorough analysis yet and will focus on this problem going 
forward.
   
   ------
   
   ### 7. Support for Multi-Dimensional Slices
   
   - During serialization: handled recursively as `[]interface{}`  
   - During deserialization: restored as `[]interface{}` and requires user-side 
conversion to target type
   
   ------
   
   ### 8. Other Notes
   
   - Where needed, I aligned Go’s serialization and deserialization logic with 
Python’s to ensure cross-language compatibility.
   - I’ve corrected the field ordering in named structs, but haven’t yet 
addressed the time‐enum cases due to limited familiarity with the type system. 
Moreover, our current alignment approach—designed to avoid disrupting the 
existing code structure—may be suboptimal (it might be doing too much sorting). 
We’ll look into optimizing it further going forward.
   - Modifications to Some Tests:
      - In cross-language complex struct tests, when aligning struct field 
types, F11 should be an array.
      - In local testing, I temporarily commented out the []int8 type because 
the existing serializer resources appear to be incorrect
      - In cross-language struct tests, I chose to test structs by value. If I 
were to test with pointer types, then when Python serializes to Go—Python only 
transmits values—Go would deserialize the bytes into a value type, unaware that 
the test’s starting point was a pointer, and the resulting type mismatch would 
cause the test to fail. Unifying tests around value types therefore makes the 
most sense. Bidirectional testing of pointer types must be done with a language 
system that natively supports pointers.
      - Temporarily added a type conversion function, as discussed above 
regarding map and array handling.
   - Scenarios Currently Requiring Conversion Functions:
      - Arrays not in named structs
      - Maps not in named structs whose type is not map[interface{}]interface{}
      - Multi-dimensional slices


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] perf(Go/python): Type system compatibility upgrade [fory]

Reply via email to