[I] [Python] Create ForyField descriptor for dataclass field metadata [fory]

via GitHub Thu, 04 Dec 2025 21:19:26 -0800


chaokunyang opened a new issue, #3002:
URL: https://github.com/apache/fory/issues/3002


   ## Feature Request
   
   Create a `ForyField` descriptor, modeled on Python's `dataclasses.field`, 
to enable per-field performance and space optimizations during xlang 
serialization.
   
   ## Is your feature request related to a problem? Please describe
   
   Currently, Fory's Python xlang serialization treats all object fields 
uniformly:
   1. **Null checks are always performed** - Even for fields that are never 
null, Fory writes a null/ref flag (1 byte per field)
   2. **Reference tracking is always applied** (when enabled globally) - Even 
for fields that won't be shared/cyclic, objects are added to identity tracking 
with hash lookup cost
   3. **Field names use meta string encoding** - In schema evolution mode, 
field names are encoded using meta string compression, but for fields with long 
names, this still takes space
   
   These defaults ensure correctness but introduce unnecessary overhead when 
the developer has more specific knowledge about their data model.
   
   ## Describe the solution you'd like
   
   Add a `ForyField` descriptor that integrates with Python's dataclass and 
type hint system:
   
   ```python
   from dataclasses import dataclass
   from pyfory import Fory, ForyField
   from typing import Optional
   
   # Bar and Node stand for other user-defined classes (definitions omitted)
   
   @dataclass
   class Foo:
       # Field f1: non-nullable (default), no ref tracking (default)
       # Tag ID 0 provides compact encoding in schema evolution mode
       f1: str = ForyField(id=0)
       
       # Field f2: non-nullable (default), no ref tracking (default)
       f2: Bar = ForyField(id=1)
       
       # Field f3: nullable field that may contain null values
       f3: Optional[str] = ForyField(id=2, nullable=True)
       
       # Field parent: shared reference that needs tracking (e.g., for
       # circular refs)
       parent: Optional[Node] = ForyField(id=3, ref=True, nullable=True)
       
       # Field with long name: tag ID provides significant space savings
       very_long_field_name_that_would_take_many_bytes: str = ForyField(id=4)
       
       # Explicit opt-out (id=-1): keep field name encoding, still declare nullability
       optional_field: Optional[str] = ForyField(id=-1, nullable=True)
   ```
   
   ### ForyField API
   
   ```python
   from dataclasses import MISSING
   from typing import Any, Callable
   
   def ForyField(
       id: int,
       *,
       nullable: bool = False,
       ref: bool = False,
       default: Any = MISSING,
       default_factory: Callable[[], Any] = MISSING,
   ) -> Any:
       """
       Define a Fory-optimized field for dataclasses.
       
       Args:
           id: Field tag ID for schema evolution mode (REQUIRED).
               - When >= 0: Uses numeric ID instead of field name for
                 compact encoding
               - When -1: Explicitly opt-out, use field name with meta
                 string encoding
               Must be unique within the class (except -1) and stable
               across versions.
           
           nullable: Whether this field can be None.
               When False (default), Fory skips writing the null flag
               (saves 1 byte).
               When True, Fory writes null flag for nullable fields.
               Default: False (aligned with xlang protocol defaults)
           
           ref: Whether to track references for this field.
               When False (default):
               - Avoids adding the object to identity tracking (saves
                 hash overhead)
               - Skips writing ref tracking flag
               When True, enables reference tracking for shared/circular
               references.
               Default: False (aligned with xlang protocol defaults)
           
           default: Default value for the field (like dataclasses.field)
           
           default_factory: Factory function for default value (like
               dataclasses.field)
       
       Returns:
           A field descriptor compatible with dataclasses.
       """
   ```
   
   ### Alternative: Type Hint Style
   
   For users who prefer type hints over descriptors:
   
   ```python
   from dataclasses import dataclass
   from typing import Annotated, Optional
   from pyfory import ForyMeta
   
   @dataclass
   class Foo:
       # Using Annotated for metadata
       f1: Annotated[str, ForyMeta(id=0)]
       f2: Annotated[Optional[str], ForyMeta(id=1, nullable=True)]
       parent: Annotated[Optional[Node], ForyMeta(id=2, ref=True, 
nullable=True)]
   ```
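
As a rough illustration of how the `Annotated` style could be consumed, the sketch below reads per-field metadata back out with the standard `typing` helpers. The `ForyMeta` class here is a hypothetical stand-in for the proposed `pyfory.ForyMeta`, and `collect_fory_meta` is an illustrative helper name, not an existing pyfory function.

```python
from dataclasses import dataclass
from typing import Annotated, Optional, get_args, get_origin, get_type_hints


@dataclass(frozen=True)
class ForyMeta:  # hypothetical stand-in for the proposed pyfory.ForyMeta
    id: int
    nullable: bool = False
    ref: bool = False


@dataclass
class Foo:
    f1: Annotated[str, ForyMeta(id=0)]
    f2: Annotated[Optional[str], ForyMeta(id=1, nullable=True)]


def collect_fory_meta(cls) -> dict:
    """Map each field name to the ForyMeta attached to its annotation."""
    result = {}
    # include_extras=True keeps the Annotated wrapper instead of stripping it.
    for name, hint in get_type_hints(cls, include_extras=True).items():
        if get_origin(hint) is Annotated:
            for extra in get_args(hint)[1:]:  # args[0] is the underlying type
                if isinstance(extra, ForyMeta):
                    result[name] = extra
    return result
```

Calling `collect_fory_meta(Foo)` once at registration time would give the serializer the same information the descriptor style stores in `field.metadata`.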
   
   ### Design Decision: Required `id` Field
   
   We chose to make `id` a **required** parameter rather than optional. Here's 
the design rationale:
   
   1. **Explicit control principle**: If a developer uses `ForyField`, they are 
opting into explicit field-level control. Requiring an ID ensures they take 
full ownership of the field's serialization behavior.
   
   2. **Proven pattern**: Protocol Buffers has demonstrated that required field 
numbers work well for schema evolution.
   
   3. **Prevents subtle bugs**: Mixing tagged fields (with ID) and untagged 
fields (using field name) could lead to inconsistent encoding.
   
   4. **Opt-out with `id=-1`**: Users can explicitly opt-out of tag ID encoding 
while still benefiting from `nullable`/`ref` optimizations.
   
   ### Optimization Details
   
   #### 1. `nullable=False` (Default) Optimization
   
   When `nullable=False` (default):
   - Skip writing the null flag entirely (1 byte saved per field)
   - Directly serialize the field value
   - **Raise error** if field value is `None` at runtime
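
A minimal sketch of the behavior listed above, using a toy string writer. The flag byte values and the `write_str_field` helper are hypothetical and chosen only for illustration; they are not Fory's actual wire format.

```python
NULL_FLAG = 0xFD      # hypothetical flag byte: value is None
NOT_NULL_FLAG = 0xFF  # hypothetical flag byte: a value follows


def write_str_field(buf: bytearray, value, *, nullable: bool) -> None:
    """Append one string field to buf, writing a flag byte only if nullable."""
    if nullable:
        if value is None:
            buf.append(NULL_FLAG)
            return
        buf.append(NOT_NULL_FLAG)
    elif value is None:
        # nullable=False: no flag exists on the wire, so None cannot be encoded.
        raise ValueError("field is declared non-nullable but value is None")
    data = value.encode("utf-8")
    buf.append(len(data))  # sketch only: assumes payloads shorter than 256 bytes
    buf.extend(data)


def encoded_size(value, *, nullable: bool) -> int:
    """Return how many bytes the toy writer emits for one field."""
    buf = bytearray()
    write_str_field(buf, value, nullable=nullable)
    return len(buf)
```

In this toy format the same value costs one byte less when the field is declared non-nullable, which is the per-field saving the issue describes.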
   
   #### 2. `ref=False` (Default) Optimization
   
   When `ref=False` (default):
   - Skip identity tracking lookup/insertion
   - Skip ref flag when combined with `nullable=False`
   - Useful for value types, immutable objects, fields not part of circular 
references
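
The cost being avoided can be sketched with a toy identity tracker. `RefTracker` and `write_field` are hypothetical names, and the tuple output stands in for real bytes; this is an illustration of the idea, not Fory's implementation.

```python
class RefTracker:
    """Toy identity tracker: maps id(obj) to the index of its first write."""

    def __init__(self):
        self._seen = {}
        self.lookups = 0  # counts hash lookups, to make the cost visible

    def first_seen_index(self, obj):
        """Return the earlier index if obj was already written, else record it."""
        self.lookups += 1
        key = id(obj)
        if key in self._seen:
            return self._seen[key]
        self._seen[key] = len(self._seen)
        return None


def write_field(out: list, tracker: RefTracker, obj, *, ref: bool) -> None:
    """Emit either a back-reference or the value itself."""
    if ref:
        index = tracker.first_seen_index(obj)
        if index is not None:
            out.append(("ref", index))  # back-reference instead of re-serializing
            return
    out.append(("value", obj))  # with ref=False the tracker is never touched
```

With `ref=True`, writing the same object twice yields one value and one back-reference; with `ref=False`, the value is written twice and no lookups occur.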
   
   #### 3. `id` (Tag ID) Optimization
   
   When `id >= 0`:
   - Field name is written as an unsigned varint tag ID instead of meta string
   - For `very_long_field_name` (~20 chars), meta string encoding takes ~13 
bytes
   - With a small tag ID (one that fits in a single varint byte), it takes only 1 byte
   
   **Space savings:**
   
   | Field Name | Meta String (approx) | Tag ID |
   |------------|---------------------|--------|
   | `f1` | ~2 bytes | 1 byte |
   | `user_name` | ~6 bytes | 1 byte |
   | `transaction_id` | ~10 bytes | 1 byte |
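
To make the 1-byte figure concrete, here is a generic LEB128-style unsigned varint encoder. It is a sketch written for this discussion, assuming the tag ID is encoded in this common style; it is not taken from Fory's source, and the meta string encoding is not reproduced here.

```python
def encode_varuint(value: int) -> bytes:
    """Encode a non-negative int as a little-endian base-128 varint."""
    if value < 0:
        raise ValueError("varuint cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F  # low 7 bits carry the payload
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)
```

Under this encoding, IDs from 0 to 127 occupy one byte and IDs from 128 to 16383 occupy two, so a class would need more than 128 tagged fields before the tag cost grows.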
   
   ### Implementation Notes
   
   1. **Dataclass Integration**:
      - `ForyField` should return a `dataclasses.field()` compatible object
      - Store Fory metadata in `field.metadata`
      - Support both `default` and `default_factory`
   
   2. **Type Registration**:
      - Parse `ForyField` metadata during class registration
      - Store field metadata in type info
      - Use metadata during serialization/deserialization
   
   3. **Validation**:
      - **Raise error** if `nullable=False` but field value is `None` at runtime
      - **Raise error** if tag IDs (>= 0) are not unique within a class
      - **Raise error** if `id < -1`
   
   4. **Cython Integration**:
      - Metadata should be accessible from both Python and Cython modes
      - Cache field metadata for performance
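
The dataclass integration and validation notes above could be sketched as follows. The names `fory_field`, `collect_field_meta`, and the `"fory"` metadata key are hypothetical, used here only to show the shape of the approach with the standard `dataclasses` module.

```python
import dataclasses
from dataclasses import MISSING, dataclass
from typing import Optional

FORY_KEY = "fory"  # hypothetical key under which metadata is stored


def fory_field(id: int, *, nullable: bool = False, ref: bool = False,
               default=MISSING, default_factory=MISSING):
    """Sketch of ForyField: a dataclasses.field() carrying Fory metadata."""
    if id < -1:
        raise ValueError(f"id must be >= -1, got {id}")
    meta = {FORY_KEY: {"id": id, "nullable": nullable, "ref": ref}}
    # MISSING is dataclasses' own sentinel, so it can be passed straight through.
    return dataclasses.field(default=default, default_factory=default_factory,
                             metadata=meta)


def collect_field_meta(cls) -> dict:
    """Gather per-field metadata at registration; reject duplicate tag IDs."""
    seen, result = {}, {}
    for f in dataclasses.fields(cls):
        meta = f.metadata.get(FORY_KEY)
        if meta is None:
            continue  # plain dataclass field without Fory metadata
        tag = meta["id"]
        if tag >= 0:
            if tag in seen:
                raise ValueError(
                    f"tag ID {tag} used by both {seen[tag]!r} and {f.name!r}")
            seen[tag] = f.name
        result[f.name] = meta
    return result


@dataclass
class Foo:
    f1: str = fory_field(id=0)
    f3: Optional[str] = fory_field(id=2, nullable=True, default=None)
```

Running `collect_field_meta(Foo)` once per class and caching the result would match the "cache field metadata" note, since the returned dictionary is independent of any instance.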
   
   ### Performance Impact
   
   For a dataclass with 10 fields using default settings (`nullable=False`, 
`ref=False`):
   - **Space savings**: ~20 bytes per object (null + ref flags)
   - **CPU savings**: 10 fewer identity tracking operations per serialization
   
   ## Additional context
   
   This is the Python equivalent of Java's `@ForyField` annotation. See [Java 
issue #3000](https://github.com/apache/fory/issues/3000) for the original 
design discussion.
   
   Protocol spec: 
https://fory.apache.org/docs/specification/fory_xlang_serialization_spec


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

