chaokunyang opened a new issue, #3002:
URL: https://github.com/apache/fory/issues/3002
## Feature Request
Create a `ForyField` descriptor, modeled on Python's `dataclasses.field()`,
to reduce serialization time and payload size in xlang serialization.
## Is your feature request related to a problem? Please describe
Currently, Fory's Python xlang serialization treats all object fields
uniformly:
1. **Null checks are always performed** - Even for fields that are never
null, Fory writes a null/ref flag (1 byte per field)
2. **Reference tracking is always applied** (when enabled globally) - Even
for fields that won't be shared/cyclic, objects are added to identity tracking
with hash lookup cost
3. **Field names use meta string encoding** - In schema evolution mode,
field names are encoded using meta string compression, but for fields with long
names, this still takes space
These defaults ensure correctness but introduce unnecessary overhead when
the developer has more specific knowledge about their data model.
## Describe the solution you'd like
Add a `ForyField` descriptor that integrates with Python's dataclass and
type hint system:
```python
from dataclasses import dataclass
from typing import Optional

from pyfory import Fory, ForyField

# `Bar` and `Node` stand for user-defined classes (not shown here).


@dataclass
class Foo:
    # Field f1: non-nullable (default), no ref tracking (default)
    # Tag ID 0 provides compact encoding in schema evolution mode
    f1: str = ForyField(id=0)
    # Field f2: non-nullable (default), no ref tracking (default)
    f2: Bar = ForyField(id=1)
    # Field f3: nullable field that may contain null values
    f3: Optional[str] = ForyField(id=2, nullable=True)
    # Field parent: shared reference that needs tracking (e.g., for circular refs)
    parent: Optional[Node] = ForyField(id=3, ref=True, nullable=True)
    # Field with long name: tag ID provides significant space savings
    very_long_field_name_that_would_take_many_bytes: str = ForyField(id=4)
    # Explicit opt-out: use field name encoding but get nullable optimization
    optional_field: Optional[str] = ForyField(id=-1, nullable=True)
```
### ForyField API
```python
from dataclasses import MISSING
from typing import Any, Callable


def ForyField(
    id: int,
    *,
    nullable: bool = False,
    ref: bool = False,
    default: Any = MISSING,
    default_factory: Callable[[], Any] = MISSING,
) -> Any:
    """
    Define a Fory-optimized field for dataclasses.

    Args:
        id: Field tag ID for schema evolution mode (REQUIRED).
            - When >= 0: Uses numeric ID instead of field name for compact
              encoding
            - When -1: Explicitly opt-out, use field name with meta string
              encoding
            Must be unique within the class (except -1) and stable across
            versions.
        nullable: Whether this field can be None.
            When False (default), Fory skips writing the null flag (saves 1
            byte).
            When True, Fory writes null flag for nullable fields.
            Default: False (aligned with xlang protocol defaults)
        ref: Whether to track references for this field.
            When False (default):
            - Avoids adding the object to identity tracking (saves hash
              overhead)
            - Skips writing ref tracking flag
            When True, enables reference tracking for shared/circular
            references.
            Default: False (aligned with xlang protocol defaults)
        default: Default value for the field (like dataclasses.field)
        default_factory: Factory function for default value (like
            dataclasses.field)

    Returns:
        A field descriptor compatible with dataclasses.
    """
```
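One way the signature above could be layered on `dataclasses.field()` is sketched below. This is not pyfory's implementation: the helper name `fory_field`, the metadata key `"fory"`, and the `fory_meta` reader are hypothetical names chosen for illustration.

```python
from dataclasses import MISSING, dataclass, field, fields
from typing import Any, Callable, Optional

FORY_KEY = "fory"  # hypothetical metadata key, not part of pyfory


def fory_field(
    id: int,
    *,
    nullable: bool = False,
    ref: bool = False,
    default: Any = MISSING,
    default_factory: Callable[[], Any] = MISSING,
) -> Any:
    """Hypothetical stand-in for the proposed ForyField."""
    if id < -1:
        raise ValueError(f"id must be >= -1, got {id}")
    kwargs = {"metadata": {FORY_KEY: {"id": id, "nullable": nullable, "ref": ref}}}
    if default is not MISSING:
        kwargs["default"] = default
    if default_factory is not MISSING:
        kwargs["default_factory"] = default_factory
    # dataclasses.field() itself rejects default and default_factory together.
    return field(**kwargs)


@dataclass
class User:
    name: str = fory_field(id=0)
    nickname: Optional[str] = fory_field(id=1, nullable=True, default=None)
    tags: list = fory_field(id=2, default_factory=list)


def fory_meta(cls) -> dict:
    """Collect the per-field Fory settings stored in field.metadata."""
    return {f.name: f.metadata[FORY_KEY] for f in fields(cls)}
```

Because the settings live in `field.metadata`, the class remains an ordinary dataclass: `User(name="a")` works as usual, and `fory_meta(User)["nickname"]["nullable"]` is `True`.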
### Alternative: Type Hint Style
For users who prefer type hints over descriptors:
```python
from dataclasses import dataclass
from typing import Annotated, Optional

from pyfory import ForyMeta


@dataclass
class Foo:
    # Using Annotated for metadata
    f1: Annotated[str, ForyMeta(id=0)]
    f2: Annotated[Optional[str], ForyMeta(id=1, nullable=True)]
    parent: Annotated[Optional[Node], ForyMeta(id=2, ref=True, nullable=True)]
```
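For the `Annotated` style, the metadata could be recovered at registration time with the standard `typing` helpers. The sketch below is an assumption about how that might look: `ForyMeta` here is a local stand-in dataclass, not the proposed pyfory class, and `collect_fory_meta` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Annotated, Optional, get_args, get_origin, get_type_hints


@dataclass(frozen=True)
class ForyMeta:  # hypothetical stand-in for the proposed pyfory.ForyMeta
    id: int
    nullable: bool = False
    ref: bool = False


@dataclass
class Foo:
    f1: Annotated[str, ForyMeta(id=0)]
    f2: Annotated[Optional[str], ForyMeta(id=1, nullable=True)]
    f3: int  # no Fory metadata attached


def collect_fory_meta(cls) -> dict:
    """Map field name -> ForyMeta for every field annotated with one."""
    result = {}
    # include_extras=True keeps the Annotated wrapper instead of stripping it.
    for name, hint in get_type_hints(cls, include_extras=True).items():
        if get_origin(hint) is Annotated:
            for extra in get_args(hint)[1:]:
                if isinstance(extra, ForyMeta):
                    result[name] = extra
    return result
```

Fields without a `ForyMeta` entry, such as `f3`, are simply absent from the result, so a serializer could fall back to its defaults for them.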
### Design Decision: Required `id` Field
We chose to make `id` a **required** parameter rather than optional. Here's
the design rationale:
1. **Explicit control principle**: If a developer uses `ForyField`, they are
opting into explicit field-level control. Requiring an ID ensures they take
full ownership of the field's serialization behavior.
2. **Proven pattern**: Protocol Buffers has demonstrated that required field
numbers work well for schema evolution.
3. **Prevents subtle bugs**: Mixing tagged fields (with ID) and untagged
fields (using field name) could lead to inconsistent encoding.
4. **Opt-out with `id=-1`**: Users can explicitly opt-out of tag ID encoding
while still benefiting from `nullable`/`ref` optimizations.
### Optimization Details
#### 1. `nullable=False` (Default) Optimization
When `nullable=False` (default):
- Skip writing the null flag entirely (1 byte saved per field)
- Directly serialize the field value
- **Raise error** if field value is `None` at runtime
#### 2. `ref=False` (Default) Optimization
When `ref=False` (default):
- Skip identity tracking lookup/insertion
- Skip ref flag when combined with `nullable=False`
- Useful for value types, immutable objects, fields not part of circular
references
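The effect of the two flags can be illustrated with a toy writer. This is a sketch only: the flag values, the one-byte length prefix, and the `ToyWriter` class are invented for the example and do not reflect Fory's actual wire format.

```python
NULL_FLAG, REF_FLAG, VALUE_FLAG = 0, 1, 2  # illustrative, not Fory's constants


class ToyWriter:
    """Toy string-field encoder; supports payloads and ref indexes below 256."""

    def __init__(self):
        self.buf = bytearray()
        self.seen = {}  # id(obj) -> index, only touched when ref=True

    def write_field(self, value, *, nullable=False, ref=False):
        if value is None:
            if not nullable:
                raise TypeError("nullable=False field holds None")
            self.buf.append(NULL_FLAG)
            return
        if ref:
            index = self.seen.get(id(value))
            if index is not None:
                # Already written: emit a back-reference instead of the value.
                self.buf += bytes([REF_FLAG, index])
                return
            self.seen[id(value)] = len(self.seen)
            self.buf.append(VALUE_FLAG)
        elif nullable:
            self.buf.append(VALUE_FLAG)
        # nullable=False and ref=False: no flag byte, no identity-map access.
        data = value.encode("utf-8")
        self.buf += bytes([len(data)]) + data
```

With both defaults, `write_field("ab")` emits 3 bytes (length plus payload) and never touches `seen`; with `nullable=True` the same value costs 4 bytes because of the extra flag.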
#### 3. `id` (Tag ID) Optimization
When `id >= 0`:
- Field name is written as an unsigned varint tag ID instead of meta string
- For `very_long_field_name` (~20 chars), meta string encoding takes ~13
bytes
- With tag ID, it takes only 1 byte
**Space savings:**
| Field Name | Meta String (approx) | Tag ID |
|------------|---------------------|--------|
| `f1` | ~2 bytes | 1 byte |
| `user_name` | ~6 bytes | 1 byte |
| `transaction_id` | ~10 bytes | 1 byte |
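The "1 byte" column follows from how unsigned varints work. The sketch below assumes a plain LEB128-style varint (7 payload bits per byte); Fory's actual field header may pack extra bits alongside the ID, so the exact thresholds may differ. `encode_varuint` is a name made up for this example.

```python
def encode_varuint(n: int) -> bytes:
    """LEB128-style unsigned varint: 7 payload bits per byte, MSB = continue."""
    if n < 0:
        raise ValueError("unsigned varint cannot encode negative values")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


# Tag IDs 0..127 fit in a single byte; 128..16383 need two.
assert len(encode_varuint(4)) == 1
assert len(encode_varuint(127)) == 1
assert len(encode_varuint(128)) == 2
```

Under this assumption, a class would need more than 128 tagged fields before any tag ID costs a second byte.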
### Implementation Notes
1. **Dataclass Integration**:
- `ForyField` should return a `dataclasses.field()` compatible object
- Store Fory metadata in `field.metadata`
- Support both `default` and `default_factory`
2. **Type Registration**:
- Parse `ForyField` metadata during class registration
- Store field metadata in type info
- Use metadata during serialization/deserialization
3. **Validation**:
- **Raise error** if `nullable=False` but field value is `None` at runtime
- **Raise error** if tag IDs (>= 0) are not unique within a class
- **Raise error** if `id < -1`
4. **Cython Integration**:
- Metadata should be accessible from both Python and Cython modes
- Cache field metadata for performance
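The tag ID checks in the validation step could run once at class registration. A minimal sketch, assuming the Fory settings are stored under a hypothetical `"fory"` key in `field.metadata` (the key and the `validate_tag_ids` helper are illustrative, not pyfory API):

```python
from dataclasses import dataclass, field, fields

FORY_KEY = "fory"  # hypothetical metadata key


def validate_tag_ids(cls) -> dict:
    """Return {tag_id: field_name}; raise on invalid or duplicate IDs."""
    by_id = {}
    for f in fields(cls):
        meta = f.metadata.get(FORY_KEY)
        if meta is None:
            continue  # plain dataclass field, nothing to validate
        tag = meta["id"]
        if tag < -1:
            raise ValueError(f"{cls.__name__}.{f.name}: id must be >= -1, got {tag}")
        if tag == -1:
            continue  # opt-out: encoded by field name, uniqueness not required
        if tag in by_id:
            raise ValueError(
                f"{cls.__name__}: tag id {tag} used by both "
                f"{by_id[tag]!r} and {f.name!r}"
            )
        by_id[tag] = f.name
    return by_id


@dataclass
class Good:
    a: int = field(metadata={FORY_KEY: {"id": 0}})
    b: int = field(metadata={FORY_KEY: {"id": -1}})
    c: int = field(metadata={FORY_KEY: {"id": -1}})


@dataclass
class Bad:
    a: int = field(metadata={FORY_KEY: {"id": 0}})
    b: int = field(metadata={FORY_KEY: {"id": 0}})
```

`validate_tag_ids(Good)` returns `{0: "a"}`, while `validate_tag_ids(Bad)` raises `ValueError`. The returned mapping could also serve as the cached tag-to-field lookup used during deserialization.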
### Performance Impact
For a dataclass with 10 fields using default settings (`nullable=False`,
`ref=False`):
- **Space savings**: ~20 bytes per object (null + ref flags)
- **CPU savings**: 10 fewer identity tracking operations per serialization
## Additional context
This is the Python equivalent of Java's `@ForyField` annotation. See [Java
issue #3000](https://github.com/apache/fory/issues/3000) for the original
design discussion.
Protocol spec:
https://fory.apache.org/docs/specification/fory_xlang_serialization_spec