This is an automated email from the ASF dual-hosted git repository.
chaokunyang pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/fory-site.git
The following commit(s) were added to refs/heads/main by this push:
new ad51adf29 🔄 synced local 'docs/specification/' with remote 'docs/specification/'
ad51adf29 is described below
commit ad51adf296be05e0c15b302a5ec6857f72fcd646
Author: chaokunyang <[email protected]>
AuthorDate: Mon Jan 19 07:57:33 2026 +0000
🔄 synced local 'docs/specification/' with remote 'docs/specification/'
---
docs/specification/java_serialization_spec.md | 766 ++++++++++++-------------
docs/specification/xlang_serialization_spec.md | 547 +++++++-----------
2 files changed, 574 insertions(+), 739 deletions(-)
diff --git a/docs/specification/java_serialization_spec.md b/docs/specification/java_serialization_spec.md
index c7b8d22d8..ddfbadb3e 100644
--- a/docs/specification/java_serialization_spec.md
+++ b/docs/specification/java_serialization_spec.md
@@ -7,8 +7,8 @@ license: |
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
- (the "License"); you may not use this file except in compliance with
- the License. You may obtain a copy of the License at
+ (the "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
@@ -21,540 +21,488 @@ license: |
## Spec overview
-Apache Fory™ Java Serialization is an automatic object serialization framework that supports reference and polymorphism. Apache Fory™
-will
-convert an object from/to fory java serialization binary format. Apache Fory™ has two core concepts for java serialization:
+Apache Fory Java serialization is a dynamic binary format for Java object graphs. It supports
+shared references, circular references, polymorphism, and optional schema evolution. The format is
+stream friendly: shared type metadata is written inline when needed and there is no meta start
+offset.
-- **Apache Fory™ Java Binary format**
-- **Framework to convert object to/from Apache Fory™ Java Binary format**
+The Java native format is an extension of the xlang wire format and reuses the same core framing
+and encodings; see `docs/specification/xlang_serialization_spec.md` for the shared baseline.
-The serialization format is a dynamic binary format. The dynamics and reference/polymorphism support make Apache Fory™ flexible,
-much more easy to use, but
-also introduce more complexities compared to static serialization frameworks. So the format will be more complex.
-
-Here is the overall format:
+Overall layout:
```
-| fory header | object ref meta | object class meta | object value data |
+| fory header | object ref meta | object type meta | object value data |
```
-The data are serialized using little endian byte order overall. If bytes swap
is costly for some object,
-Fory will write the byte order for that object into the data instead of
converting it to little endian.
+All data is encoded in little endian byte order. When running on a big endian platform, array
+serializers swap byte order on write/read so the on-wire layout remains little endian.
## Fory header
-Fory header consists starts one byte:
+Java native serialization writes a one byte bitmap header. The header layout mirrors the xlang
+bitmap and uses the same flag bits.
```
-| 4 bits        | 1 bit | 1 bit | 1 bit  | 1 bit | optional 4 bytes                   |
-+---------------+-------+-------+--------+-------+------------------------------------+
-| reserved bits | oob   | xlang | endian | null  | unsigned int for meta start offset |
+| 5 bits | 1 bit | 1 bit | 1 bit |
++--------------+-------+-------+-------+
+| reserved | oob | xlang | null |
```
-- null flag: 1 when object is null, 0 otherwise. If an object is null, other
bits won't be set.
-- endian flag: 1 when data is encoded by little endian, 0 for big endian.
-- xlang flag: 1 when serialization uses xlang format, 0 when serialization
uses Fory java format.
-- oob flag: 1 when passed `BufferCallback` is not null, 0 otherwise.
-
-If meta share mode is enabled, an uncompressed unsigned int is appended to
indicate the start offset of metadata.
-
-## Reference Meta
+- null flag: 1 when the object is null, 0 otherwise. If the object is null, the other bits are not set.
+- xlang flag: 1 when serialization uses the xlang format, 0 when serialization uses the Java native format.
+- oob flag: 1 when `BufferCallback` is not null, 0 otherwise.
-Reference tracking handles whether the object is null, and whether to track
reference for the object by writing
-corresponding flags and maintaining internal state.
+If the xlang flag is set, a one byte language ID is written after the bitmap. In Java native mode (xlang
+flag unset), no language byte is written.
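A minimal Java sketch of writing and reading this bitmap. The bit positions are an assumption: the rightmost column of the layout is taken as bit 0, matching the "lower bits on the right" convention the spec states for the ClassDef header. `ForyHeaderSketch` and its methods are illustrative names, not Fory API.

```java
final class ForyHeaderSketch {
    // Assumed bit positions: rightmost column of the layout table is bit 0.
    static final int NULL_FLAG = 1;        // bit 0
    static final int XLANG_FLAG = 1 << 1;  // bit 1
    static final int OOB_FLAG = 1 << 2;    // bit 2

    /** Builds the one byte bitmap; a null object sets only the null flag. */
    static int encode(boolean isNull, boolean xlang, boolean oob) {
        if (isNull) {
            return NULL_FLAG;
        }
        int header = 0;
        if (xlang) {
            header |= XLANG_FLAG;
        }
        if (oob) {
            header |= OOB_FLAG;
        }
        return header;
    }

    static boolean isNull(int header) { return (header & NULL_FLAG) != 0; }
    static boolean isXlang(int header) { return (header & XLANG_FLAG) != 0; }
    static boolean isOob(int header) { return (header & OOB_FLAG) != 0; }

    /** A language ID byte follows the bitmap only when the xlang flag is set. */
    static boolean hasLanguageByte(int header) { return isXlang(header); }
}
```

In Java native mode the reader would therefore consume exactly one header byte before the reference meta.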
-Reference flags:
+## Reference meta
-| Flag                | Byte Value | Description |
-| ------------------- | ---------- | ----------- |
-| NULL FLAG           | `-3`       | This flag indicates the object is a null value. We don't use another byte to indicate REF, so that we can save one byte. |
-| REF FLAG            | `-2`       | This flag indicates the object is already serialized previously, and fory will write a ref id with unsigned varint format instead of serialize it again |
-| NOT_NULL VALUE FLAG | `-1`       | This flag indicates the object is a non-null value and fory doesn't track ref for this type of object. |
-| REF VALUE FLAG      | `0`        | This flag indicates the object is referencable and the first time to serialize. |
+Reference tracking uses the same flags as the xlang specification.
-When reference tracking is disabled globally or for specific types, or for
certain types within a particular
-context(e.g., a field of a class), only the `NULL` and `NOT_NULL VALUE` flags
will be used for reference meta.
+| Flag                | Byte Value | Description |
+| ------------------- | ---------- | ----------- |
+| NULL FLAG           | `-3`       | Object is null. No further bytes are written for this object. |
+| REF FLAG            | `-2`       | Object was already serialized. Followed by unsigned varint32 reference ID. |
+| NOT_NULL VALUE FLAG | `-1`       | Object is non-null but reference tracking is disabled for this type. Object data follows immediately. |
+| REF VALUE FLAG      | `0`        | Object is referencable and this is its first occurrence. Object data follows. Assigns next reference ID. |
-## Class Meta
+When reference tracking is disabled globally or for a specific field/type, only `NULL FLAG` and
+`NOT_NULL VALUE FLAG` are used.
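The flag selection can be sketched in Java as follows. This is an illustration only: `RefFlagSketch` is a hypothetical class, identity-based tracking via `IdentityHashMap` is one plausible way to detect repeats, and numbering reference IDs from 0 is an assumption (the table only says the "next" ID is assigned).

```java
import java.util.IdentityHashMap;
import java.util.Map;

final class RefFlagSketch {
    static final byte NULL_FLAG = -3;
    static final byte REF_FLAG = -2;
    static final byte NOT_NULL_VALUE_FLAG = -1;
    static final byte REF_VALUE_FLAG = 0;

    // Identity map: two equal but distinct objects get distinct reference IDs.
    private final Map<Object, Integer> writtenIds = new IdentityHashMap<>();
    private final boolean trackingRef;

    RefFlagSketch(boolean trackingRef) {
        this.trackingRef = trackingRef;
    }

    /** Returns the flag byte to write before obj. */
    byte flagFor(Object obj) {
        if (obj == null) {
            return NULL_FLAG;                   // nothing else is written
        }
        if (!trackingRef) {
            return NOT_NULL_VALUE_FLAG;         // object data follows
        }
        if (writtenIds.containsKey(obj)) {
            return REF_FLAG;                    // caller then writes refId(obj) as varuint32
        }
        writtenIds.put(obj, writtenIds.size()); // assumed: IDs are sequential from 0
        return REF_VALUE_FLAG;                  // object data follows
    }

    /** Reference ID of an object that was already seen by flagFor. */
    int refId(Object obj) {
        return writtenIds.get(obj);
    }
}
```

With tracking disabled the writer never emits `REF FLAG` or `REF VALUE FLAG`, which matches the rule that only the two remaining flags are used.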
-Fory supports to register class by an optional id, the registration can be
used for security check and class
-identification.
-If a class is registered, it will have a user-provided or an auto-growing
unsigned int i.e. `class_id`.
+## Type system and type IDs
-Depending on whether meta share mode and registration is enabled for current
class, Fory will write class meta
-differently.
-
-### Schema consistent
-
-If schema consistent mode is enabled globally or enabled for current class,
class meta will be written as follows:
-
-- If class is registered, it will be written as a fory unsigned varint:
`class_id << 1`.
-- If class is not registered:
- - If class is not an array, fory will write one byte `0bxxxxxxx1` first,
then write class name.
- - The first little bit is `1`, which is different from first bit `0` of
- encoded class id. Fory can use this information to determine whether to
read class by class id for
- deserialization.
- - If class is not registered and class is an array, fory will write one byte
`dimensions << 1 | 1` first, then write
- component
- class subsequently. This can reduce array class name cost if component
class is or will be serialized.
- - Class will be written as two enumerated fory unsigned by default: `package
name` and `class name`. If meta share
- mode is
- enabled,
- class will be written as an unsigned varint which points to index in
`MetaContext`.
-
-### Schema evolution
-
-If schema evolution mode is enabled globally or enabled for current class,
class meta will be written as follows:
-
-- If meta share mode is not enabled, class meta will be written as schema
consistent mode. Additionally, field meta such
- as field type
- and name will be written with the field value using a key-value like layout.
-- If meta share mode is enabled, class meta will be written as a meta-share
encoded binary if class hasn't been written
- before, otherwise an unsigned varint id which references to previous written
class meta will be written.
-
-## Meta share
-
-> This mode will forbid streaming writing since it needs to look back for
update the start offset after the whole object
-> graph
-> writing and meta collecting is finished. Only in this way we can ensure
deserialization failure doesn't lost shared
-> meta.
-> Meta streamline will be supported in the future for enclosed meta sharing
which doesn't cross multiple serializations
-> of different objects.
-
-For Schema consistent mode, class will be encoded as an enumerated string by
full class name. Here we mainly describe
-the meta layout for schema evolution mode:
+Java native serialization uses the unified type ID layout shared with xlang:
```
-| 8 bytes global meta header | 1~2 bytes | variable bytes | variable
bytes | variable bytes |
-+-------------------------------+-------------|--------------------+-------------------+----------------+
-| 50 bits hash + 14 bits header | type header | current class meta | parent
class meta | ... |
+full_type_id = (user_type_id << 8) | internal_type_id
```
-Class meta are encoded from parent class to leaf class, only class with
serializable fields will be encoded.
-
-### Global meta header
-
-Meta header is a 64 bits number value encoded in little endian order.
-
-- lower 12 bits are used to encode meta size. If meta size `>=
0b1111_1111_1111`, then write
- `meta_ size - 0b1111_1111_1111` next.
-- 13rd bit is used to indicate whether to write fields meta. When this class
is schema-consistent or use registered
- serializer, fields meta will be skipped. Class Meta will be used for share
namespace + type name only.
-- 14rd bit is used to indicate whether meta is compressed.
-- Other 50 bits is used to store the unique hash of `flags + all layers class
meta`.
-
-### Type header
-
-- Lowest 4 digits `0b0000~0b1110` are used to record num classes. `0b1111` is
preserved to indicate that Fory need to
- read more bytes for length using Fory unsigned int encoding. If current
class doesn't has parent class, or parent
- class doesn't have fields to serialize, or we're in a context which
serialize fields of current class
- only(`ObjectStreamSerializer#SlotInfo` is an example), num classes will be 1.
-- Other 4 bits are preserved to future extensions.
-- If num classes are greater than or equal to `0b1111`, write `num_classes -
0b1111` as varuint next.
-
-### Single layer class meta
-
-```
-| unsigned varint | meta string | meta string |
field info: variable bytes | variable bytes | ... |
-+----------------------------+-----------------------+---------------------+-------------------------------+-----------------+-----+
-| num fields + register flag | header + package name | header + class name |
header + type id + field name | next field info | ... |
-```
-
-- num fields: encode `num fields << 1 | register flag(1 when class
registered)` as unsigned varint.
- - If class is registered, then an unsigned varint class id will be written
next, package and class name will be
- omitted.
- - If current class is schema consistent, then num field will be `0` to flag
it.
- - If current class isn't schema consistent, then num field will be the
number of compatible fields. For example,
- users
- can use tag id to mark some field as compatible field in schema consistent
context. In such cases, schema
- consistent
- fields will be serialized first, then compatible fields will be serialized
next. At deserialization, Fory will use
- fields info of those fields which aren't annotated by tag id for
deserializing schema consistent fields, then use
- fields info in meta for deserializing compatible fields.
-- Package name encoding(omitted when class is registered):
- - encoding algorithm: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL`
- - Header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63`
will be used to indicate size `0~63`,
- the value `63` the size need more byte to read, the encoding will encode
`size - 63` as a varint next.
-- Class name encoding(omitted when class is registered):
- - encoding algorithm:
`UTF8/LOWER_UPPER_DIGIT_SPECIAL/FIRST_TO_LOWER_SPECIAL/ALL_TO_LOWER_SPECIAL`
- - header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63`
will be used to indicate size `0~63`,
- the value `63` the size need more byte to read, the encoding will encode
`size - 63` as a varint next.
-- Field info:
- - header(8
- bits): `3 bits size + 2 bits field name encoding + polymorphism flag +
nullability flag + ref tracking flag`.
- Users can use annotation to provide those info.
- - 2 bits field name encoding:
- - encoding: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID`
- - If tag id is used, i.e. field name is written by an unsigned varint
tag id. 2 bits encoding will be `11`.
- - size of field name:
- - The `3 bits size: 0~7` will be used to indicate length `1~7`, the
value `6` the size read more bytes,
- the encoding will encode `size - 7` as a varint next.
- - If encoding is `TAG_ID`, then num_bytes of field name will be used to
store tag id.
- - ref tracking: when set to 1, ref tracking will be enabled for this field.
- - nullability: when set to 1, this field can be null.
- - polymorphism: when set to 1, the actual type of field will be the
declared field type even the type if
- not `final`.
- - type id:
- - For registered type-consistent classes, it will be the registered class
id.
- - Otherwise it will be encoded as `OBJECT_ID` if it isn't `final` and
`FINAL_OBJECT_ID` if it's `final`. The
- meta for such types is written separately instead of inlining here is to
reduce meta space cost if object of
- this type is serialized in current object graph multiple times, and the
field value may be null too.
- - Field name: If type id is set, type id will be used instead. Otherwise
meta string encoding length and data will
- be written instead.
-
-Field order are left as implementation details, which is not exposed to
specification, the deserialization need to
-resort fields based on Fory field comparator. In this way, fory can compute
statistics for field names or types and
-using a more compact encoding.
+- `internal_type_id` is the low 8 bits describing the kind (enum/struct/ext, named variants, or a
+  built-in type).
+- `user_type_id` is the numeric registration ID (0-based) for user-defined enum/struct/ext types.
+- Named types use `NAMED_*` internal IDs and carry names in metadata rather than embedding a user
+  ID.
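As a worked example of the layout, the Java sketch below packs and unpacks a full type ID. `TypeIdSketch` is an illustrative name; the constant 25 is the `STRUCT` internal ID from the shared table in this spec.

```java
final class TypeIdSketch {
    static final int STRUCT = 25; // internal type ID of STRUCT in the shared table

    /** full_type_id = (user_type_id << 8) | internal_type_id */
    static int fullTypeId(int userTypeId, int internalTypeId) {
        return (userTypeId << 8) | (internalTypeId & 0xFF);
    }

    /** Low 8 bits: the kind of type. */
    static int internalTypeId(int fullTypeId) {
        return fullTypeId & 0xFF;
    }

    /** Remaining high bits: the numeric registration ID. */
    static int userTypeId(int fullTypeId) {
        return fullTypeId >>> 8;
    }
}
```

A struct registered with user ID 3 would carry the full type ID `(3 << 8) | 25 = 793`.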
+
+### Shared internal type IDs (0-30)
+
+Java native mode shares the xlang internal IDs for basic types and user-defined enum/struct/ext
+tags. These IDs are stable across languages.
+
+| Type ID | Name |
+| ------- | ----------------------- |
+| 0 | UNKNOWN |
+| 1 | BOOL |
+| 2 | INT8 |
+| 3 | INT16 |
+| 4 | INT32 |
+| 5 | VARINT32 |
+| 6 | INT64 |
+| 7 | VARINT64 |
+| 8 | TAGGED_INT64 |
+| 9 | UINT8 |
+| 10 | UINT16 |
+| 11 | UINT32 |
+| 12 | VAR_UINT32 |
+| 13 | UINT64 |
+| 14 | VAR_UINT64 |
+| 15 | TAGGED_UINT64 |
+| 16 | FLOAT16 |
+| 17 | FLOAT32 |
+| 18 | FLOAT64 |
+| 19 | STRING |
+| 20 | LIST |
+| 21 | SET |
+| 22 | MAP |
+| 23 | ENUM |
+| 24 | NAMED_ENUM |
+| 25 | STRUCT |
+| 26 | COMPATIBLE_STRUCT |
+| 27 | NAMED_STRUCT |
+| 28 | NAMED_COMPATIBLE_STRUCT |
+| 29 | EXT |
+| 30 | NAMED_EXT |
+
+### Java native built-in type IDs
+
+Java native serialization assigns Java-specific built-ins starting at `Types.NAMED_EXT + 1`.
+Type IDs greater than 30 are not shared with xlang; they are only valid in Java native mode.
+
+| Type ID | Name | Description |
+| ------- | -------------------------- | ------------------------------ |
+| 31 | VOID_ID | java.lang.Void |
+| 32 | CHAR_ID | java.lang.Character |
+| 33 | PRIMITIVE_VOID_ID | void |
+| 34 | PRIMITIVE_BOOL_ID | boolean |
+| 35 | PRIMITIVE_INT8_ID | byte |
+| 36 | PRIMITIVE_CHAR_ID | char |
+| 37 | PRIMITIVE_INT16_ID | short |
+| 38 | PRIMITIVE_INT32_ID | int |
+| 39 | PRIMITIVE_FLOAT32_ID | float |
+| 40 | PRIMITIVE_INT64_ID | long |
+| 41 | PRIMITIVE_FLOAT64_ID | double |
+| 42 | PRIMITIVE_BOOLEAN_ARRAY_ID | boolean[] |
+| 43 | PRIMITIVE_BYTE_ARRAY_ID | byte[] |
+| 44 | PRIMITIVE_CHAR_ARRAY_ID | char[] |
+| 45 | PRIMITIVE_SHORT_ARRAY_ID | short[] |
+| 46 | PRIMITIVE_INT_ARRAY_ID | int[] |
+| 47 | PRIMITIVE_FLOAT_ARRAY_ID | float[] |
+| 48 | PRIMITIVE_LONG_ARRAY_ID | long[] |
+| 49 | PRIMITIVE_DOUBLE_ARRAY_ID | double[] |
+| 50 | STRING_ARRAY_ID | String[] |
+| 51 | OBJECT_ARRAY_ID | Object[] |
+| 52 | ARRAYLIST_ID | java.util.ArrayList |
+| 53 | HASHMAP_ID | java.util.HashMap |
+| 54 | HASHSET_ID | java.util.HashSet |
+| 55 | CLASS_ID | java.lang.Class |
+| 56 | EMPTY_OBJECT_ID | empty object stub |
+| 57 | LAMBDA_STUB_ID | lambda stub |
+| 58 | JDK_PROXY_STUB_ID | JDK proxy stub |
+| 59 | REPLACE_STUB_ID | writeReplace/readResolve stub |
+| 60 | NONEXISTENT_META_SHARED_ID | meta-shared unknown class stub |
+
+### Registration and named types
+
+User-defined enum/struct/ext types can be registered by numeric ID or by name.
+
+- Numeric registration: `full_type_id = (user_id << 8) | internal_type_id`.
+- Name registration: type meta uses namespace and type name (see below).
+- Unregistered types are encoded as named types using namespace = package name and type name =
+  simple class name.
+
+Named type selection rules for unregistered types:
+
+- enum -> NAMED_ENUM
+- struct-like serializers -> NAMED_STRUCT (or NAMED_COMPATIBLE_STRUCT in compatible mode)
+- all other custom serializers -> NAMED_EXT
+
+## Type meta encoding
+
+Every value is written with a type ID followed by optional type metadata:
+
+1. Write `type_id` using varuint32 small7 encoding.
+2. For `NAMED_ENUM`, `NAMED_STRUCT`, `NAMED_EXT`, `NAMED_COMPATIBLE_STRUCT`:
+ - If meta share is enabled: write shared class meta (streaming format).
+ - Otherwise: write namespace and type name as meta strings.
+3. For `COMPATIBLE_STRUCT`:
+ - If meta share is enabled: write shared class meta (streaming format).
+ - Otherwise: no extra meta (type ID is sufficient).
+4. All other types: no extra meta.
+
+### Shared class meta (streaming)
+
+When meta share is enabled, Java uses the streaming shared meta protocol and writes TypeDef
+bytes inline on first use.
-### Other layers class meta
-
-Same encoding algorithm as the previous layer except:
+```
+| varuint32: index_marker | [class def bytes if new] |
-- header + package name:
- - Header:
- - If package name has been written before: `varint index + sharing
flag(set)` will be written
- - If package name hasn't been written before:
- - If meta string encoding is `LOWER_SPECIAL` and the length of encoded
string `<=` 64, then header will be
- `6 bits size + encoding flag(set) + sharing flag(unset)`.
- - Otherwise, header will
- be `3 bits unset + 3 bits encoding flags + encoding flag(unset) +
sharing flag(unset)`
+index_marker = (index << 1) | flag
+flag = 1 -> reference
+flag = 0 -> new type
+```
-## Meta String
+- If `flag == 1`, this is a reference to a previously written type. No class def bytes follow.
+- If `flag == 0`, this is a new type definition and class def bytes are written inline.
-Meta string is mainly used to encode meta strings such as class name and field
names.
+The index is assigned sequentially in the order types are first encountered.
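A minimal Java sketch of the writer-side bookkeeping for `index_marker`. `SharedMetaSketch` is a hypothetical helper, and starting the index at 0 is an assumption; the spec only states that indices are sequential in first-encounter order.

```java
import java.util.HashMap;
import java.util.Map;

final class SharedMetaSketch {
    // Types already written in this stream, mapped to their assigned index.
    private final Map<Class<?>, Integer> seen = new HashMap<>();

    /**
     * Returns the index_marker for cls. When the flag bit is 0 the caller
     * must write the class def bytes right after the marker.
     */
    int markerFor(Class<?> cls) {
        Integer index = seen.get(cls);
        if (index != null) {
            return (index << 1) | 1; // flag = 1: reference to an earlier type
        }
        int next = seen.size();      // assumed: indices start at 0
        seen.put(cls, next);
        return next << 1;            // flag = 0: new type definition
    }

    static boolean isReference(int marker) {
        return (marker & 1) == 1;
    }

    static int index(int marker) {
        return marker >>> 1;
    }
}
```

Because each definition is written at its first use, a reader can build the same index table while consuming the stream, without seeking to a separate meta section.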
-### Encoding Algorithms
+## Schema modes
-String binary encoding algorithm:
+Java native serialization supports two schema modes:
-| Algorithm                 | Pattern       | Description |
-| ------------------------- | ------------- | ----------- |
-| LOWER_SPECIAL             | `a-z._$\|`    | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101`, prepend one bit at the start to indicate whether strip last char since last byte may have 7 redundant bits(1 indicates strip last char) |
-| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._` | every char is written using 6 bits, `a-z`: `0b00000~0b11001`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `._`: `0b111110~0b111111`, prepend one bit at the start to indicate whether strip last char since last byte may have 7 redundant bits(1 indicates strip last char) |
-| UTF-8                     | any chars     | UTF-8 encoding |
+- Schema consistent (compatible mode disabled): fields are serialized in a fixed order and no
+  ClassDef is required. Type meta uses `STRUCT` or `NAMED_STRUCT` for user-defined classes.
+- Schema evolution (compatible mode enabled): fields are serialized with schema evolution metadata
+  (ClassDef). Type meta uses `COMPATIBLE_STRUCT` or `NAMED_COMPATIBLE_STRUCT`.
-Encoding flags:
+## ClassDef format (compatible mode)
-| Encoding Flag             | Pattern | Encoding Algorithm |
-| ------------------------- | ------- | ------------------ |
-| LOWER_SPECIAL             | every char is in `a-z._$\|` | `LOWER_SPECIAL` |
-| FIRST_TO_LOWER_SPECIAL    | every char is in `a-z[c1,c2]` except first char is upper case | replace first upper case char to lower case, then use `LOWER_SPECIAL` |
-| ALL_TO_LOWER_SPECIAL      | every char is in `a-zA-Z[c1,c2]` | replace every upper case char by `\|` + `lower case`, then use `LOWER_SPECIAL`, use this encoding if it's smaller than Encoding `LOWER_UPPER_DIGIT_SPECIAL` |
-| LOWER_UPPER_DIGIT_SPECIAL | every char is in `a-zA-Z[c1,c2]` | use `LOWER_UPPER_DIGIT_SPECIAL` encoding if it's smaller than Encoding `FIRST_TO_LOWER_SPECIAL` |
-| UTF8                      | any utf-8 char | use `UTF-8` encoding |
-| Compression               | any utf-8 char | lossless compression |
+ClassDef is the schema evolution metadata encoded for compatible structs. It is written inline
+when shared meta is enabled, or referenced by index when already seen.
-Notes:
+### Binary layout
-- For package name encoding, `c1,c2` should be `._`; For field/type name
encoding, `c1,c2` should be `_$`;
-- Depending on cases, one can choose encoding `flags + data` jointly, uses 3
bits of first byte for flags and other
- bytes
- for data.
+```
+| 8 bytes header | [varuint32 extra size] | class meta bytes |
+```
-### Shared meta string
+Header layout (lower bits on the right):
-The shared meta string format consists of header and encoded string binary.
Header of encoded string binary will be
-inlined
-in shared meta header.
+```
+| 50-bit hash | 1 bit compress | 1 bit has_fields_meta | 12-bit size |
+```
-Header is written using little endian order, Fory can read this flag first to
determine how to deserialize the data.
+- size: lower 12 bits. If the size field equals the mask (0xFFF), the remainder `size - 0xFFF` is written as a varuint32 after the header and added back on read.
+- compress: set when payload is compressed.
+- has_fields_meta: set when field metadata is present.
+- hash: 50-bit hash of the payload and flags.
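The bit positions implied by this layout (size in bits 0-11, `has_fields_meta` in bit 12, compress in bit 13, hash in bits 14-63) can be sketched in Java as below. `ClassDefHeaderSketch` is an illustrative name; how the 50-bit hash itself is computed is not covered here.

```java
final class ClassDefHeaderSketch {
    static final int SIZE_MASK = 0xFFF;           // bits 0-11
    static final long HAS_FIELDS_META = 1L << 12; // bit 12
    static final long COMPRESSED = 1L << 13;      // bit 13
    static final int HASH_SHIFT = 14;             // bits 14-63 hold the hash
    static final long HASH_MASK = (1L << 50) - 1;

    static long pack(long hash50, boolean compressed, boolean hasFieldsMeta, int size) {
        long header = (hash50 & HASH_MASK) << HASH_SHIFT;
        if (compressed) {
            header |= COMPRESSED;
        }
        if (hasFieldsMeta) {
            header |= HAS_FIELDS_META;
        }
        // Sizes at or above the mask saturate the 12-bit field.
        return header | Math.min(size, SIZE_MASK);
    }

    /** Extra size to write as varuint32 after the header, or -1 if none is written. */
    static int extraSize(int size) {
        return size >= SIZE_MASK ? size - SIZE_MASK : -1;
    }

    static int sizeField(long header) { return (int) (header & SIZE_MASK); }
    static boolean hasFieldsMeta(long header) { return (header & HAS_FIELDS_META) != 0; }
    static boolean isCompressed(long header) { return (header & COMPRESSED) != 0; }
    static long hash(long header) { return header >>> HASH_SHIFT; }
}
```

A reader recovers the full size as `sizeField(header)` plus the extra varuint32 whenever the size field equals `0xFFF`.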
-#### Write by data
+### Class meta bytes
-If string hasn't been written before, the data will be written as follows:
+Class meta encodes a linearized class hierarchy (from parent to leaf) and field metadata:
```
-| unsigned varint: string binary size + 1 bit: not written before | 56 bits:
unique hash | 3 bits encoding flags + string binary |
-```
+| num_classes | class_layer_0 | class_layer_1 | ... |
-If string binary size is less than `16` bytes, the hash will be omitted to
save spaces. Unique hash can be omitted too
-if caller pass a flag to disable it. In such cases, the format will be:
-
-```
-| unsigned varint: string binary size + 1 bit: not written before | 3 bits
encoding flags + string binary |
+class_layer:
+| (num_fields << 1) + registered_flag | [type_id if registered] |
+| namespace | type_name | field_infos |
```
-#### Write by ref
+- `num_classes` stores `(num_layers - 1)` in a single byte.
+ - If it equals `0b1111`, read an extra varuint32 small7 and add it.
+ - The actual number of layers is `num_classes + 1`.
+- `registered_flag` is 1 if the class is registered by numeric ID.
+- If registered by ID, the class type ID follows (varuint32 small7).
+- If registered by name or unregistered, namespace and type name are written as meta strings.
-If string has been written before, the data will be written as follows:
+### Field info
+
+Each field uses a compact header followed by its name bytes (omitted when TAG_ID is used) and its
+type info:
```
-| unsigned varint: written string id + 1 bit: written before |
+| field_header | [field_name_bytes] | field_type |
```
-## Value Format
+`field_header` bits:
-### Basic types
+- bit 0: trackingRef
+- bit 1: nullable
+- bits 2-3: field name encoding
+- bits 4-6: name length (len-1), or tag ID when TAG_ID is used; value 7 indicates extended length
+- bit 7: reserved (0)
-#### Bool
+Field name encoding:
-- size: 1 byte
-- format: 0 for `false`, 1 for `true`
+- 0: UTF8
+- 1: ALL_TO_LOWER_SPECIAL
+- 2: LOWER_UPPER_DIGIT_SPECIAL
+- 3: TAG_ID (field name omitted, tag ID stored in size field)
-#### Byte
+If length is extended (size==7), an extra varuint32 small7 storing `(len-1) - 7` follows.
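The `field_header` bit assignments can be sketched in Java as follows. `FieldHeaderSketch` is a hypothetical helper that covers the name-length case; how tag IDs of 7 or more are extended is not spelled out here, so the sketch makes no claim about it.

```java
final class FieldHeaderSketch {
    static final int UTF8 = 0;
    static final int ALL_TO_LOWER_SPECIAL = 1;
    static final int LOWER_UPPER_DIGIT_SPECIAL = 2;
    static final int TAG_ID = 3;

    /**
     * sizeValue is (encoded name length - 1). Values of 7 or more saturate
     * the 3-bit field at 7, which marks an extended length.
     */
    static int header(boolean trackingRef, boolean nullable, int encoding, int sizeValue) {
        int sizeBits = Math.min(sizeValue, 7);
        return (sizeBits << 4)            // bits 4-6
            | ((encoding & 0b11) << 2)    // bits 2-3
            | (nullable ? 0b10 : 0)       // bit 1
            | (trackingRef ? 1 : 0);      // bit 0
    }

    /** Value of the trailing varuint32 small7, or -1 when the 3-bit field suffices. */
    static int extendedLength(int sizeValue) {
        return sizeValue >= 7 ? sizeValue - 7 : -1;
    }
}
```

For a 5-byte UTF8 name with reference tracking on and nullable off, the header is `(4 << 4) | 1 = 65`; a 10-byte name saturates the size bits at 7 and is followed by the extra value 2.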
-- size: 1 byte
-- format: write as pure byte.
+### Field type encoding
-#### Short
+Field types are encoded with a type tag and optional nested type info. For nested types, the header
+includes nullable/trackingRef flags in the low bits.
+Top-level field types use the tag only (no flags).
-- size: 2 byte
-- byte order: little endian order
+Type tags:
-#### Char
+| Tag | Field type |
+| --- | ----------------------------------------- |
+| 0 | Object (ObjectFieldType) |
+| 1 | Map (MapFieldType) |
+| 2 | Collection/List/Set (CollectionFieldType) |
+| 3 | Array (ArrayFieldType) |
+| 4 | Enum (EnumFieldType) |
+| 5+ | Registered type (RegisteredFieldType) |
-- size: 2 byte
-- byte order: little endian order
+Encoding rules:
-#### Unsigned int
+- ObjectFieldType: write tag 0.
+- MapFieldType: write tag 1, then key type, then value type.
+- CollectionFieldType: write tag 2, then element type.
+- ArrayFieldType: write tag 3, then dimensions, then component type.
+- EnumFieldType: write tag 4.
+- RegisteredFieldType: write tag `5 + type_id`.
-- size: 1~5 byte
-- Format: The most significant bit (MSB) in every byte indicates whether to
have the next byte. If first bit is set
- i.e. `b & 0x80 == 0x80`, then
- the next byte should be read until the first bit of the next byte is unset.
+For nested types, nullable/trackingRef flags are stored in the low bits of the header as
+`(type_tag << 2) | (nullable << 1) | tracking_ref`.
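A short Java sketch of the nested-type header, using the tag values from the table above. `NestedFieldTypeSketch` is an illustrative name, and the example uses `STRING` (type ID 19) as the registered type.

```java
final class NestedFieldTypeSketch {
    static final int OBJECT = 0;
    static final int MAP = 1;
    static final int COLLECTION = 2;
    static final int ARRAY = 3;
    static final int ENUM = 4;
    static final int REGISTERED_BASE = 5;

    /** Registered types use tag 5 + type_id. */
    static int registeredTag(int typeId) {
        return REGISTERED_BASE + typeId;
    }

    /** (type_tag << 2) | (nullable << 1) | tracking_ref */
    static int nestedHeader(int typeTag, boolean nullable, boolean trackingRef) {
        return (typeTag << 2) | (nullable ? 0b10 : 0) | (trackingRef ? 1 : 0);
    }

    static int typeTag(int header) { return header >>> 2; }
    static boolean nullable(int header) { return (header & 0b10) != 0; }
    static boolean trackingRef(int header) { return (header & 1) != 0; }
}
```

A nullable nested map without reference tracking gets header `(1 << 2) | 2 = 6`, while a nested `STRING` element with tracking enabled gets `(24 << 2) | 1 = 97`.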
-#### Signed int
+## Meta string encoding
-- size: 1~5 byte
-- Format: First convert the number into positive unsigned int by `(v << 1) ^
(v >> 31)` ZigZag algorithm, then encoding
- it as an unsigned int.
+Namespace, type names, and field names use the same meta string encodings as the xlang spec.
-#### Unsigned long
+### Package and type names
-- size: 1~9 byte
-- Fory PVL(Progressive Variable-length Long) Encoding:
- - positive long format: first bit in every byte indicates whether to have
the next byte. If first bit is set
- i.e. `b & 0x80 == 0x80`, then the next byte should be read until the first
bit is unset.
+Header format:
-#### Signed long
+```
+| 6 bits size | 2 bits encoding |
+```
-- size: 1~9 byte
-- Fory SLI(Small long as int) Encoding:
- - If long is in [-1073741824, 1073741823], encode as 4 bytes int: `|
little-endian: ((int) value) << 1 |`
- - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |`
-- Fory PVL(Progressive Variable-length Long) Encoding:
- - First convert the number into positive unsigned long by `(v << 1) ^ (v >>
63)` ZigZag algorithm to reduce cost of
- small negative numbers, then encoding it as an unsigned long.
+- size is the byte length of the encoded name.
+- If the 6-bit size field equals 63, the name is 63 bytes or longer: write the remainder `(length - 63)` as varuint32 small7.
-#### Float
+Encodings:
-- size: 4 byte
-- format: convert float to 4 bytes int by `Float.floatToRawIntBits`, then
write as binary by little endian order.
+- Package name: UTF8, ALL_TO_LOWER_SPECIAL, LOWER_UPPER_DIGIT_SPECIAL
+- Type name: UTF8, LOWER_UPPER_DIGIT_SPECIAL, FIRST_TO_LOWER_SPECIAL, ALL_TO_LOWER_SPECIAL
-#### Double
+### Field names
-- size: 8 byte
-- format: convert double to 8 bytes int by `Double.doubleToRawLongBits`, then
write as binary by little endian order.
+Field name encoding is described in the ClassDef field header section. When using TAG_ID, the
+field name bytes are omitted and the tag ID is stored in the size field.
-### String
+### Encoding algorithms
-Format:
+See the xlang specification for encoding algorithms and tables:
+`docs/specification/xlang_serialization_spec.md#meta-string`.
-```
-| header: size << 2 | 2 bits encoding flags | binary data |
-```
+## Value encodings
-- `size + encoding` will be concat as a long and encoded as an unsigned var
long. The little 2 bits is used for
- encoding:
- 0 for `latin`, 1 for `utf-16`, 2 for `utf-8`.
-- encoded string binary data based on encoding: `latin/utf-16/utf-8`.
+This section describes the byte layouts for common built-in serializers used in Java native
+serialization. Custom serializers (EXT) may define additional formats but must still follow the
+reference and type meta rules described above.
-Which encoding to choose:
+### Primitives
-- For JDK8: fory detect `latin` at runtime, if string is `latin` string, then
use `latin` encoding, otherwise
- use `utf-16`.
-- For JDK9+: fory use `coder` in `String` object for encoding,
`latin`/`utf-16` will be used for encoding.
-- If the string is encoded by `utf-8`, then fory will use `utf-8` to decode
the data. But currently fory doesn't enable
- utf-8 encoding by default for java. Cross-language string serialization of
fory uses `utf-8` by default.
+- boolean: 1 byte (0x00 or 0x01).
+- byte: 1 byte.
+- short: 2 bytes little endian.
+- char: 2 bytes little endian (UTF-16 code unit).
+- int:
+ - fixed: 4 bytes little endian.
+ - varint: signed varint32 (ZigZag) when `compressInt` is enabled.
+- long:
+ - fixed: 8 bytes little endian.
+ - varint: signed varint64 (ZigZag) when `longEncoding=VARINT`.
+ - tagged: tagged int64 when `longEncoding=TAGGED`.
+- float: IEEE 754 float32, little endian.
+- double: IEEE 754 float64, little endian.
-### Collection
+Varint encodings follow the xlang spec:
+`docs/specification/xlang_serialization_spec.md#unsigned-varint32`.
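For orientation, here is a Java sketch of the general ZigZag plus 7-bit continuation scheme that the previous revision of this spec described for signed ints (`(v << 1) ^ (v >> 31)`, then an unsigned varint whose high bit marks a following byte). The xlang spec remains the authority on the exact encoding; `VarintSketch` and its methods are illustrative names.

```java
import java.io.ByteArrayOutputStream;

final class VarintSketch {
    /** Maps small negative and positive ints to small unsigned values. */
    static int zigZag(int v) {
        return (v << 1) ^ (v >> 31);
    }

    static int unZigZag(int z) {
        return (z >>> 1) ^ -(z & 1);
    }

    /** Writes 7 bits per byte, low bits first; the high bit marks "more bytes follow". */
    static byte[] writeVarUint32(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    static int readVarUint32(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear: last byte
            }
            shift += 7;
        }
        return result;
    }
}
```

ZigZag keeps small magnitudes short in both directions: `-1` maps to `1` and `-64` maps to `127`, so both still fit in a single byte.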
-> All collection serializers must extend `CollectionLikeSerializer`.
+### String
-Format:
+Strings are encoded as:
```
-length(unsigned varint) | collection header | elements header | elements data
+| varuint36_small: (num_bytes << 2) + coder | string bytes |
```
-#### Collection header
-
-- For `ArrayList/LinkedArrayList/HashSet/LinkedHashSet`, this will be empty.
-- For `TreeSet`, this will be `Comparator`
-- For subclass of `ArrayList`, this may be extra object field info.
+- coder: 2-bit value
+ - 0: LATIN1
+ - 1: UTF16
+ - 2: UTF8
+- num_bytes: byte length of the encoded string payload.
-#### Elements header
+UTF16 is encoded as little endian 2-byte code units.
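A minimal Java sketch of computing the string header value and payload. Choosing LATIN1 when every char fits in one byte and UTF16 otherwise is an assumption made for the example (the previous revision of the spec described that choice for Java); `StringHeaderSketch` is an illustrative name, and writing the header as a variable-length integer is left out.

```java
import java.nio.charset.StandardCharsets;

final class StringHeaderSketch {
    static final int LATIN1 = 0;
    static final int UTF16 = 1;
    static final int UTF8 = 2;

    static boolean isLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    /** Payload bytes: one byte per char for LATIN1, little endian code units for UTF16. */
    static byte[] payload(String s) {
        return isLatin1(s)
            ? s.getBytes(StandardCharsets.ISO_8859_1)
            : s.getBytes(StandardCharsets.UTF_16LE);
    }

    /** Header value: byte length in the high bits, 2-bit coder in the low bits. */
    static long header(String s) {
        int coder = isLatin1(s) ? LATIN1 : UTF16;
        return ((long) payload(s).length << 2) | coder;
    }
}
```

For `"abc"` the header value is `(3 << 2) | 0 = 12`; for a two-character CJK string it is `(4 << 2) | 1 = 17`.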
-In most cases, all collection elements are same type and not null, elements
header will encode those homogeneous
-information to avoid the cost of writing it for every element. Specifically,
there are four kinds of information
-which will be encoded by elements header, each use one bit:
-
-- If track elements ref, use the first bit `0b1` of the header to flag it.
-- If the collection has null, use the second bit `0b10` of the header to flag
it. If ref tracking is enabled for this
- element type, this flag is invalid.
-- If the collection element types are the declared type, use the 3rd bit
`0b100` of the header to flag it.
-- If the collection element types are same, use the 4th bit `0b1000` header to
flag it.
+### Enum
-By default, all bits are unset, which means all elements won't track ref, all
elements are same type, not null and
-the actual element is the declared type in the custom class field.
+- If `serializeEnumByName` is enabled: write enum name as a meta string.
+- Otherwise: write enum ordinal as varuint32 small7.
-The implementation can generate different deserialization code based read
header, and look up the generated code from a
-linear map/list.
+### Binary (byte[])
-#### Elements data
+Primitive byte arrays are encoded as:
-Based on the elements header, the serialization of elements data may skip `ref
flag`/`null flag`/`element class info`.
+```
+| varuint32: num_bytes | raw bytes |
+```
-`CollectionSerializer#write/read` can be taken as an example.
+### Primitive arrays
-### Array
+Primitive arrays use `writePrimitiveArrayWithSize` unless compression is
enabled:
-#### Primitive array
+```
+| varuint32: byte_length | raw bytes |
+```
-Primitive array are taken as a binary buffer, serialization will just write
the length of array size as an unsigned int,
-then copy the whole buffer into the stream.
+- `compressIntArray`: int[] encoded as `| varuint32: length | varint32... |`.
+- `compressLongArray`: long[] encoded as `| varuint32: length |
varint64/tagged... |`.
-Such serialization won't compress the array. If users want to compress
primitive array, users need to register custom
-serializers for such types.
+### Object arrays
-#### Object array
+Object arrays encode length and a monomorphic flag:
-Object array is serialized using the collection format. Object component type
will be taken as collection element
-generic
-type.
+```
+| varuint32_small7: (length << 1) | mono_flag |
+```
-### Map
+- If `mono_flag == 1`, all elements share a known component serializer. Each
element uses ref
+ flags and the component serializer writes the value.
+- If `mono_flag == 0`, each element uses ref flags and writes its own class
info and data.
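A small sketch of the array header packing described above; the function names are hypothetical, and the varuint32_small7 encoding of the resulting integer is left out.

```python
def pack_array_header(length, mono):
    """Combine element count and monomorphic flag: (length << 1) | mono_flag."""
    return (length << 1) | int(mono)


def unpack_array_header(header):
    """Return (length, mono_flag) from a packed header."""
    return header >> 1, bool(header & 1)
```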
-> All Map serializers must extend `MapLikeSerializer`.
+### Collections (List/Set)
-Format:
+Collections encode length and a one-byte elements header:
```
-| length(unsigned varint) | map header | key value pairs data |
+| varuint32_small7: length | elements_header | [elem_class_info] | elements...
|
```
-#### Map header
+`elements_header` bits (see `CollectionFlags`):
-- For `HashMap/LinkedHashMap`, this will be empty.
-- For `TreeMap`, this will be `Comparator`
-- For other `Map`, this may be extra object field info.
+- bit 0: TRACKING_REF
+- bit 1: HAS_NULL
+- bit 2: IS_DECL_ELEMENT_TYPE
+- bit 3: IS_SAME_TYPE
-#### Map Key-Value data
+If `IS_SAME_TYPE` is set and `IS_DECL_ELEMENT_TYPE` is not set, the element
class info is written
+once before the elements. Element values then follow with either ref flags (if
TRACKING_REF) or
+per-element null flags (if HAS_NULL).
-Map iteration is too expensive, Fory won't compute the header like for
collection before since it introduce
-[considerable overhead](https://github.com/apache/fory/issues/925).
-Users can use `MapFieldInfo` annotation to provide header in advance.
Otherwise Fory will use first key-value pair to
-predict header optimistically, and update the chunk header if the prediction
failed at some pair.
+If `IS_SAME_TYPE` is not set, each element is written with its own class info
and data (and
+optionally ref flags).
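The following sketch shows one way the header bits could be derived from a collection. It is deliberately simplified and is an assumption-laden illustration rather than the logic of `CollectionFlags` users in Fory: `declared_type` and `tracking_ref` are hypothetical parameters, and Python's `type()` stands in for the element class.

```python
TRACKING_REF = 0b0001
HAS_NULL = 0b0010
IS_DECL_ELEMENT_TYPE = 0b0100
IS_SAME_TYPE = 0b1000


def elements_header(elements, declared_type=None, tracking_ref=False):
    """Compute a one-byte elements header for a list of values (simplified)."""
    header = 0
    if tracking_ref:
        header |= TRACKING_REF
    if any(e is None for e in elements):
        header |= HAS_NULL
    types = {type(e) for e in elements if e is not None}
    if len(types) == 1:
        header |= IS_SAME_TYPE
        if types == {declared_type}:
            header |= IS_DECL_ELEMENT_TYPE
    return header


def writes_shared_class_info(header):
    """Class info is written once when the type is shared but not the declared one."""
    return bool(header & IS_SAME_TYPE) and not (header & IS_DECL_ELEMENT_TYPE)
```

With this sketch, a list of integers whose declared element type is also integer gets `IS_SAME_TYPE | IS_DECL_ELEMENT_TYPE` and needs no class info, while the same list without a declared type gets `IS_SAME_TYPE` alone and writes the class info once.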
-Fory will serialize map chunk by chunk, every chunk has 127 pairs at most.
+### Maps
-```
-| 1 byte | 1 byte | variable bytes |
-+----------------+----------------+-----------------+
-| KV header | chunk size: N | N*2 objects |
-```
-
-KV header:
-
-- If track key ref, use the first bit `0b1` of the header to flag it.
-- If the key has null, use the second bit `0b10` of the header to flag it. If
ref tracking is enabled for this
- key type, this flag is invalid.
-- If the actual key type of map is the declared key type, use the 3rd bit
`0b100` of the header to flag it.
-- If track value ref, use the 4th bit `0b1000` of the header to flag it.
-- If the value has null, use the 5th bit `0b10000` of the header to flag it.
If ref tracking is enabled for this
- value type, this flag is invalid.
-- If the value type of map is the declared value type, use the 6rd bit
`0b100000` of the header to flag it.
-- If key or value is null, that key and value will be written as a separate
chunk, and chunk size writing will be
- skipped too.
-
-If streaming write is enabled, which means Fory can't update written `chunk
size`. In such cases, map key-value data
-format will be:
+Maps encode entry count and then a sequence of chunks. Each chunk groups
entries that share key
+and value types.
```
-| 1 byte | variable bytes |
-+----------------+-----------------+
-| KV header | N*2 objects |
-```
-
-`KV header` will be a header marked by `MapFieldInfo` in java. The
implementation can generate different deserialization
-code based read header, and look up the generated code from a linear map/list.
+| varuint32_small7: size | chunk_1 | chunk_2 | ... |
-### Enum
+chunk (non-null entries):
+| header | chunk_size | [key_class_info] | [value_class_info] | entries... |
+```
-Enums are serialized as an unsigned var int. If the order of enum values
change, the deserialized enum value may not be
-the value users expect. In such cases, users must register enum serializer by
make it write enum value as an enumerated
-string with unique hash disabled.
+`header` bits (see `MapFlags`):
-### Object
+- bit 0: TRACKING_KEY_REF
+- bit 1: KEY_HAS_NULL
+- bit 2: KEY_DECL_TYPE
+- bit 3: TRACKING_VALUE_REF
+- bit 4: VALUE_HAS_NULL
+- bit 5: VALUE_DECL_TYPE
-Object means object of `pojo/struct/bean/record` type.
-Object will be serialized by writing its fields data in fory order.
+If `KEY_DECL_TYPE` or `VALUE_DECL_TYPE` is unset, the corresponding class info
is written once at
+the start of the chunk. `chunk_size` is a single byte (1..255) and
`MAX_CHUNK_SIZE` is 255.
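To make the chunking rule concrete, the sketch below groups consecutive entries that share key and value types and caps each group at 255 entries. It is an illustrative assumption of how grouping could work, not Fory's serializer: the function name is hypothetical, Python's `type()` stands in for the key/value class, and entries with a null key or value are assumed to have been filtered out beforehand.

```python
MAX_CHUNK_SIZE = 255


def split_into_chunks(entries):
    """Group consecutive (key, value) pairs by (key type, value type), max 255 each."""
    chunks = []
    current = []
    current_types = None
    for key, value in entries:
        types = (type(key), type(value))
        # Start a new chunk when the types change or the size byte would overflow.
        if current and (types != current_types or len(current) == MAX_CHUNK_SIZE):
            chunks.append(current)
            current = []
        current_types = types
        current.append((key, value))
    if current:
        chunks.append(current)
    return chunks
```

A map of 300 same-typed entries would therefore be emitted as two chunks of 255 and 45 entries in this sketch.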
-Depending on schema compatibility, objects will have different formats.
+#### Null key/value entries
-#### Field order
+Entries with null key or null value are encoded as special single-entry chunks
without a
+`chunk_size` byte:
-Field will be ordered as following, every group of fields will have its own
order:
+- null key, non-null value: `NULL_KEY_VALUE_DECL_TYPE*` flags, then value
payload
+- null value, non-null key: `NULL_VALUE_KEY_DECL_TYPE*` flags, then key payload
+- null key and null value: `KV_NULL` header only
-- primitive fields: larger size type first, smaller later, variable size type
last.
-- boxed primitive fields: same order as primitive fields
-- final fields: same type together, then sorted by field name
lexicographically.
-- collection fields: same order as final fields
-- map fields: same order as final fields
-- other fields: same order as final fields
+These chunks always represent exactly one entry.
-#### Schema consistent
+### Objects and structs
-Object fields will be serialized one by one using following format:
+Object values are encoded as:
```
-Primitive field value:
-| var bytes |
-+----------------+
-| value data |
-+----------------+
-Boxed field value:
-| one byte | var bytes |
-+-----------+---------------+
-| null flag | field value |
-+-----------+---------------+
-field value of final type with ref tracking:
-| var bytes | var objects |
-+-----------+-------------+
-| ref meta | value data |
-+-----------+-------------+
-field value of final type without ref tracking:
-| one byte | var objects |
-+-----------+-------------+
-| null flag | field value |
-+-----------+-------------+
-field value of non-final type with ref tracking:
-| one byte | var bytes | var objects |
-+-----------+-------------+-------------+
-| ref meta | class meta | value data |
-+-----------+-------------+-------------+
-field value of non-final type without ref tracking:
-| one byte | var bytes | var objects |
-+-----------+------------+------------+
-| null flag | class meta | value data |
-+-----------+------------+------------+
+| ref meta | type meta | field data |
```
-#### Schema evolution
-
-Schema evolution have similar format as schema consistent mode for object
except:
+Field data is written by the serializer selected by the class info. For
standard object
+serialization:
-- For this object type itself, `schema consistent` mode will write class by
id/name, but `schema evolution` mode will
- write class field names, types and other meta too, see [Class
meta](#class-meta).
-- Class meta of `final custom type` needs to be written too, because peers may
not have this class defined.
+- Fields are sorted deterministically using `DescriptorGrouper` order:
+ primitives, boxed primitives, built-ins, collections, maps, then other
fields, with names sorted
+ within each category.
+- For compatible mode, `MetaSharedSerializer` uses ClassDef field metadata to
read and skip
+ unknown fields.
+- For each field, the serializer uses field metadata (nullable, trackingRef,
polymorphic) to decide
+ whether to write ref flags and/or type meta before the field value.
-### Class
+### Extensions (EXT)
-Class will be serialized using class meta format.
+Extension types are encoded by their registered serializer. Type meta is still
written before the
+value as described above. The serializer is responsible for the value layout.
-## Implementation guidelines
+## Out-of-band buffers
-- Try to merge multiple bytes into an int/long write before writing to reduce
memory IO and bound check cost.
-- Read multiple bytes as an int/long, then split into multiple bytes to reduce
memory IO and bound check cost.
-- Try to use one varint/long to write flags and length together to save one
byte cost and reduce memory io.
-- Condition branches are less expensive compared to memory IO cost unless
there are too many branches.
+When a `BufferCallback` is provided, the oob flag is set in the header and
serializers may emit
+buffer references instead of inline bytes (for example, large primitive
arrays). The out-of-band
+buffer protocol is specific to the callback implementation; the main stream
only contains
+references to those buffers.
diff --git a/docs/specification/xlang_serialization_spec.md
b/docs/specification/xlang_serialization_spec.md
index 23623bd37..5882cffc6 100644
--- a/docs/specification/xlang_serialization_spec.md
+++ b/docs/specification/xlang_serialization_spec.md
@@ -148,10 +148,13 @@ Such information can be provided in other languages too:
### Type ID
-All internal data types are expressed using an ID in range `0~64`. Users can
use IDs in range `0~8192` for registering their
-custom types (struct/ext/enum). User type IDs are in a separate namespace and
combined with internal type IDs via bit shifting:
+All internal data types use an 8-bit internal ID (`0~255`, with `0~50` defined
here). Users can
+register types by numeric ID (`0~4095` in current implementations). User IDs
are encoded together
+with the internal type ID:
`(user_type_id << 8) | internal_type_id`.
+Named types (`NAMED_*`) do not embed a user ID; their names are carried in
metadata instead.
+
#### Internal Type ID Table
| Type ID | Name | Description
|
@@ -250,9 +253,9 @@ The data are serialized using little endian byte order for
all types.
Fory header format for xlang serialization:
```
-| 1 byte bitmap | 1 byte | optional 4 bytes
|
-+--------------------------------+------------+------------------------------------+
-| 4 bits reserved | 4 bits meta | language | unsigned int for meta start
offset |
+| 1 byte bitmap | 1 byte |
++--------------------------------+------------+
+| flags | language |
```
Detailed byte layout:
@@ -264,7 +267,6 @@ Byte 0: Bitmap flags
- Bit 2: oob flag (0x04)
- Bits 3-7: reserved
Byte 1: Language ID (only present when xlang flag is set)
-Byte 2-5: Meta start offset (only present when meta share mode is enabled)
```
- **null flag** (bit 0): 1 when object is null, 0 otherwise. If an object is
null, only this flag is set.
@@ -288,10 +290,6 @@ All data is encoded in little-endian format.
| RUST | 6 |
| DART | 7 |
-### Meta Start Offset
-
-If compatible mode is enabled, an uncompressed unsigned int32 (4 bytes, little
endian) is appended to indicate the start offset of metadata. During
serialization, this is initially written as a placeholder (e.g., `-1` or `0`),
then updated after all objects are serialized and metadata is collected.
-
## Reference Meta
Reference tracking handles whether the object is null, and whether to track
reference for the object by writing
@@ -408,254 +406,151 @@ explicit smart pointers (`Rc`, `Arc`).
## Type Meta
-For every type to be serialized, it have a type id to indicate its type.
-
-- basic types: the type id
-- enum:
- - `Type.ENUM` + registered id
- - `Type.NAMED_ENUM` + registered namespace+typename
-- list: `Type.List`
-- set: `Type.SET`
-- map: `Type.MAP`
-- ext:
- - `Type.EXT` + registered id
- - `Type.NAMED_EXT` + registered namespace+typename
-- struct:
- - `Type.STRUCT` + struct meta
- - `Type.NAMED_STRUCT` + struct meta
-
-Every type must be registered with an ID or name first. The registration can
be used for security check and type
-identification.
+Every non-primitive value begins with a type ID that identifies its concrete
type. The type ID is
+followed by optional type-specific metadata.
-Struct is a special type, depending whether schema compatibility is enabled,
Fory will write struct meta
-differently.
+### Type ID encoding
-Only ext/enum/struct can be registered using namespaced type.
+- The type ID is written as an unsigned varint32 (small7).
+- Internal types use their internal type ID directly (low 8 bits).
+- User-registered types use a full type ID: `(user_type_id << 8) |
internal_type_id`.
+ - `user_type_id` is a numeric ID (0-4095 in current implementations).
+ - `internal_type_id` is one of `ENUM`, `STRUCT`, `COMPATIBLE_STRUCT`, or
`EXT`.
+- Named types do not embed a user ID. They use `NAMED_*` internal type IDs and
carry a namespace
+ and type name (or shared TypeDef) instead.
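The full type ID arithmetic can be sketched as below. The function names are hypothetical, and the `0x0F` used in the usage note is an arbitrary example value, not a claim about any entry in the Internal Type ID Table.

```python
MAX_USER_TYPE_ID = 4095  # upper bound stated for current implementations


def full_type_id(user_type_id, internal_type_id):
    """Combine a user ID and an 8-bit internal ID: (user_type_id << 8) | internal."""
    if not 0 <= internal_type_id <= 0xFF:
        raise ValueError("internal type ID must fit in 8 bits")
    if not 0 <= user_type_id <= MAX_USER_TYPE_ID:
        raise ValueError("user type ID out of range")
    return (user_type_id << 8) | internal_type_id


def split_type_id(type_id):
    """Return (user_type_id, internal_type_id) from a full type ID."""
    return type_id >> 8, type_id & 0xFF
```

For instance, `full_type_id(1, 0x0F)` gives `0x10F`, and `split_type_id(0x10F)` recovers `(1, 0x0F)`.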
-### Struct Schema consistent
+### Type meta payload
-- If schema consistent mode is enabled globally when creating fory, type meta
will be written as a fory unsigned varint
- of `type_id`. Schema evolution related meta will be ignored.
-- If schema evolution mode is enabled globally when creating fory, and current
class is configured to use schema
- consistent mode like `struct` vs `table` in flatbuffers:
- - Type meta will be add to `captured_type_defs`: `captured_type_defs[type
def stub] = map size` ahead when
- registering type.
- - Get index of the meta in `captured_type_defs`, write that index as `|
unsigned varint: index |`.
+After the type ID:
-### Struct Schema evolution
+- **ENUM / STRUCT / EXT**: no extra bytes (registration by ID required on both
sides).
+- **COMPATIBLE_STRUCT**:
+ - If meta share is enabled, write a shared TypeDef entry (see below).
+ - If meta share is disabled, no extra bytes.
+- **NAMED_ENUM / NAMED_STRUCT / NAMED_COMPATIBLE_STRUCT / NAMED_EXT**:
+ - If meta share is disabled, write `namespace` and `type_name` as meta
strings.
+ - If meta share is enabled, write a shared TypeDef entry (see below).
+- **LIST / SET / MAP / ARRAY / primitives**: no extra bytes at this layer.
-If schema evolution mode is enabled globally when creating fory, and enabled
for current type, type meta will be written
-using one of the following mode. Which mode to use is configured when creating
fory.
+Unregistered types are serialized as named types:
-- Normal mode(meta share not enabled):
- - If type meta hasn't been written before, add `type def`
- to `captured_type_defs`: `captured_type_defs[type def] = map size`.
- - Get index of the meta in `captured_type_defs`, write that index as `|
unsigned varint: index |`.
- - After finished the serialization of the object graph, fory will start to
write `captured_type_defs`:
- - Firstly, set current to `meta start offset` of fory header
- - Then write `captured_type_defs` one by one:
+- Enums -> `NAMED_ENUM`
+- Struct-like classes -> `NAMED_STRUCT` (or `NAMED_COMPATIBLE_STRUCT` when
meta share is enabled)
+- Custom extension types -> `NAMED_EXT`
- ```python
- buffer.write_var_uint32(len(writting_type_defs) -
len(schema_consistent_type_def_stubs))
- for type_meta in writting_type_defs:
- if not type_meta.is_stub():
- type_meta.write_type_def(buffer)
- writing_type_defs = copy(schema_consistent_type_def_stubs)
- ```
+The namespace is the package/module name and the type name is the simple class
name.
-- Meta share mode: the writing steps are same as the normal mode, but
`captured_type_defs` will be shared across
- multiple serializations of different objects. For example, suppose we have a
batch to serialize:
+### Shared Type Meta (streaming)
- ```python
- captured_type_defs = {}
- stream = ...
- # add `Type1` to `captured_type_defs` and write `Type1`
- fory.serialize(stream, [Type1()])
- # add `Type2` to `captured_type_defs` and write `Type2`, `Type1` is written
before.
- fory.serialize(stream, [Type1(), Type2()])
- # `Type1` and `Type2` are written before, no need to write meta.
- fory.serialize(stream, [Type1(), Type2()])
- ```
+When meta share is enabled, TypeDef metadata is written inline the first time
a type is
+encountered, and subsequent occurrences only reference it.
-- Streaming mode(streaming mode doesn't support meta share):
- - If type meta hasn't been written before, the data will be written as:
+Encoding:
- ```
- | unsigned varint: 0b11111111 | type def |
- ```
+- `marker = (index << 1) | flag`
+- `flag = 0`: new type definition follows
+- `flag = 1`: reference to a previously written type definition
+- `index` is the sequential index assigned to this type (starting from 0).
- - If type meta has been written before, the data will be written as:
+Write algorithm:
- ```
- | unsigned varint: written index << 1 |
- ```
+1. Look up the class in the per-stream meta context map.
+2. If found, write `(index << 1) | 1`.
+3. If not found:
+ - assign `index = next_id`
+ - write `(index << 1)`
+ - write the encoded TypeDef bytes immediately after
- `written index` is the id in `captured_type_defs`.
+Read algorithm:
- - With this mode, `meta start offset` can be omitted.
+1. Read `marker` as varuint32.
+2. `flag = marker & 1`, `index = marker >>> 1`.
+3. If `flag == 1`, use the cached TypeDef at `index`.
+4. If `flag == 0`, read a TypeDef, cache it at `index`, and use it.
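The write and read algorithms above can be sketched as a pair of per-stream contexts. This is a minimal illustration with hypothetical class names; to keep it short, the stream is modeled as a list of `("marker", int)` and `("typedef", bytes)` tokens instead of real varuint32 and TypeDef bytes.

```python
class MetaWriteContext:
    """Per-stream map from type to its sequential index."""

    def __init__(self):
        self._index_by_type = {}

    def write(self, out, type_key, typedef_bytes):
        index = self._index_by_type.get(type_key)
        if index is not None:
            # Seen before: flag = 1, reference only.
            out.append(("marker", (index << 1) | 1))
            return
        # First occurrence: flag = 0, definition follows immediately.
        index = len(self._index_by_type)
        self._index_by_type[type_key] = index
        out.append(("marker", index << 1))
        out.append(("typedef", typedef_bytes))


class MetaReadContext:
    """Per-stream cache of TypeDefs keyed by index."""

    def __init__(self):
        self._typedefs = {}

    def read(self, tokens):
        _, marker = next(tokens)
        flag, index = marker & 1, marker >> 1
        if flag == 1:
            return self._typedefs[index]
        _, typedef = next(tokens)
        self._typedefs[index] = typedef
        return typedef
```

Writing types `A`, `B`, `A` in that order produces markers `0`, `2`, `1`, with a definition following only the first two.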
-> The normal mode and meta share mode will forbid streaming writing since it
needs to look back for update the start
-> offset after the whole object graph writing and meta collecting is finished.
Only in this way we can ensure
-> deserialization failure in meta share mode doesn't lost shared meta.
+TypeDef bytes include the 8-byte global header and optional size extension.
-#### Type Def
+### TypeDef (schema evolution metadata)
-Here we mainly describe the meta layout for schema evolution mode:
+TypeDef describes a struct-like type (or a named enum/ext) for schema
evolution and name
+resolution. It is encoded as:
```
-| 8 bytes header | variable bytes | variable bytes |
-+----------------------+--------------------+-------------------+
-| global binary header | meta header | fields meta |
+| 8-byte global header | [optional size varuint] | TypeDef body |
```
-For languages which support inheritance, if parent class and subclass has
fields with same name, using field in
-subclass.
+#### Global header
-##### Global binary header
+The 8-byte header is a little-endian uint64:
-`50 bits hash + 1bit compress flag + write fields meta + 12 bits meta size`.
Right is the lower bits.
+- Low 12 bits: meta size (number of bytes in the TypeDef body).
+ - If meta size >= 0xFFF, the low 12 bits are set to 0xFFF and an extra
+ `varuint32(meta_size - 0xFFF)` follows immediately after the header.
+- Bit 12: `HAS_FIELDS_META` (1 = fields metadata present).
+- Bit 13: `COMPRESS_META` (1 = body is compressed; decompress before parsing).
+- High 50 bits: hash of the TypeDef body.
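The bit layout of the global header can be illustrated with the sketch below. The function names are hypothetical and the hash is taken as an input, since the hash function itself is not described in this section; the extra `varuint32(meta_size - 0xFFF)` that follows an oversized body is noted in a comment but not emitted.

```python
import struct

META_SIZE_MASK = 0xFFF
HAS_FIELDS_META = 1 << 12
COMPRESS_META = 1 << 13
HASH_SHIFT = 14  # 64 - 50 hash bits


def pack_global_header(body_hash, meta_size, has_fields_meta, compressed):
    """Build the 8-byte little-endian header for a TypeDef body."""
    header = (body_hash & ((1 << 50) - 1)) << HASH_SHIFT
    # Sizes >= 0xFFF saturate the field; the remainder would follow as a varuint32.
    header |= min(meta_size, META_SIZE_MASK)
    if has_fields_meta:
        header |= HAS_FIELDS_META
    if compressed:
        header |= COMPRESS_META
    return struct.pack("<Q", header)


def unpack_global_header(data):
    """Decode the fixed 8-byte header into its four parts."""
    (header,) = struct.unpack_from("<Q", data)
    return {
        "size_field": header & META_SIZE_MASK,
        "has_fields_meta": bool(header & HAS_FIELDS_META),
        "compressed": bool(header & COMPRESS_META),
        "hash": header >> HASH_SHIFT,
    }
```

A reader that sees `size_field == 0xFFF` would then read the extra varuint32 and add `0xFFF` to obtain the real body size.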
-- lower 12 bits are used to encode meta size. If meta size `>=
0b1111_1111_1111`, then write
- `meta_ size - 0b1111_1111_1111` next.
-- 13rd bit is used to indicate whether to write fields meta. When this class
is schema-consistent or use registered
- serializer, fields meta will be skipped. Class Meta will be used for share
namespace + type name only.
-- 14rd bit is used to indicate whether meta is compressed.
-- Other 50 bits is used to store the unique hash of `flags + all layers class
meta`.
+#### TypeDef body
-##### Meta header
-
-Meta header is a 8 bits number value.
-
-- Lowest 5 digits `0b00000~0b11110` are used to record num fields. `0b11111`
is preserved to indicate that Fory need to
- read more bytes for length using Fory unsigned int encoding. Note that
num_fields is the number of compatible fields.
- Users can use tag id to mark some fields as compatible fields in schema
consistent context. In such cases, schema
- consistent fields will be serialized first, then compatible fields will be
serialized next. At deserialization,
- Fory will use fields info of those fields which aren't annotated by tag id
for deserializing schema consistent
- fields, then use fields info in meta for deserializing compatible fields.
-- The 6th bit: 0 for registered by id, 1 for registered by name.
-- Remaining 2 bits are reserved for future extension.
-
-##### Fields meta
-
-Format:
-
-```
-| field info: variable bytes | variable bytes | ... |
-+---------------------------------+-----------------+-----+
-| header + type info + field name | next field info | ... |
-```
-
-###### Field Header
-
-Field Header is 8 bits, annotation can be used to provide more specific info.
If annotation not exists, fory will infer
-those info automatically.
-
-The format for field header is:
+The TypeDef body has a single layer (fields are flattened in class hierarchy
order):
```
-2 bits field name encoding + 4 bits size + nullability flag + ref tracking flag
+| meta header (1 byte) | type spec | field info ... |
```
-Detailed spec:
-
-- 2 bits field name encoding:
- - encoding: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID`
- - If tag id is used, field name will be written by an unsigned varint tag
id, and 2 bits encoding will be `11`.
-- size of field name:
- - The `4 bits size: 0~14` will be used to indicate length `1~15`, the value
`15` indicates to read more bytes,
- the encoding will encode `size - 15` as a varint next.
- - If encoding is `TAG_ID`, then num_bytes of field name will be used to
store tag id.
-- ref tracking: when set to 1, ref tracking will be enabled for this field.
-- nullability: when set to 1, this field can be null.
+Meta header byte:
-###### Field Type Info
+- Bits 0-4: `num_fields` (values 0-30 are stored directly).
+ - If the stored value is 31, read an extra `varuint32` and add it to 31.
+- Bit 5: `REGISTER_BY_NAME` (1 = namespace + type name, 0 = numeric type ID).
+- Bits 6-7: reserved.
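A sketch of decoding the meta header byte follows. The function name is hypothetical, `read_varuint32` is a caller-supplied callable standing in for the buffer read, and the overflow case is interpreted as 31 plus the extra value, which is this sketch's reading of the rule above.

```python
NUM_FIELDS_MASK = 0b11111
REGISTER_BY_NAME = 1 << 5


def parse_meta_header(byte, read_varuint32):
    """Return (num_fields, register_by_name) from the meta header byte."""
    num_fields = byte & NUM_FIELDS_MASK
    if num_fields == NUM_FIELDS_MASK:
        # 31 is the escape value: the remainder follows as a varuint32.
        num_fields += read_varuint32()
    return num_fields, bool(byte & REGISTER_BY_NAME)
```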
-Field type info is written as unsigned int8. Detailed id spec is:
+Type spec:
-- For struct registered by id, it will be `Type.STRUCT`.
-- For struct registered by name, it will be `Type.NAMED_STRUCT`.
-- For enum registered by id, it will be `Type.ENUM`.
-- For enum registered by name, it will be `Type.NAMED_ENUM`.
-- For ext type registered by id, it will be `Type.EXT`.
-- For ext type registered by name, it will be `Type.NAMED_EXT`.
-- For list/set type, it will be written as `Type.LIST/SET`, then write element
type recursively.
-- For 1D primitive array type, it will be written as `Type.XXX_ARRAY`.
-- For multi-dimensional primitive array type with same size on each dim, it
will be written as `Type.TENSOR`.
-- For other array type, it will be written as `Type.LIST`, then write element
type recursively.
-- For map type, it will be written as `Type.MAP`, then write key and value
type recursively.
-- For other types supported by fory directly, it will be fory type id for that
type.
-- For other types not determined at compile time, write `Type.UNKNOWN`
instead. For such types, actual type
- will be written when serializing such field values.
+- If `REGISTER_BY_NAME` is set:
+ - `namespace` meta string
+ - `type_name` meta string
+- Otherwise:
+ - `type_id` as `varuint32` (small7)
-Polymorphism spec:
+Field info list:
-- `struct/named_struct/ext/named_ext` are taken as polymorphic, the meta for
those types are written separately
- instead of inlining here to reduce meta space cost if object of this type is
serialized in current object graph
- multiple times, and the field value may be null too.
-- `enum` is taken as dynamic, if deserialization doesn't have this field, or
the type is not enum, enum value
- will be skipped.
-- `list/map/set` are taken as dynamic, when serializing values of those type,
the concrete types won't be written
- again.
-- Other types that fory supported are taken as dynamic too.
-
-List/Set/Map nested type spec:
-
-- `list`: `| list type id | nested type id << 2 + nullability flag + ref
tracking flag | ... multi-layer type info |`
-- `set`: `| set type id | nested type id << 2 + nullability flag + ref
tracking flag | ... multi-layer type info |`
-- `map`: `| set type id | key type info | value type info |`
- - Key type format: `| nested type id << 2 + nullability flag + ref tracking
flag | ... multi-layer type info |`
- - Value type format: `| nested type id << 2 + nullability flag + ref
tracking flag | ... multi-layer type info |`
-
-###### Field Name
-
-If tag id is set, tag id will be used instead. Otherwise meta string of field
name will be written instead.
-
-###### Field order
-
-Field order are left as implementation details, which is not exposed to
specification, the deserialization need to
-resort fields based on Fory fields sort algorithms. In this way, fory can
compute statistics for field names or types and
-using a more compact encoding.
-
-## Extended Type Meta with Inheritance support
-
-If one want to support inheritance for struct, one can implement following
spec.
-
-### Schema consistent
-
-Fields are serialized from parent type to leaf type. Fields are sorted using
fory struct fields sort algorithms.
-
-### Schema Evolution
-
-Meta layout for schema evolution mode:
+Each field is encoded as:
```
-| 8 bytes header | variable bytes | variable bytes | variable bytes
| variable bytes |
-+----------------------+----------------+----------------+--------------------+--------------------+
-| global binary header | meta header | fields meta | parent meta header
| parent fields meta |
+| field header (1 byte) | field type info | [field name bytes] |
```
-#### Meta header
+Field header layout:
-Meta header is a 64 bits number value encoded in little endian order.
+- Bits 6-7: field name encoding (`UTF8`, `ALL_TO_LOWER_SPECIAL`,
+ `LOWER_UPPER_DIGIT_SPECIAL`, or `TAG_ID`)
+- Bits 2-5: size
+ - For name encoding: `size = (name_bytes_length - 1)`
+ - For tag ID: `size = tag_id`
+ - If the 4-bit value is `0b1111`, a `varuint32` holding `size - 15` follows; add it to 15
+- Bit 1: nullable flag
+- Bit 0: reference tracking flag
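The field header bit positions can be sketched as below. The function names are hypothetical, and the numeric values 0 to 3 for the four name encodings are assumed to follow the order in which they are listed; the varuint32 continuation for large sizes is not shown.

```python
UTF8, ALL_TO_LOWER_SPECIAL, LOWER_UPPER_DIGIT_SPECIAL, TAG_ID = range(4)


def pack_field_header(encoding, size, nullable, tracking_ref):
    """Pack encoding (bits 6-7), size (bits 2-5), nullable (bit 1), ref (bit 0)."""
    size_bits = min(size, 0b1111)  # 0b1111 signals that the remainder follows
    return (encoding << 6) | (size_bits << 2) | (int(nullable) << 1) | int(tracking_ref)


def parse_field_header(byte):
    """Split a field header byte into its four parts."""
    return {
        "encoding": (byte >> 6) & 0b11,
        "size": (byte >> 2) & 0b1111,
        "nullable": bool(byte & 0b10),
        "tracking_ref": bool(byte & 0b01),
    }
```

For a name encoded in five bytes with UTF8, the size field holds `4`, so a nullable, untracked field packs to `0b00010010`.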
-- Lowest 4 digits `0b0000~0b1110` are used to record num classes. `0b1111` is
preserved to indicate that Fory need to
- read more bytes for length using Fory unsigned int encoding. If current type
doesn't has parent type, or parent
- type doesn't have fields to serialize, or we're in a context which serialize
fields of current type
- only, num classes will be 1.
-- The 5th bit is used to indicate whether this type needs schema evolution.
-- Other 56 bits are used to store the unique hash of `flags + all layers type
meta`.
+Field type info:
-#### Single layer type meta
+- The top-level field type is written as `varuint32(type_id)` (small7) without
flags.
+- For `LIST` / `SET`, an element type follows, encoded as
+ `(nested_type_id << 2) | (nullable << 1) | tracking_ref`.
+- For `MAP`, key type and value type follow, both encoded the same way.
+- One-dimensional primitive arrays use `*_ARRAY` type IDs; other arrays are
encoded as `LIST`.
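As a brief illustration of the nested type encoding, the sketch below packs an element type and builds the integer sequence for a list-typed field. The function names are hypothetical, the type IDs are passed in by the caller and not taken from the Internal Type ID Table, and each returned integer would still need to be written as a varuint32.

```python
def pack_nested_type(nested_type_id, nullable, tracking_ref):
    """Encode (nested_type_id << 2) | (nullable << 1) | tracking_ref."""
    return (nested_type_id << 2) | (int(nullable) << 1) | int(tracking_ref)


def unpack_nested_type(value):
    """Return (nested_type_id, nullable, tracking_ref)."""
    return value >> 2, bool(value & 0b10), bool(value & 0b01)


def list_field_type_info(list_type_id, elem_type_id, elem_nullable, elem_tracking_ref):
    """Top-level type ID without flags, followed by the flagged element type."""
    return [list_type_id, pack_nested_type(elem_type_id, elem_nullable, elem_tracking_ref)]
```

The top-level entry carries no flag bits because nullability and reference tracking for the field itself already live in the field header.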
-```
-| unsigned varint | var uint | field info: variable bytes | variable bytes
| ... |
-+-----------------+----------+-------------------------------+-----------------+-----+
-| num_fields | type id | header + type id + field name | next field info
| ... |
-```
+Field names:
+
+- If `TAG_ID` encoding is used, no name bytes are written.
+- Otherwise, write the encoded field name bytes as a meta string.
+- For xlang, field names are converted to `snake_case` before encoding for
+ cross-language compatibility.
-#### Other layers type meta
+Field order:
-Same encoding algorithm as the previous layer.
+Field order is implementation-defined. Decoders must match fields by name or
tag ID rather than
+position. Fory uses a stable grouping and sorting order to produce
deterministic TypeDefs.
## Meta String
@@ -1226,16 +1121,10 @@ then copy the whole buffer into the stream.
Such serialization won't compress the array. If users want to compress
primitive array, users need to register custom
serializers for such types or mark it as list type.
-#### Tensor
+#### Multi-dimensional arrays
-Tensor is a special primitive multi-dimensional array which all dimensions
have same size and type. The serialization
-format is:
-
-```
-| num_dims(unsigned varint) | shape[0](unsigned varint) | shape[...] |
shape[N] | element type | data |
-```
-
-The data is continuous to reduce copy and may zero-copy in some cases.
+Xlang does not define a dedicated tensor encoding. Multi-dimensional arrays
are serialized as
+nested lists, while one-dimensional primitive arrays use the `*_ARRAY` type
IDs.
#### object array
@@ -1331,98 +1220,97 @@ Not supported for now.
### struct
-Struct means object of `class/pojo/struct/bean/record` type.
-Struct will be serialized by writing its fields data in fory order.
-
-Depending on schema compatibility, structs will have different formats.
-
-#### field order
-
-Field will be ordered as following, every group of fields will have its own
order:
-
-- primitive fields:
- - larger size type first, smaller later, variable size type last.
- - when same size, sort by type id
- - when same size and type id, sort by snake case field name
- - types: bool/int8/int16/int32/var32/int64/var64/h64/float16/float32/float64
-- nullable primitive fields: same order as primitive fields
-- other internal type fields: sort by type id then snake case field name
-- list fields: sort by snake case field name
-- set fields: sort by snake case field name
-- map fields: sort by snake case field name
-- other fields: sort by snake case field name
-
-If two fields have same type, then sort by snake_case styled field name.
-
-#### schema consistent
-
-Object will be written as:
-
-```
-| 4 byte | variable bytes |
-+---------------+------------------+
-| type hash | field values |
-```
-
-Type hash is used to check the type schema consistency across languages. Type
hash will be the first 32 bits of 56 bits
-value of the type meta.
-
-Object fields will be serialized one by one using following format:
-
-```
-not null primitive field value:
-| var bytes |
-+----------------+
-| value data |
-+----------------+
-nullable primitive field value:
-| one byte | var bytes |
-+-----------+---------------+
-| null flag | field value |
-+-----------+---------------+
-other interal types supported by fory
-| var bytes | var objects |
-+-----------+-------------+
-| null flag | value data |
-+-----------+-------------+
-list field type:
-| one byte | var objects |
-+-----------+-------------+
-| ref meta | value data |
-set field type:
-| one byte | var objects |
-+-----------+-------------+
-| ref meta | value data |
-map field type:
-| one byte | var objects |
-+-----------+-------------+
-| ref meta | value data |
-+-----------+-------------+-------------+
-other types such as enum/struct/ext
-| one byte | var bytes | var objects |
-+-----------+------------+------------+
-| ref flag | type meta | value data |
-+-----------+------------+------------+
-```
-
-Type hash algorithm:
-
-- Sort fields by fields sort algorithm
-- Start with string `""`
-- Iterate every field, append string by:
- - `snow_case(field_name),`. For camelcase name, convert it to snow_case first.
- - `$type_id,`, for other fields, use type id `TypeId::UNKNOWN` instead.
- - `$nullable;`, `1` if nullable, `0` otherwise.
-- Then convert string to utf8 bytes
-- Compute murmurhash3_x64_128, and use first 32 bits
-
-#### Schema evolution
-
-Schema evolution have similar format as schema consistent mode for object except:
-
-- For the object type, `schema consistent` mode will write type by id only, but `schema evolution` mode will
- write type consisting of field names, types and other meta too, see [Type meta](#type-meta).
-- Type meta of `final custom type` needs to be written too, because peers may not have this type defined.
+Struct means object of `class/pojo/struct/bean/record` type. Struct values are serialized by writing
+fields in Fory order. The type meta before the value is written according to the rules in
+[Type Meta](#type-meta).
+
+#### Field order
+
+Field order must be deterministic and identical across languages. This section defines the
+language-neutral ordering algorithm; implementations must follow the rules here rather than any
+language-specific helper classes.
+
+##### Step 1: Field identifier
+
+For every field, compute a stable identifier used for ordering:
+
+- If a tag ID is configured (e.g., `@ForyField(id=...)`), use the tag ID as a decimal string.
+- Otherwise, use the field name converted to `snake_case`.
+
+Tag IDs must be unique within a type; duplicate tag IDs are invalid.
+
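The field identifier rule in Step 1 can be sketched as follows. This is a minimal illustration, not the reference implementation: the function names `to_snake_case`, `field_identifier`, and `check_unique_tag_ids` are hypothetical, and the exact snake_case conversion rule (in particular how acronyms are split) is an assumption, since this section does not spell it out.

```python
import re
from typing import Iterable, Optional


def to_snake_case(name: str) -> str:
    """Convert a camelCase or PascalCase name to snake_case (assumed rule)."""
    # Insert "_" between a lowercase letter or digit and a following uppercase letter.
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    # Split an acronym from a following capitalized word: "HTTPServer" -> "HTTP_Server".
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", s)
    return s.lower()


def field_identifier(name: str, tag_id: Optional[int] = None) -> str:
    """Return the tag ID as a decimal string if configured, else the snake_case name."""
    if tag_id is not None:
        return str(tag_id)
    return to_snake_case(name)


def check_unique_tag_ids(tag_ids: Iterable[int]) -> None:
    """Raise ValueError if any tag ID appears more than once within a type."""
    seen = set()
    for tag_id in tag_ids:
        if tag_id in seen:
            raise ValueError(f"duplicate tag ID: {tag_id}")
        seen.add(tag_id)
```

Read literally, identifiers are compared as strings, so a tag ID of `10` would sort before a tag ID of `2`; implementers may want to confirm this against an existing implementation before relying on it.
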
+##### Step 2: Group assignment
+
+Assign each field to exactly one group in the following order:
+
+1. **Primitive (non-nullable)**: primitive or boxed numeric/boolean types with `nullable=false`.
+2. **Primitive (nullable)**: primitive or boxed numeric/boolean types with `nullable=true`.
+3. **Built-in (non-container)**: internal type IDs that are not user-defined and not UNKNOWN,
+ excluding collections and maps (for example: STRING, TIME types, UNION, primitive arrays).
+4. **Collection**: list/set/object-array fields. Non-primitive arrays are treated as LIST for
+ ordering purposes.
+5. **Map**: map fields.
+6. **Other**: user-defined enum/struct/ext and UNKNOWN types.
+
+##### Step 3: Intra-group ordering
+
+Within each group, apply the following sort keys in order until a difference is found:
+
+**Primitive groups (1 and 2):**
+
+1. **Compression category**: fixed-size numeric and boolean types first, then compressed numeric
+ types (`VARINT32`, `VAR_UINT32`, `VARINT64`, `VAR_UINT64`, `TAGGED_INT64`, `TAGGED_UINT64`).
+2. **Primitive size** (descending): 8-byte > 4-byte > 2-byte > 1-byte.
+3. **Internal type ID** (descending) as a tie-breaker for equal sizes.
+4. **Field identifier** (lexicographic ascending).
+
+**Built-in / Collection / Map groups (3-5):**
+
+1. **Internal type ID** (ascending).
+2. **Field identifier** (lexicographic ascending).
+
+**Other group (6):**
+
+1. **Field identifier** (lexicographic ascending).
+
+If two fields still compare equal after the rules above, preserve a deterministic order by
+comparing declaring class name and then the original field name. This tie-breaker should be
+reachable only in invalid schemas (e.g., duplicate tag IDs).
+
+##### Notes
+
+- The ordering above is used for serialization order and TypeDef field lists. Schema hashes use
+ the field identifier ordering described in the schema hash section.
+- Collection/map normalization is required so peers with different concrete types (e.g.,
+ `List` vs `Collection`) still agree on ordering.
+- The compressed numeric rule is critical for cross-language consistency: compressed integer
+ fields are always placed after all fixed-width integer fields.
+
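Steps 2 and 3 above amount to a single composite sort key. The sketch below illustrates that idea under several assumptions: the `Field` record and its attribute names are hypothetical, the group number is taken as already assigned per Step 2, the numeric type IDs in any example are placeholders rather than real Fory type IDs, and compressed types are assumed to carry their nominal width (4 or 8 bytes) as their size.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Field:
    identifier: str           # tag ID as decimal string, or snake_case name (Step 1)
    group: int                # 1..6, assigned per Step 2
    type_id: int              # internal type ID (placeholder values in examples)
    size: int = 0             # primitive width in bytes; used by groups 1 and 2 only
    compressed: bool = False  # True for varint/tagged encodings


def sort_key(f: Field) -> Tuple:
    """Build the composite sort key described in Step 3."""
    if f.group in (1, 2):
        # Fixed-size before compressed, then size descending,
        # then type ID descending, then identifier ascending.
        return (f.group, f.compressed, -f.size, -f.type_id, f.identifier)
    if f.group in (3, 4, 5):
        # Type ID ascending, then identifier ascending.
        return (f.group, f.type_id, f.identifier)
    # Group 6: identifier ascending only.
    return (f.group, f.identifier)


def fory_order(fields: List[Field]) -> List[Field]:
    """Return fields in serialization order."""
    return sorted(fields, key=sort_key)
```

Keys of different lengths never compare past their first element, because fields in different groups already differ on `group`. The final tie-breaker on declaring class name and original field name is omitted here, since it should be reachable only for invalid schemas.
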
+#### Schema consistent (meta share disabled)
+
+Object value layout:
+
+```
+| [optional 4-byte schema hash] | field values |
+```
+
+The schema hash is written only when class-version checking is enabled. It is the low 32 bits of a
+MurmurHash3 x64_128 of the struct fingerprint string:
+
+- For each field, build `<field_id_or_name>,<type_id>,<ref>,<nullable>;`.
+- Field identifier is the tag ID if present, otherwise the snake_case field name.
+- Sort by field identifier lexicographically before concatenation.
+
+Field values are serialized in Fory order. Primitive fields are written as raw values (nullable
+primitives include a null flag). Non-primitive fields write ref/null flags as needed and then the
+value; polymorphic fields include type meta.
+
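The fingerprint construction described above can be sketched as below. The function names are hypothetical, and encoding `<ref>` and `<nullable>` as `1`/`0` is an assumption (the removed revision of this section used `1`/`0` for the nullable flag; the new text does not state the encoding). The MurmurHash3 x64_128 step is left out because the Python standard library does not provide it and the seed is not given here; `low_32_bits` only shows the final truncation of an already computed integer hash.

```python
from typing import Iterable, Tuple

# (field identifier, type ID, tracks references, nullable)
FieldInfo = Tuple[str, int, bool, bool]


def struct_fingerprint(fields: Iterable[FieldInfo]) -> str:
    """Concatenate '<id>,<type_id>,<ref>,<nullable>;' entries sorted by identifier."""
    parts = []
    # Sort lexicographically by field identifier before concatenation.
    for identifier, type_id, ref, nullable in sorted(fields, key=lambda f: f[0]):
        parts.append(f"{identifier},{type_id},{int(ref)},{int(nullable)};")
    return "".join(parts)


def low_32_bits(hash_value: int) -> int:
    """Keep the low 32 bits of an integer hash value."""
    return hash_value & 0xFFFFFFFF
```

The resulting string would then be encoded to bytes and hashed; which 64-bit half of the 128-bit result supplies the low 32 bits is not specified in this section and should be checked against an existing implementation.
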
+#### Compatible mode (meta share enabled)
+
+The field value layout is the same as schema-consistent mode, but the type meta for
+`COMPATIBLE_STRUCT` and `NAMED_COMPATIBLE_STRUCT` uses shared TypeDef entries. Deserializers use
+TypeDef to map fields by name or tag ID and to honor nullable/ref flags from metadata; unknown fields
+are skipped.
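
The matching behavior in compatible mode can be illustrated with a small sketch. It assumes a hypothetical, already decoded view of the data: `remote_fields` stands for the field identifiers listed in the writer's TypeDef, `values` for the decoded field values in the same order, and `local_fields` for a mapping from identifier to the reader's attribute name. None of these names come from the Fory API, and real deserializers skip unknown values at the byte level rather than after decoding.

```python
from typing import Any, Dict, List


def read_compatible(
    remote_fields: List[str],
    values: List[Any],
    local_fields: Dict[str, str],
) -> Dict[str, Any]:
    """Assign values to local attributes by identifier; drop unknown fields."""
    result: Dict[str, Any] = {}
    for identifier, value in zip(remote_fields, values):
        target = local_fields.get(identifier)
        if target is None:
            # Field exists only on the writer side: consume and discard it.
            continue
        result[target] = value
    return result
```

Fields that exist only on the reader side are simply absent from the result, so the reader would keep whatever default it assigns to them.
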
### Type
@@ -1526,9 +1414,8 @@ This section provides a step-by-step guide for
implementing Fory xlang serializa
- [ ] Optionally implement Hybrid encoding (TAGGED_INT64/TAGGED_UINT64) for int64
3. **Header Handling**
- - [ ] Write/read bitmap flags (null, endian, xlang, oob)
- - [ ] Write/read language ID
- - [ ] Handle meta start offset placeholder (for schema evolution)
+ - [ ] Write/read bitmap flags (null, xlang, oob)
+ - [ ] Write/read language ID (when xlang flag is set)
### Phase 2: Basic Type Serializers
@@ -1597,26 +1484,26 @@ Meta strings are required for enum and struct
serialization (encoding field name
- [ ] Generate type IDs: `(user_id << 8) | internal_type_id`
14. **Field Ordering**
- - [ ] Implement Fory field ordering algorithm
- - [ ] Sort primitives by size (larger first), then type ID, then name
- - [ ] Handle nullable vs non-nullable fields
- - [ ] Convert field names to snake_case for sorting
+ - [ ] Implement the spec-defined grouping and ordering (primitive/boxed/built-in, collections/maps, other)
+ - [ ] Use a stable comparator within each group (type ID and name)
+ - [ ] Use tag ID or snake_case field name as field identifier for fingerprints
15. **Schema Consistent Mode**
- - [ ] Compute type hash (MurmurHash3 of field info string)
- - [ ] Write 4-byte type hash before fields
+ - [ ] If class-version check is enabled, compute schema hash from field identifiers
+ - [ ] Write 4-byte schema hash before fields
- [ ] Serialize fields in Fory order
-16. **Schema Evolution Mode** (Optional)
- - [ ] Implement type meta writing
- - [ ] Support field addition/removal
- - [ ] Handle unknown fields (skip during read)
+16. **Compatible/Meta Share Mode**
+ - [ ] Implement shared TypeDef stream (inline new TypeDefs, index references)
+ - [ ] Map fields by name or tag ID, skip unknown fields
+ - [ ] Apply nullable/ref flags from TypeDef metadata
### Phase 7: Other types
17. **Binary/Array Types**
- - [ ] Primitive arrays (direct buffer copy)
- - [ ] Tensor (multi-dimensional arrays)
+
+- [ ] Primitive arrays (direct buffer copy)
+- [ ] Multi-dimensional arrays as nested lists (no tensor encoding)
### Testing Strategy
@@ -1672,8 +1559,8 @@ Meta strings are required for enum and struct
serialization (encoding field name
1. **Byte Order**: Always use little-endian for multi-byte values
2. **Varint Sign Extension**: Ensure proper handling of signed vs unsigned varints
3. **Reference ID Ordering**: IDs must be assigned in serialization order
-4. **Field Order Consistency**: Must match exactly across languages (schema consistent mode only; in evolution mode, deserialization follows serialization field order from type meta)
+4. **Field Order Consistency**: Must match exactly across languages in schema-consistent mode; in compatible mode, match by TypeDef field names or tag IDs
5. **String Encoding**: Use best encoding for current language
6. **Null Handling**: Different languages represent null differently
7. **Empty Collections**: Still write length (0) and header byte
-8. **Type Hash Calculation**: Must use exact same algorithm across languages
+8. **Schema Hash Calculation**: Must use the same fingerprint and MurmurHash3 algorithm across languages when enabled