This is an automated email from the ASF dual-hosted git repository. chaokunyang pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/fory-site.git
commit 56ca0487ad60f93692ba312aa3ebda3e3168a247 Author: chaokunyang <shawn.ck.y...@gmail.com> AuthorDate: Wed Jun 18 15:57:10 2025 +0000 🔄 synced local 'docs/specification/' with remote 'docs/specification/' --- docs/specification/java_serialization_spec.md | 74 +++++++++++++------------- docs/specification/xlang_serialization_spec.md | 66 ++++++++++++----------- 2 files changed, 72 insertions(+), 68 deletions(-) diff --git a/docs/specification/java_serialization_spec.md b/docs/specification/java_serialization_spec.md index 469300bb..3a8d4bbe 100644 --- a/docs/specification/java_serialization_spec.md +++ b/docs/specification/java_serialization_spec.md @@ -66,7 +66,7 @@ corresponding flags and maintaining internal state. Reference flags: | Flag | Byte Value | Description | -|---------------------|------------|---------------------------------------------------------------------------------------------------------------------------------------------------------| +| ------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- | | NULL FLAG | `-3` | This flag indicates the object is a null value. We don't use another byte to indicate REF, so that we can save one byte. | | REF FLAG | `-2` | This flag indicates the object is already serialized previously, and fory will write a ref id with unsigned varint format instead of serialize it again | | NOT_NULL VALUE FLAG | `-1` | This flag indicates the object is a non-null value and fory doesn't track ref for this type of object. | @@ -92,15 +92,15 @@ If schema consistent mode is enabled globally or enabled for current class, clas - If class is not registered: - If class is not an array, fory will write one byte `0bxxxxxxx1` first, then write class name. - The first little bit is `1`, which is different from first bit `0` of - encoded class id. 
Fory can use this information to determine whether to read class by class id for - deserialization. + encoded class id. Fory can use this information to determine whether to read class by class id for + deserialization. - If class is not registered and class is an array, fory will write one byte `dimensions << 1 | 1` first, then write - component - class subsequently. This can reduce array class name cost if component class is or will be serialized. + component + class subsequently. This can reduce array class name cost if component class is or will be serialized. - Class will be written as two enumerated fory unsigned by default: `package name` and `class name`. If meta share - mode is - enabled, - class will be written as an unsigned varint which points to index in `MetaContext`. + mode is + enabled, + class will be written as an unsigned varint which points to index in `MetaContext`. ### Schema evolution @@ -162,45 +162,45 @@ Meta header is a 64 bits number value encoded in little endian order. - num fields: encode `num fields << 1 | register flag(1 when class registered)` as unsigned varint. - If class is registered, then an unsigned varint class id will be written next, package and class name will be - omitted. + omitted. - If current class is schema consistent, then num field will be `0` to flag it. - If current class isn't schema consistent, then num field will be the number of compatible fields. For example, - users - can use tag id to mark some field as compatible field in schema consistent context. In such cases, schema - consistent - fields will be serialized first, then compatible fields will be serialized next. At deserialization, Fory will use - fields info of those fields which aren't annotated by tag id for deserializing schema consistent fields, then use - fields info in meta for deserializing compatible fields. + users + can use tag id to mark some field as compatible field in schema consistent context. 
In such cases, schema + consistent + fields will be serialized first, then compatible fields will be serialized next. At deserialization, Fory will use + fields info of those fields which aren't annotated by tag id for deserializing schema consistent fields, then use + fields info in meta for deserializing compatible fields. - Package name encoding(omitted when class is registered): - encoding algorithm: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL` - - Header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63` will be used to indicate size `0~63`, - the value `63` the size need more byte to read, the encoding will encode `size - 63` as a varint next. + - Header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63` will be used to indicate size `0~63`, + the value `63` the size need more byte to read, the encoding will encode `size - 63` as a varint next. - Class name encoding(omitted when class is registered): - encoding algorithm: `UTF8/LOWER_UPPER_DIGIT_SPECIAL/FIRST_TO_LOWER_SPECIAL/ALL_TO_LOWER_SPECIAL` - - header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63` will be used to indicate size `0~63`, - the value `63` the size need more byte to read, the encoding will encode `size - 63` as a varint next. + - header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63` will be used to indicate size `0~63`, + the value `63` the size need more byte to read, the encoding will encode `size - 63` as a varint next. - Field info: - header(8 - bits): `3 bits size + 2 bits field name encoding + polymorphism flag + nullability flag + ref tracking flag`. - Users can use annotation to provide those info. + bits): `3 bits size + 2 bits field name encoding + polymorphism flag + nullability flag + ref tracking flag`. + Users can use annotation to provide those info. - 2 bits field name encoding: - encoding: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID` - If tag id is used, i.e. 
field name is written by an unsigned varint tag id. 2 bits encoding will be `11`. - size of field name: - - The `3 bits size: 0~7` will be used to indicate length `1~7`, the value `6` the size read more bytes, - the encoding will encode `size - 7` as a varint next. + - The `3 bits size: 0~7` will be used to indicate length `1~7`, the value `6` the size read more bytes, + the encoding will encode `size - 7` as a varint next. - If encoding is `TAG_ID`, then num_bytes of field name will be used to store tag id. - ref tracking: when set to 1, ref tracking will be enabled for this field. - nullability: when set to 1, this field can be null. - polymorphism: when set to 1, the actual type of field will be the declared field type even the type if - not `final`. + not `final`. - type id: - For registered type-consistent classes, it will be the registered class id. - Otherwise it will be encoded as `OBJECT_ID` if it isn't `final` and `FINAL_OBJECT_ID` if it's `final`. The - meta for such types is written separately instead of inlining here is to reduce meta space cost if object of - this type is serialized in current object graph multiple times, and the field value may be null too. + meta for such types is written separately instead of inlining here is to reduce meta space cost if object of + this type is serialized in current object graph multiple times, and the field value may be null too. - Field name: If type id is set, type id will be used instead. Otherwise meta string encoding length and data will - be written instead. + be written instead. Field order are left as implementation details, which is not exposed to specification, the deserialization need to resort fields based on Fory field comparator. 
In this way, fory can compute statistics for field names or types and @@ -215,9 +215,9 @@ Same encoding algorithm as the previous layer except: - If package name has been written before: `varint index + sharing flag(set)` will be written - If package name hasn't been written before: - If meta string encoding is `LOWER_SPECIAL` and the length of encoded string `<=` 64, then header will be - `6 bits size + encoding flag(set) + sharing flag(unset)`. + `6 bits size + encoding flag(set) + sharing flag(unset)`. - Otherwise, header will - be `3 bits unset + 3 bits encoding flags + encoding flag(unset) + sharing flag(unset)` + be `3 bits unset + 3 bits encoding flags + encoding flag(unset) + sharing flag(unset)` ## Meta String @@ -227,16 +227,16 @@ Meta string is mainly used to encode meta strings such as class name and field n String binary encoding algorithm: -| Algorithm | Pattern | Description | -|---------------------------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| LOWER_SPECIAL | `a-z._$\|` | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101`, prepend one bit at the start to indicate whether strip last char since last byte may have 7 redundant bits(1 indicates strip last char) | -| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._` | every char is written using 6 bits, `a-z`: `0b00000~0b11001`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `._`: `0b111110~0b111111`, prepend one bit at the start to indicate whether strip last char since last byte may have 7 redundant bits(1 indicates strip last char) | -| UTF-8 | any chars | UTF-8 encoding | +| Algorithm | Pattern | Description | +| ------------------------- | ------------- | 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| LOWER_SPECIAL | `a-z._$\|` | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101`, prepend one bit at the start to indicate whether strip last char since last byte may have 7 redundant bits(1 indicates strip last char) | +| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._` | every char is written using 6 bits, `a-z`: `0b00000~0b11001`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `._`: `0b111110~0b111111`, prepend one bit at the start to indicate whether strip last char since last byte may have 7 redundant bits(1 indicates strip last char) | +| UTF-8 | any chars | UTF-8 encoding | Encoding flags: | Encoding Flag | Pattern | Encoding Algorithm | -|---------------------------|---------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------| +| ------------------------- | ------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | | LOWER_SPECIAL | every char is in `a-z._$\|` | `LOWER_SPECIAL` | | FIRST_TO_LOWER_SPECIAL | every char is in `a-z[c1,c2]` except first char is upper case | replace first upper case char to lower case, then use `LOWER_SPECIAL` | | ALL_TO_LOWER_SPECIAL | every char is in `a-zA-Z[c1,c2]` | replace every upper case char by `\|` + `lower case`, then use `LOWER_SPECIAL`, use this encoding if it's smaller than Encoding `LOWER_UPPER_DIGIT_SPECIAL` | @@ -324,7 +324,7 @@ If string has been written before, the data will 
be written as follows: - size: 1~9 byte - Fory PVL(Progressive Variable-length Long) Encoding: - positive long format: first bit in every byte indicates whether to have the next byte. If first bit is set - i.e. `b & 0x80 == 0x80`, then the next byte should be read until the first bit is unset. + i.e. `b & 0x80 == 0x80`, then the next byte should be read until the first bit is unset. #### Signed long @@ -334,7 +334,7 @@ If string has been written before, the data will be written as follows: - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |` - Fory PVL(Progressive Variable-length Long) Encoding: - First convert the number into positive unsigned long by `(v << 1) ^ (v >> 63)` ZigZag algorithm to reduce cost of - small negative numbers, then encoding it as an unsigned long. + small negative numbers, then encoding it as an unsigned long. #### Float diff --git a/docs/specification/xlang_serialization_spec.md b/docs/specification/xlang_serialization_spec.md index aeaa0f8d..debfaf92 100644 --- a/docs/specification/xlang_serialization_spec.md +++ b/docs/specification/xlang_serialization_spec.md @@ -191,7 +191,7 @@ corresponding flags and maintaining internal state. Reference flags: | Flag | Byte Value | Description | -|---------------------|------------|---------------------------------------------------------------------------------------------------------------------------------------------------------| +| ------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- | | NULL FLAG | `-3` | This flag indicates the object is a null value. We don't use another byte to indicate REF, so that we can save one byte. 
| | REF FLAG | `-2` | This flag indicates the object is already serialized previously, and fory will write a ref id with unsigned varint format instead of serialize it again | | NOT_NULL VALUE FLAG | `-1` | This flag indicates the object is a non-null value and fory doesn't track ref for this type of object. | @@ -243,7 +243,7 @@ differently. - If schema evolution mode is enabled globally when creating fory, and current class is configured to use schema consistent mode like `struct` vs `table` in flatbuffers: - Type meta will be add to `captured_type_defs`: `captured_type_defs[type def stub] = map size` ahead when - registering type. + registering type. - Get index of the meta in `captured_type_defs`, write that index as `| unsigned varint: index |`. ### Struct Schema evolution @@ -252,10 +252,12 @@ If schema evolution mode is enabled globally when creating fory, and enabled for using one of the following mode. Which mode to use is configured when creating fory. - Normal mode(meta share not enabled): + - If type meta hasn't been written before, add `type def` - to `captured_type_defs`: `captured_type_defs[type def] = map size`. + to `captured_type_defs`: `captured_type_defs[type def] = map size`. - Get index of the meta in `captured_type_defs`, write that index as `| unsigned varint: index |`. - After finished the serialization of the object graph, fory will start to write `captured_type_defs`: + - Firstly, set current to `meta start offset` of fory header - Then write `captured_type_defs` one by one: @@ -270,31 +272,33 @@ using one of the following mode. Which mode to use is configured when creating f - Meta share mode: the writing steps are same as the normal mode, but `captured_type_defs` will be shared across multiple serializations of different objects. For example, suppose we have a batch to serialize: - ```python - captured_type_defs = {} - stream = ... 
- # add `Type1` to `captured_type_defs` and write `Type1` - fory.serialize(stream, [Type1()]) - # add `Type2` to `captured_type_defs` and write `Type2`, `Type1` is written before. - fory.serialize(stream, [Type1(), Type2()]) - # `Type1` and `Type2` are written before, no need to write meta. - fory.serialize(stream, [Type1(), Type2()]) - ``` + ```python + captured_type_defs = {} + stream = ... + # add `Type1` to `captured_type_defs` and write `Type1` + fory.serialize(stream, [Type1()]) + # add `Type2` to `captured_type_defs` and write `Type2`, `Type1` is written before. + fory.serialize(stream, [Type1(), Type2()]) + # `Type1` and `Type2` are written before, no need to write meta. + fory.serialize(stream, [Type1(), Type2()]) + ``` - Streaming mode(streaming mode doesn't support meta share): + - If type meta hasn't been written before, the data will be written as: - ``` - | unsigned varint: 0b11111111 | type def | - ``` + ``` + | unsigned varint: 0b11111111 | type def | + ``` - If type meta has been written before, the data will be written as: - ``` - | unsigned varint: written index << 1 | - ``` + ``` + | unsigned varint: written index << 1 | + ``` + + `written index` is the id in `captured_type_defs`. - `written index` is the id in `captured_type_defs`. - With this mode, `meta start offset` can be omitted. > The normal mode and meta share mode will forbid streaming writing since it > needs to look back for update the start @@ -365,8 +369,8 @@ Detailed spec: - encoding: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID` - If tag id is used, field name will be written by an unsigned varint tag id, and 2 bits encoding will be `11`. - size of field name: - - The `4 bits size: 0~14` will be used to indicate length `1~15`, the value `15` indicates to read more bytes, - the encoding will encode `size - 15` as a varint next. 
+ - The `4 bits size: 0~14` will be used to indicate length `1~15`, the value `15` indicates to read more bytes, + the encoding will encode `size - 15` as a varint next. - If encoding is `TAG_ID`, then num_bytes of field name will be used to store tag id. - ref tracking: when set to 1, ref tracking will be enabled for this field. - nullability: when set to 1, this field can be null. @@ -411,7 +415,7 @@ List/Set/Map nested type spec: ###### Field Name -If tag id is set, tag id will be used instead. Otherwise meta string of field name will be written instead. +If tag id is set, tag id will be used instead. Otherwise meta string of field name will be written instead. ###### Field order @@ -468,16 +472,16 @@ Meta string is mainly used to encode meta strings such as field names. String binary encoding algorithm: -| Algorithm | Pattern | Description | -|---------------------------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| LOWER_SPECIAL | `a-z._$\|` | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101`, prepend one bit at the start to indicate whether strip last char since last byte may have 7 redundant bits(1 indicates strip last char) | -| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._` | every char is written using 6 bits, `a-z`: `0b00000~0b11001`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `._`: `0b111110~0b111111`, prepend one bit at the start to indicate whether strip last char since last byte may have 7 redundant bits(1 indicates strip last char) | -| UTF-8 | any chars | UTF-8 encoding | +| Algorithm | Pattern | Description | +| ------------------------- | ------------- | 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| LOWER_SPECIAL | `a-z._$\|` | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101`, prepend one bit at the start to indicate whether strip last char since last byte may have 7 redundant bits(1 indicates strip last char) | +| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._` | every char is written using 6 bits, `a-z`: `0b00000~0b11001`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `._`: `0b111110~0b111111`, prepend one bit at the start to indicate whether strip last char since last byte may have 7 redundant bits(1 indicates strip last char) | +| UTF-8 | any chars | UTF-8 encoding | Encoding flags: | Encoding Flag | Pattern | Encoding Algorithm | -|---------------------------|----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------| +| ------------------------- | -------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | | LOWER_SPECIAL | every char is in `a-z._\|` | `LOWER_SPECIAL` | | FIRST_TO_LOWER_SPECIAL | every char is in `a-z._` except first char is upper case | replace first upper case char to lower case, then use `LOWER_SPECIAL` | | ALL_TO_LOWER_SPECIAL | every char is in `a-zA-Z._` | replace every upper case char by `\|` + `lower case`, then use `LOWER_SPECIAL`, use this encoding if it's smaller than Encoding `LOWER_UPPER_DIGIT_SPECIAL` | @@ -546,7 +550,7 @@ Notes: - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes 
long |` - Fory PVL(Progressive Variable-length Long) Encoding: - positive long format: first bit in every byte indicates whether to have the next byte. If first bit is set - i.e. `b & 0x80 == 0x80`, then the next byte should be read until the first bit is unset. + i.e. `b & 0x80 == 0x80`, then the next byte should be read until the first bit is unset. #### signed int64 @@ -561,7 +565,7 @@ Notes: - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |` - Fory PVL(Progressive Variable-length Long) Encoding: - First convert the number into positive unsigned long by `(v << 1) ^ (v >> 63)` ZigZag algorithm to reduce cost of - small negative numbers, then encoding it as an unsigned long. + small negative numbers, then encoding it as an unsigned long. #### float32
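For readers following the `signed int64` hunk, here is a minimal Python sketch of the two steps the spec text describes: the `(v << 1) ^ (v >> 63)` ZigZag mapping, then the PVL layout in which the high bit of each byte (`b & 0x80`) signals that another byte follows. The function names are hypothetical and not part of Fory's API. Two details are assumptions rather than statements from the diff: 7-bit groups are emitted least-significant first, and the ninth byte carries the remaining 8 bits without a continuation flag, which is one way the stated `1~9 byte` size can cover a full 64-bit value.

```python
MASK64 = (1 << 64) - 1  # Python ints are unbounded, so clamp to 64 bits


def zigzag_encode(v: int) -> int:
    """Map a signed 64-bit value to unsigned: (v << 1) ^ (v >> 63)."""
    return ((v << 1) ^ (v >> 63)) & MASK64


def zigzag_decode(u: int) -> int:
    """Inverse of zigzag_encode."""
    return (u >> 1) ^ -(u & 1)


def write_varuint64(u: int) -> bytes:
    """Encode an unsigned 64-bit value in 1 to 9 bytes.

    Assumption: least-significant 7-bit group first; the ninth byte
    stores the last 8 bits as-is (no continuation flag).
    """
    out = bytearray()
    for _ in range(8):
        if u < 0x80:
            out.append(u)  # high bit unset: this is the last byte
            return bytes(out)
        out.append((u & 0x7F) | 0x80)  # high bit set: more bytes follow
        u >>= 7
    out.append(u & 0xFF)  # ninth byte holds bits 56..63
    return bytes(out)


def read_varuint64(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a value written by write_varuint64; returns (value, next_pos)."""
    result = 0
    for i in range(8):
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << (7 * i)
        if (b & 0x80) == 0:  # high bit unset: stop reading
            return result, pos
    result |= data[pos] << 56  # ninth byte contributes a full 8 bits
    return result, pos + 1


def write_varint64(v: int) -> bytes:
    """Signed variant: ZigZag first, then the unsigned encoding."""
    return write_varuint64(zigzag_encode(v))


def read_varint64(data: bytes, pos: int = 0) -> tuple[int, int]:
    u, pos = read_varuint64(data, pos)
    return zigzag_decode(u), pos
```

With this layout, small magnitudes of either sign stay short: `write_varint64(-1)` and `write_varint64(1)` each take one byte, while the extremes of the 64-bit range take nine.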