This is an automated email from the ASF dual-hosted git repository. chaokunyang pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/fury-site.git
commit 752aab97d9a0d099270ad14d285b662c72d0656e Author: chaokunyang <[email protected]> AuthorDate: Sat Aug 17 15:03:08 2024 +0000 🔄 synced local 'docs/specification/' with remote 'docs/specification/' --- docs/specification/java_serialization_spec.md | 72 ++++++++-------- docs/specification/row_format_spec.md | 1 + docs/specification/xlang_serialization_spec.md | 112 ++++++++++++------------- 3 files changed, 91 insertions(+), 94 deletions(-) diff --git a/docs/specification/java_serialization_spec.md b/docs/specification/java_serialization_spec.md index 5bb6511..592413a 100644 --- a/docs/specification/java_serialization_spec.md +++ b/docs/specification/java_serialization_spec.md @@ -1,9 +1,11 @@ --- -title: Fury Java Serialization Specification +title: Fury Java Serialization Format sidebar_position: 1 id: fury_java_serialization_spec --- +# Fury Java Serialization Specification + ## Spec overview Fury Java Serialization is an automatic object serialization framework that supports reference and polymorphism. Fury @@ -75,14 +77,14 @@ If schema consistent mode is enabled globally or enabled for current class, clas - If class is registered, it will be written as a fury unsigned varint: `class_id << 1`. - If class is not registered: - - If class is not an array, fury will write one byte `0bxxxxxxx1` first, then write class name. - - The first little bit is `1`, which is different from first bit `0` of + - If class is not an array, fury will write one byte `0bxxxxxxx1` first, then write class name. + - The first little bit is `1`, which is different from first bit `0` of encoded class id. Fury can use this information to determine whether to read class by class id for deserialization. - - If class is not registered and class is an array, fury will write one byte `dimensions << 1 | 1` first, then write + - If class is not registered and class is an array, fury will write one byte `dimensions << 1 | 1` first, then write component class subsequently. 
This can reduce array class name cost if component class is or will be serialized. - - Class will be written as two enumerated fury unsigned by default: `package name` and `class name`. If meta share + - Class will be written as two enumerated fury unsigned by default: `package name` and `class name`. If meta share mode is enabled, class will be written as an unsigned varint which points to index in `MetaContext`. @@ -143,10 +145,10 @@ Meta header is a 64 bits number value encoded in little endian order. ``` - num fields: encode `num fields << 1 | register flag(1 when class registered)` as unsigned varint. - - If class is registered, then an unsigned varint class id will be written next, package and class name will be + - If class is registered, then an unsigned varint class id will be written next, package and class name will be omitted. - - If current class is schema consistent, then num field will be `0` to flag it. - - If current class isn't schema consistent, then num field will be the number of compatible fields. For example, + - If current class is schema consistent, then num field will be `0` to flag it. + - If current class isn't schema consistent, then num field will be the number of compatible fields. For example, users can use tag id to mark some field as compatible field in schema consistent context. In such cases, schema consistent @@ -154,34 +156,34 @@ Meta header is a 64 bits number value encoded in little endian order. fields info of those fields which aren't annotated by tag id for deserializing schema consistent fields, then use fields info in meta for deserializing compatible fields. - Package name encoding(omitted when class is registered): - - encoding algorithm: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL` - - Header: `6 bits size | 2 bits encoding flags`. 
The `6 bits size: 0~63` will be used to indicate size `0~62`, + - encoding algorithm: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL` + - Header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63` will be used to indicate size `0~62`, the value `63` the size need more byte to read, the encoding will encode `size - 62` as a varint next. - Class name encoding(omitted when class is registered): - - encoding algorithm: `UTF8/LOWER_UPPER_DIGIT_SPECIAL/FIRST_TO_LOWER_SPECIAL/ALL_TO_LOWER_SPECIAL` - - header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63` will be used to indicate size `1~64`, + - encoding algorithm: `UTF8/LOWER_UPPER_DIGIT_SPECIAL/FIRST_TO_LOWER_SPECIAL/ALL_TO_LOWER_SPECIAL` + - header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63` will be used to indicate size `1~64`, the value `63` the size need more byte to read, the encoding will encode `size - 63` as a varint next. - Field info: - - header(8 + - header(8 bits): `3 bits size + 2 bits field name encoding + polymorphism flag + nullability flag + ref tracking flag`. Users can use annotation to provide those info. - - 2 bits field name encoding: - - encoding: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID` - - If tag id is used, i.e. field name is written by an unsigned varint tag id. 2 bits encoding will be `11`. - - size of field name: - - The `3 bits size: 0~7` will be used to indicate length `1~7`, the value `6` the size read more bytes, + - 2 bits field name encoding: + - encoding: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID` + - If tag id is used, i.e. field name is written by an unsigned varint tag id. 2 bits encoding will be `11`. + - size of field name: + - The `3 bits size: 0~7` will be used to indicate length `1~7`, the value `6` the size read more bytes, the encoding will encode `size - 7` as a varint next. - - If encoding is `TAG_ID`, then num_bytes of field name will be used to store tag id. 
- - ref tracking: when set to 1, ref tracking will be enabled for this field. - - nullability: when set to 1, this field can be null. - - polymorphism: when set to 1, the actual type of field will be the declared field type even the type if + - If encoding is `TAG_ID`, then num_bytes of field name will be used to store tag id. + - ref tracking: when set to 1, ref tracking will be enabled for this field. + - nullability: when set to 1, this field can be null. + - polymorphism: when set to 1, the actual type of field will be the declared field type even the type if not `final`. - - type id: - - For registered type-consistent classes, it will be the registered class id. - - Otherwise it will be encoded as `OBJECT_ID` if it isn't `final` and `FINAL_OBJECT_ID` if it's `final`. The + - type id: + - For registered type-consistent classes, it will be the registered class id. + - Otherwise it will be encoded as `OBJECT_ID` if it isn't `final` and `FINAL_OBJECT_ID` if it's `final`. The meta for such types is written separately instead of inlining here is to reduce meta space cost if object of this type is serialized in current object graph multiple times, and the field value may be null too. - - Field name: If type id is set, type id will be used instead. Otherwise meta string encoding length and data will + - Field name: If type id is set, type id will be used instead. Otherwise meta string encoding length and data will be written instead. Field order are left as implementation details, which is not exposed to specification, the deserialization need to @@ -193,12 +195,12 @@ using a more compact encoding. 
Same encoding algorithm as the previous layer except: - header + package name: - - Header: - - If package name has been written before: `varint index + sharing flag(set)` will be written - - If package name hasn't been written before: - - If meta string encoding is `LOWER_SPECIAL` and the length of encoded string `<=` 64, then header will be + - Header: + - If package name has been written before: `varint index + sharing flag(set)` will be written + - If package name hasn't been written before: + - If meta string encoding is `LOWER_SPECIAL` and the length of encoded string `<=` 64, then header will be `6 bits size + encoding flag(set) + sharing flag(unset)`. - - Otherwise, header will + - Otherwise, header will be `3 bits unset + 3 bits encoding flags + encoding flag(unset) + sharing flag(unset)` ## Meta String @@ -305,17 +307,17 @@ If string has been written before, the data will be written as follows: - size: 1~9 byte - Fury PVL(Progressive Variable-length Long) Encoding: - - positive long format: first bit in every byte indicates whether to have the next byte. If first bit is set + - positive long format: first bit in every byte indicates whether to have the next byte. If first bit is set i.e. `b & 0x80 == 0x80`, then the next byte should be read until the first bit is unset. 
#### Signed long - size: 1~9 byte - Fury SLI(Small long as int) Encoding: - - If long is in [-1073741824, 1073741823], encode as 4 bytes int: `| little-endian: ((int) value) << 1 |` - - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |` + - If long is in [-1073741824, 1073741823], encode as 4 bytes int: `| little-endian: ((int) value) << 1 |` + - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |` - Fury PVL(Progressive Variable-length Long) Encoding: - - First convert the number into positive unsigned long by `(v << 1) ^ (v >> 63)` ZigZag algorithm to reduce cost of + - First convert the number into positive unsigned long by ` (v << 1) ^ (v >> 63)` ZigZag algorithm to reduce cost of small negative numbers, then encoding it as an unsigned long. #### Float diff --git a/docs/specification/row_format_spec.md b/docs/specification/row_format_spec.md index af24879..d32c11a 100644 --- a/docs/specification/row_format_spec.md +++ b/docs/specification/row_format_spec.md @@ -4,4 +4,5 @@ sidebar_position: 2 id: fury_row_format_spec --- +# Row Format Coming soon diff --git a/docs/specification/xlang_serialization_spec.md b/docs/specification/xlang_serialization_spec.md index 5f39a08..5882d00 100644 --- a/docs/specification/xlang_serialization_spec.md +++ b/docs/specification/xlang_serialization_spec.md @@ -1,11 +1,12 @@ --- -title: Cross-language Serialization Specification +title: Fury Xlang Serialization Format sidebar_position: 0 id: fury_xlang_serialization_spec --- +# Cross-language Serialization Specification + > Format Version History: -> > - Version 0.1 - serialization spec formalized Fury xlang serialization is an automatic object serialization framework that supports reference and polymorphism. @@ -41,22 +42,22 @@ also introduce more complexities compared to static serialization frameworks. So - set: an unordered set of unique elements. - map: a map of key-value pairs. 
Mutable types such as `list/map/set/array/tensor/arrow` are not allowed as key of map. - time types: - - duration: an absolute length of time, independent of any calendar/timezone, as a count of nanoseconds. - - timestamp: a point in time, independent of any calendar/timezone, as a count of nanoseconds. The count is relative + - duration: an absolute length of time, independent of any calendar/timezone, as a count of nanoseconds. + - timestamp: a point in time, independent of any calendar/timezone, as a count of nanoseconds. The count is relative to an epoch at UTC midnight on January 1, 1970. - decimal: exact decimal value represented as an integer value in two's complement. - binary: an variable-length array of bytes. - array type: only allow numeric components. Other arrays will be taken as List. The implementation should support the interoperability between array and list. - - array: multidimensional array which every sub-array can have different sizes but all have same type. - - bool_array: one dimensional int16 array. - - int8_array: one dimensional int8 array. - - int16_array: one dimensional int16 array. - - int32_array: one dimensional int32 array. - - int64_array: one dimensional int64 array. - - float16_array: one dimensional half_float_16 array. - - float32_array: one dimensional float32 array. - - float64_array: one dimensional float64 array. + - array: multidimensional array which every sub-array can have different sizes but all have same type. + - bool_array: one dimensional int16 array. + - int8_array: one dimensional int8 array. + - int16_array: one dimensional int16 array. + - int32_array: one dimensional int32 array. + - int64_array: one dimensional int64 array. + - float16_array: one dimensional half_float_16 array. + - float32_array: one dimensional float32 array. + - float64_array: one dimensional float64 array. - tensor: a multidimensional dense array of fixed-size values such as a NumPy ndarray. 
- sparse tensor: a multidimensional array whose elements are almost all zeros. - arrow record batch: an arrow [record batch](https://arrow.apache.org/docs/cpp/tables.html#record-batches) object. @@ -196,9 +197,9 @@ differently. of `type_id`. Schema evolution related meta will be ignored. - If schema evolution mode is enabled globally when creating fury, and current class is configured to use schema consistent mode like `struct` vs `table` in flatbuffers: - - Type meta will be add to `captured_type_defs`: `captured_type_defs[type def stub] = map size` ahead when + - Type meta will be add to `captured_type_defs`: `captured_type_defs[type def stub] = map size` ahead when registering type. - - Get index of the meta in `captured_type_defs`, write that index as `| unsigned varint: index |`. + - Get index of the meta in `captured_type_defs`, write that index as `| unsigned varint: index |`. ### Schema evolution @@ -206,24 +207,21 @@ If schema evolution mode is enabled globally when creating fury, and enabled for using one of the following mode. Which mode to use is configured when creating fury. - Normal mode(meta share not enabled): - - If type meta hasn't been written before, add `type def` + - If type meta hasn't been written before, add `type def` to `captured_type_defs`: `captured_type_defs[type def] = map size`. - - Get index of the meta in `captured_type_defs`, write that index as `| unsigned varint: index |`. 
- - After finished the serialization of the object graph, fury will start to write `captured_type_defs`: - - Firstly, set current to `meta start offset` of fury header - - Then write `captured_type_defs` one by one: - - ```python - buffer.write_var_uint32(len(writting_type_defs) - len(schema_consistent_type_def_stubs)) - for type_meta in writting_type_defs: - if not type_meta.is_stub(): - type_meta.write_type_def(buffer) - writing_type_defs = copy(schema_consistent_type_def_stubs) - ``` - + - Get index of the meta in `captured_type_defs`, write that index as `| unsigned varint: index |`. + - After finished the serialization of the object graph, fury will start to write `captured_type_defs`: + - Firstly, set current to `meta start offset` of fury header + - Then write `captured_type_defs` one by one: + ```python + buffer.write_var_uint32(len(writting_type_defs) - len(schema_consistent_type_def_stubs)) + for type_meta in writting_type_defs: + if not type_meta.is_stub(): + type_meta.write_type_def(buffer) + writing_type_defs = copy(schema_consistent_type_def_stubs) + ``` - Meta share mode: the writing steps are same as the normal mode, but `captured_type_defs` will be shared across multiple serializations of different objects. For example, suppose we have a batch to serialize: - ```python captured_type_defs = {} stream = ... @@ -236,20 +234,16 @@ using one of the following mode. Which mode to use is configured when creating f ``` - Streaming mode(streaming mode doesn't support meta share): - - If type meta hasn't been written before, the data will be written as: - + - If type meta hasn't been written before, the data will be written as: ``` | unsigned varint: 0b11111111 | type def | ``` - - - If type meta has been written before, the data will be written as: - + - If type meta has been written before, the data will be written as: ``` | unsigned varint: written index << 1 | ``` - `written index` is the id in `captured_type_defs`. 
- - With this mode, `meta start offset` can be omitted. + - With this mode, `meta start offset` can be omitted. > The normal mode and meta share mode will forbid streaming writing since it > needs to look back for update the start > offset after the whole object graph writing and meta collecting is finished. > Only in this way we can ensure @@ -287,33 +281,33 @@ Meta header is a 64 bits number value encoded in little endian order. ``` - num fields: encode `num fields` as unsigned varint. - - If the current type is schema consistent, then num_fields will be `0` to flag it. - - If the current type isn't schema consistent, then num_fields will be the number of compatible fields. For example, + - If the current type is schema consistent, then num_fields will be `0` to flag it. + - If the current type isn't schema consistent, then num_fields will be the number of compatible fields. For example, users can use tag id to mark some fields as compatible fields in schema consistent context. In such cases, schema consistent fields will be serialized first, then compatible fields will be serialized next. At deserialization, Fury will use fields info of those fields which aren't annotated by tag id for deserializing schema consistent fields, then use fields info in meta for deserializing compatible fields. - type id: the registered id for the current type, which will be written as an unsigned varint. - field info: - - header(8 + - header(8 bits): `3 bits size + 2 bits field name encoding + polymorphism flag + nullability flag + ref tracking flag`. Users can use annotation to provide those info. - - 2 bits field name encoding: - - encoding: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID` - - If tag id is used, i.e. field name is written by an unsigned varint tag id. 2 bits encoding will be `11`. 
- - size of field name: - - The `3 bits size: 0~7` will be used to indicate length `1~7`, the value `7` indicates to read more bytes, + - 2 bits field name encoding: + - encoding: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID` + - If tag id is used, i.e. field name is written by an unsigned varint tag id. 2 bits encoding will be `11`. + - size of field name: + - The `3 bits size: 0~7` will be used to indicate length `1~7`, the value `7` indicates to read more bytes, the encoding will encode `size - 7` as a varint next. - - If encoding is `TAG_ID`, then num_bytes of field name will be used to store tag id. - - ref tracking: when set to 1, ref tracking will be enabled for this field. - - nullability: when set to 1, this field can be null. - - polymorphism: when set to 1, the actual type of field will be the declared field type even the type if + - If encoding is `TAG_ID`, then num_bytes of field name will be used to store tag id. + - ref tracking: when set to 1, ref tracking will be enabled for this field. + - nullability: when set to 1, this field can be null. + - polymorphism: when set to 1, the actual type of field will be the declared field type even the type if not `final`. - - field name: If tag id is set, tag id will be used instead. Otherwise meta string encoding `[length]` and data will + - field name: If tag id is set, tag id will be used instead. Otherwise meta string encoding `[length]` and data will be written instead. - - type id: - - For registered type-consistent classes, it will be the registered type id. - - Otherwise it will be encoded as `OBJECT_ID` if it isn't `final` and `FINAL_OBJECT_ID` if it's `final`. The + - type id: + - For registered type-consistent classes, it will be the registered type id. + - Otherwise it will be encoded as `OBJECT_ID` if it isn't `final` and `FINAL_OBJECT_ID` if it's `final`. 
The meta for such types is written separately instead of inlining here is to reduce meta space cost if object of this type is serialized in current object graph multiple times, and the field value may be null too. @@ -407,10 +401,10 @@ Notes: - size: 1~9 byte - Fury SLI(Small long as int) Encoding: - - If long is in `[0, 2147483647]`, encode as 4 bytes int: `| little-endian: ((int) value) << 1 |` - - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |` + - If long is in `[0, 2147483647]`, encode as 4 bytes int: `| little-endian: ((int) value) << 1 |` + - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |` - Fury PVL(Progressive Variable-length Long) Encoding: - - positive long format: first bit in every byte indicates whether to have the next byte. If first bit is set + - positive long format: first bit in every byte indicates whether to have the next byte. If first bit is set i.e. `b & 0x80 == 0x80`, then the next byte should be read until the first bit is unset. #### signed int64 @@ -422,10 +416,10 @@ Notes: - size: 1~9 byte - Fury SLI(Small long as int) Encoding: - - If long is in `[-1073741824, 1073741823]`, encode as 4 bytes int: `| little-endian: ((int) value) << 1 |` - - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |` + - If long is in `[-1073741824, 1073741823]`, encode as 4 bytes int: `| little-endian: ((int) value) << 1 |` + - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |` - Fury PVL(Progressive Variable-length Long) Encoding: - - First convert the number into positive unsigned long by `(v << 1) ^ (v >> 63)` ZigZag algorithm to reduce cost of + - First convert the number into positive unsigned long by `(v << 1) ^ (v >> 63)` ZigZag algorithm to reduce cost of small negative numbers, then encoding it as an unsigned long. 
#### float32
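The integer encodings described in the diff above (ZigZag, PVL varints, and SLI) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Fury's implementation: all function names are hypothetical, and two details are assumed rather than taken from the spec text shown here, namely that the varint emits the least-significant 7-bit group first, and that the ninth byte carries a full 8 bits so that a 64-bit value fits in the stated 1~9 bytes.

```python
import struct

MASK64 = (1 << 64) - 1  # keep intermediate values within 64 bits


def zigzag_encode(v: int) -> int:
    """Map a signed 64-bit value to unsigned: (v << 1) ^ (v >> 63)."""
    return ((v << 1) ^ (v >> 63)) & MASK64


def zigzag_decode(u: int) -> int:
    """Inverse of zigzag_encode."""
    return (u >> 1) ^ -(u & 1)


def write_var_uint64(u: int) -> bytes:
    """PVL sketch: high bit of each byte means 'another byte follows'."""
    out = bytearray()
    for _ in range(8):
        if u < 0x80:
            out.append(u)  # final byte, continuation bit unset
            return bytes(out)
        out.append((u & 0x7F) | 0x80)
        u >>= 7
    # Assumption: the 9th byte stores the remaining 8 bits without a flag.
    out.append(u)
    return bytes(out)


def read_var_uint64(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, next position)."""
    result = 0
    for i in range(8):
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << (7 * i)
        if (b & 0x80) == 0:  # parentheses needed: == binds tighter than &
            return result, pos
    result |= data[pos] << 56
    return result, pos + 1


def write_sli_int64(v: int) -> bytes:
    """SLI sketch: 4 bytes for [-2**30, 2**30 - 1], otherwise 9 bytes."""
    if -(1 << 30) <= v <= (1 << 30) - 1:
        return struct.pack("<i", v << 1)  # low bit 0 marks the short form
    return b"\x01" + struct.pack("<q", v)  # flag byte, then 8-byte long


def read_sli_int64(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, next position)."""
    if (data[pos] & 1) == 0:
        (i,) = struct.unpack_from("<i", data, pos)
        return i >> 1, pos + 4
    (v,) = struct.unpack_from("<q", data, pos + 1)
    return v, pos + 9
```

With these helpers, `write_var_uint64(zigzag_encode(-1))` produces the single byte `0x01`, while `write_sli_int64(1 << 30)` falls just outside the 4-byte range and takes 9 bytes.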
