(fury-site) 02/03: 🔄 synced local 'docs/specification/' with remote 'docs/specification/'

chaokunyang Sat, 17 Aug 2024 08:04:02 -0700

This is an automated email from the ASF dual-hosted git repository.

chaokunyang pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/fury-site.git


commit 752aab97d9a0d099270ad14d285b662c72d0656e
Author: chaokunyang <[email protected]>
AuthorDate: Sat Aug 17 15:03:08 2024 +0000

    🔄 synced local 'docs/specification/' with remote 'docs/specification/'
---
 docs/specification/java_serialization_spec.md  |  72 ++++++++--------
 docs/specification/row_format_spec.md          |   1 +
 docs/specification/xlang_serialization_spec.md | 112 ++++++++++++-------------
 3 files changed, 91 insertions(+), 94 deletions(-)

diff --git a/docs/specification/java_serialization_spec.md b/docs/specification/java_serialization_spec.md
index 5bb6511..592413a 100644
--- a/docs/specification/java_serialization_spec.md
+++ b/docs/specification/java_serialization_spec.md
@@ -1,9 +1,11 @@
 ---
-title: Fury Java Serialization Specification
+title: Fury Java Serialization Format
 sidebar_position: 1
 id: fury_java_serialization_spec
 ---
 
+# Fury Java Serialization Specification
+
 ## Spec overview
 
 Fury Java Serialization is an automatic object serialization framework that 
supports reference and polymorphism. Fury
@@ -75,14 +77,14 @@ If schema consistent mode is enabled globally or enabled for current class, clas
 
 - If class is registered, it will be written as a fury unsigned varint: 
`class_id << 1`.
 - If class is not registered:
-  - If class is not an array, fury will write one byte `0bxxxxxxx1` first, 
then write class name.
-    - The first little bit is `1`, which is different from first bit `0` of
+    - If class is not an array, fury will write one byte `0bxxxxxxx1` first, 
then write class name.
+        - The first little bit is `1`, which is different from first bit `0` of
           encoded class id. Fury can use this information to determine whether 
to read class by class id for
           deserialization.
-  - If class is not registered and class is an array, fury will write one byte 
`dimensions << 1 | 1` first, then write
+    - If class is not registered and class is an array, fury will write one 
byte `dimensions << 1 | 1` first, then write
       component
       class subsequently. This can reduce array class name cost if component 
class is or will be serialized.
-  - Class will be written as two enumerated fury unsigned by default: `package 
name` and `class name`. If meta share
+    - Class will be written as two enumerated fury unsigned by default: 
`package name` and `class name`. If meta share
       mode is
       enabled,
       class will be written as an unsigned varint which points to index in 
`MetaContext`.
@@ -143,10 +145,10 @@ Meta header is a 64 bits number value encoded in little endian order.
 ```
 
 - num fields: encode `num fields << 1 | register flag(1 when class 
registered)` as unsigned varint.
-  - If class is registered, then an unsigned varint class id will be written 
next, package and class name will be
+    - If class is registered, then an unsigned varint class id will be written 
next, package and class name will be
       omitted.
-  - If current class is schema consistent, then num field will be `0` to flag 
it.
-  - If current class isn't schema consistent, then num field will be the 
number of compatible fields. For example,
+    - If current class is schema consistent, then num field will be `0` to 
flag it.
+    - If current class isn't schema consistent, then num field will be the 
number of compatible fields. For example,
       users
       can use tag id to mark some field as compatible field in schema 
consistent context. In such cases, schema
       consistent
@@ -154,34 +156,34 @@ Meta header is a 64 bits number value encoded in little endian order.
       fields info of those fields which aren't annotated by tag id for 
deserializing schema consistent fields, then use
       fields info in meta for deserializing compatible fields.
 - Package name encoding(omitted when class is registered):
-  - encoding algorithm: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL`
-  - Header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63`  
will be used to indicate size `0~62`,
+    - encoding algorithm: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL`
+    - Header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63`  
will be used to indicate size `0~62`,
       the value `63` the size need more byte to read, the encoding will encode 
`size - 62` as a varint next.
 - Class name encoding(omitted when class is registered):
-  - encoding algorithm: 
`UTF8/LOWER_UPPER_DIGIT_SPECIAL/FIRST_TO_LOWER_SPECIAL/ALL_TO_LOWER_SPECIAL`
-  - header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63`  
will be used to indicate size `1~64`,
+    - encoding algorithm: 
`UTF8/LOWER_UPPER_DIGIT_SPECIAL/FIRST_TO_LOWER_SPECIAL/ALL_TO_LOWER_SPECIAL`
+    - header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63`  
will be used to indicate size `1~64`,
       the value `63` the size need more byte to read, the encoding will encode 
`size - 63` as a varint next.
 - Field info:
-  - header(8
+    - header(8
       bits): `3 bits size + 2 bits field name encoding + polymorphism flag + 
nullability flag + ref tracking flag`.
       Users can use annotation to provide those info.
-    - 2 bits field name encoding:
-      - encoding: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID`
-      - If tag id is used, i.e. field name is written by an unsigned varint 
tag id. 2 bits encoding will be `11`.
-    - size of field name:
-      - The `3 bits size: 0~7`  will be used to indicate length `1~7`, the 
value `6` the size read more bytes,
+        - 2 bits field name encoding:
+            - encoding: 
`UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID`
+            - If tag id is used, i.e. field name is written by an unsigned 
varint tag id. 2 bits encoding will be `11`.
+        - size of field name:
+            - The `3 bits size: 0~7`  will be used to indicate length `1~7`, 
the value `6` the size read more bytes,
               the encoding will encode `size - 7` as a varint next.
-      - If encoding is `TAG_ID`, then num_bytes of field name will be used to 
store tag id.
-    - ref tracking: when set to 1, ref tracking will be enabled for this field.
-    - nullability: when set to 1, this field can be null.
-    - polymorphism: when set to 1, the actual type of field will be the 
declared field type even the type if
+            - If encoding is `TAG_ID`, then num_bytes of field name will be 
used to store tag id.
+        - ref tracking: when set to 1, ref tracking will be enabled for this 
field.
+        - nullability: when set to 1, this field can be null.
+        - polymorphism: when set to 1, the actual type of field will be the 
declared field type even the type if
           not `final`.
-  - type id:
-    - For registered type-consistent classes, it will be the registered class 
id.
-    - Otherwise it will be encoded as `OBJECT_ID` if it isn't `final` and 
`FINAL_OBJECT_ID` if it's `final`. The
+    - type id:
+        - For registered type-consistent classes, it will be the registered 
class id.
+        - Otherwise it will be encoded as `OBJECT_ID` if it isn't `final` and 
`FINAL_OBJECT_ID` if it's `final`. The
           meta for such types is written separately instead of inlining here 
is to reduce meta space cost if object of
           this type is serialized in current object graph multiple times, and 
the field value may be null too.
-  - Field name: If type id is set, type id will be used instead. Otherwise 
meta string encoding length and data will
+    - Field name: If type id is set, type id will be used instead. Otherwise 
meta string encoding length and data will
       be written instead.
 
 Field order are left as implementation details, which is not exposed to 
specification, the deserialization need to
@@ -193,12 +195,12 @@ using a more compact encoding.
 Same encoding algorithm as the previous layer except:
 
 - header + package name:
-  - Header:
-    - If package name has been written before: `varint index + sharing 
flag(set)` will be written
-    - If package name hasn't been written before:
-      - If meta string encoding is `LOWER_SPECIAL` and the length of encoded 
string `<=` 64, then header will be
+    - Header:
+        - If package name has been written before: `varint index + sharing 
flag(set)` will be written
+        - If package name hasn't been written before:
+            - If meta string encoding is `LOWER_SPECIAL` and the length of 
encoded string `<=` 64, then header will be
               `6 bits size + encoding flag(set) + sharing flag(unset)`.
-      - Otherwise, header will
+            - Otherwise, header will
               be `3 bits unset + 3 bits encoding flags + encoding flag(unset) 
+ sharing flag(unset)`
 
 ## Meta String
@@ -305,17 +307,17 @@ If string has been written before, the data will be written as follows:
 
 - size: 1~9 byte
 - Fury PVL(Progressive Variable-length Long) Encoding:
-  - positive long format: first bit in every byte indicates whether to have 
the next byte. If first bit is set
+    - positive long format: first bit in every byte indicates whether to have 
the next byte. If first bit is set
       i.e. `b & 0x80 == 0x80`, then the next byte should be read until the 
first bit is unset.
 
 #### Signed long
 
 - size: 1~9 byte
 - Fury SLI(Small long as int) Encoding:
-  - If long is in [-1073741824, 1073741823], encode as 4 bytes int: `| 
little-endian: ((int) value) << 1 |`
-  - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |`
+    - If long is in [-1073741824, 1073741823], encode as 4 bytes int: `| 
little-endian: ((int) value) << 1 |`
+    - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |`
 - Fury PVL(Progressive Variable-length Long) Encoding:
-  - First convert the number into positive unsigned long by `(v << 1) ^ (v >> 
63)` ZigZag algorithm to reduce cost of
+    - First convert the number into positive unsigned long by ` (v << 1) ^ (v 
>> 63)` ZigZag algorithm to reduce cost of
       small negative numbers, then encoding it as an unsigned long.
 
 #### Float
diff --git a/docs/specification/row_format_spec.md b/docs/specification/row_format_spec.md
index af24879..d32c11a 100644
--- a/docs/specification/row_format_spec.md
+++ b/docs/specification/row_format_spec.md
@@ -4,4 +4,5 @@ sidebar_position: 2
 id: fury_row_format_spec
 ---
 
+# Row Format
 Coming soon
diff --git a/docs/specification/xlang_serialization_spec.md b/docs/specification/xlang_serialization_spec.md
index 5f39a08..5882d00 100644
--- a/docs/specification/xlang_serialization_spec.md
+++ b/docs/specification/xlang_serialization_spec.md
@@ -1,11 +1,12 @@
 ---
-title: Cross-language Serialization Specification
+title: Fury Xlang Serialization Format
 sidebar_position: 0
 id: fury_xlang_serialization_spec
 ---
 
+# Cross-language Serialization Specification
+
 > Format Version History:
->
 > - Version 0.1 - serialization spec formalized
 
 Fury xlang serialization is an automatic object serialization framework that 
supports reference and polymorphism.
@@ -41,22 +42,22 @@ also introduce more complexities compared to static serialization frameworks. So
 - set: an unordered set of unique elements.
 - map: a map of key-value pairs. Mutable types such as 
`list/map/set/array/tensor/arrow` are not allowed as key of map.
 - time types:
-  - duration: an absolute length of time, independent of any 
calendar/timezone, as a count of nanoseconds.
-  - timestamp: a point in time, independent of any calendar/timezone, as a 
count of nanoseconds. The count is relative
+    - duration: an absolute length of time, independent of any 
calendar/timezone, as a count of nanoseconds.
+    - timestamp: a point in time, independent of any calendar/timezone, as a 
count of nanoseconds. The count is relative
       to an epoch at UTC midnight on January 1, 1970.
 - decimal: exact decimal value represented as an integer value in two's 
complement.
 - binary: an variable-length array of bytes.
 - array type: only allow numeric components. Other arrays will be taken as 
List. The implementation should support the
   interoperability between array and list.
-  - array: multidimensional array which every sub-array can have different 
sizes but all have same type.
-  - bool_array: one dimensional int16 array.
-  - int8_array: one dimensional int8 array.
-  - int16_array: one dimensional int16 array.
-  - int32_array: one dimensional int32 array.
-  - int64_array: one dimensional int64 array.
-  - float16_array: one dimensional half_float_16 array.
-  - float32_array: one dimensional float32 array.
-  - float64_array: one dimensional float64 array.
+    - array: multidimensional array which every sub-array can have different 
sizes but all have same type.
+    - bool_array: one dimensional int16 array.
+    - int8_array: one dimensional int8 array.
+    - int16_array: one dimensional int16 array.
+    - int32_array: one dimensional int32 array.
+    - int64_array: one dimensional int64 array.
+    - float16_array: one dimensional half_float_16 array.
+    - float32_array: one dimensional float32 array.
+    - float64_array: one dimensional float64 array.
 - tensor: a multidimensional dense array of fixed-size values such as a NumPy 
ndarray.
 - sparse tensor: a multidimensional array whose elements are almost all zeros.
 - arrow record batch: an arrow [record 
batch](https://arrow.apache.org/docs/cpp/tables.html#record-batches) object.
@@ -196,9 +197,9 @@ differently.
   of `type_id`. Schema evolution related meta will be ignored.
 - If schema evolution mode is enabled globally when creating fury, and current 
class is configured to use schema
   consistent mode like `struct` vs `table` in flatbuffers:
-  - Type meta will be add to `captured_type_defs`: `captured_type_defs[type 
def stub] = map size` ahead when
+    - Type meta will be add to `captured_type_defs`: `captured_type_defs[type 
def stub] = map size` ahead when
       registering type.
-  - Get index of the meta in `captured_type_defs`, write that index as `| 
unsigned varint: index |`.
+    - Get index of the meta in `captured_type_defs`, write that index as `| 
unsigned varint: index |`.
 
 ### Schema evolution
 
@@ -206,24 +207,21 @@ If schema evolution mode is enabled globally when creating fury, and enabled for
 using one of the following mode. Which mode to use is configured when creating 
fury.
 
 - Normal mode(meta share not enabled):
-  - If type meta hasn't been written before, add `type def`
+    - If type meta hasn't been written before, add `type def`
       to `captured_type_defs`: `captured_type_defs[type def] = map size`.
-  - Get index of the meta in `captured_type_defs`, write that index as `| 
unsigned varint: index |`.
-  - After finished the serialization of the object graph, fury will start to 
write `captured_type_defs`:
-    - Firstly, set current to `meta start offset` of fury header
-    - Then write `captured_type_defs` one by one:
-
-      ```python
-      buffer.write_var_uint32(len(writting_type_defs) - len(schema_consistent_type_def_stubs))
-      for type_meta in writting_type_defs:
-          if not type_meta.is_stub():
-              type_meta.write_type_def(buffer)
-      writing_type_defs = copy(schema_consistent_type_def_stubs)
-      ```
-
+    - Get index of the meta in `captured_type_defs`, write that index as `| 
unsigned varint: index |`.
+    - After finished the serialization of the object graph, fury will start to 
write `captured_type_defs`:
+        - Firstly, set current to `meta start offset` of fury header
+        - Then write `captured_type_defs` one by one:
+          ```python
+          buffer.write_var_uint32(len(writting_type_defs) - len(schema_consistent_type_def_stubs))
+          for type_meta in writting_type_defs:
+              if not type_meta.is_stub():
+                  type_meta.write_type_def(buffer)
+          writing_type_defs = copy(schema_consistent_type_def_stubs)
+          ```
 - Meta share mode: the writing steps are same as the normal mode, but 
`captured_type_defs` will be shared across
   multiple serializations of different objects. For example, suppose we have a 
batch to serialize:
-
     ```python
     captured_type_defs = {}
     stream = ...
@@ -236,20 +234,16 @@ using one of the following mode. Which mode to use is configured when creating f
     ```
 
 - Streaming mode(streaming mode doesn't support meta share):
-  - If type meta hasn't been written before, the data will be written as:
-
+    - If type meta hasn't been written before, the data will be written as:
       ```
       | unsigned varint: 0b11111111 | type def |
       ```
-
-  - If type meta has been written before, the data will be written as:
-
+    - If type meta has been written before, the data will be written as:
       ```
       | unsigned varint: written index << 1 |
       ```
-
       `written index` is the id in `captured_type_defs`.
-  - With this mode, `meta start offset` can be omitted.
+    - With this mode, `meta start offset` can be omitted.
 
 > The normal mode and meta share mode will forbid streaming writing since it 
 > needs to look back for update the start
 > offset after the whole object graph writing and meta collecting is finished. 
 > Only in this way we can ensure
@@ -287,33 +281,33 @@ Meta header is a 64 bits number value encoded in little endian order.
 ```
 
 - num fields: encode `num fields` as unsigned varint.
-  - If the current type is schema consistent, then num_fields will be `0` to 
flag it.
-  - If the current type isn't schema consistent, then num_fields will be the 
number of compatible fields. For example,
+    - If the current type is schema consistent, then num_fields will be `0` to 
flag it.
+    - If the current type isn't schema consistent, then num_fields will be the 
number of compatible fields. For example,
       users can use tag id to mark some fields as compatible fields in schema 
consistent context. In such cases, schema
       consistent fields will be serialized first, then compatible fields will 
be serialized next. At deserialization,
       Fury will use fields info of those fields which aren't annotated by tag 
id for deserializing schema consistent
       fields, then use fields info in meta for deserializing compatible fields.
 - type id: the registered id for the current type, which will be written as an 
unsigned varint.
 - field info:
-  - header(8
+    - header(8
       bits): `3 bits size + 2 bits field name encoding + polymorphism flag + 
nullability flag + ref tracking flag`.
       Users can use annotation to provide those info.
-    - 2 bits field name encoding:
-      - encoding: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID`
-      - If tag id is used, i.e. field name is written by an unsigned varint 
tag id. 2 bits encoding will be `11`.
-    - size of field name:
-      - The `3 bits size: 0~7`  will be used to indicate length `1~7`, the 
value `7` indicates to read more bytes,
+        - 2 bits field name encoding:
+            - encoding: 
`UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID`
+            - If tag id is used, i.e. field name is written by an unsigned 
varint tag id. 2 bits encoding will be `11`.
+        - size of field name:
+            - The `3 bits size: 0~7`  will be used to indicate length `1~7`, 
the value `7` indicates to read more bytes,
               the encoding will encode `size - 7` as a varint next.
-      - If encoding is `TAG_ID`, then num_bytes of field name will be used to 
store tag id.
-    - ref tracking: when set to 1, ref tracking will be enabled for this field.
-    - nullability: when set to 1, this field can be null.
-    - polymorphism: when set to 1, the actual type of field will be the 
declared field type even the type if
+            - If encoding is `TAG_ID`, then num_bytes of field name will be 
used to store tag id.
+        - ref tracking: when set to 1, ref tracking will be enabled for this 
field.
+        - nullability: when set to 1, this field can be null.
+        - polymorphism: when set to 1, the actual type of field will be the 
declared field type even the type if
           not `final`.
-  - field name: If tag id is set, tag id will be used instead. Otherwise meta 
string encoding `[length]` and data will
+    - field name: If tag id is set, tag id will be used instead. Otherwise 
meta string encoding `[length]` and data will
       be written instead.
-  - type id:
-    - For registered type-consistent classes, it will be the registered type 
id.
-    - Otherwise it will be encoded as `OBJECT_ID` if it isn't `final` and 
`FINAL_OBJECT_ID` if it's `final`. The
+    - type id:
+        - For registered type-consistent classes, it will be the registered 
type id.
+        - Otherwise it will be encoded as `OBJECT_ID` if it isn't `final` and 
`FINAL_OBJECT_ID` if it's `final`. The
           meta for such types is written separately instead of inlining here 
is to reduce meta space cost if object of
           this type is serialized in current object graph multiple times, and 
the field value may be null too.
 
@@ -407,10 +401,10 @@ Notes:
 
 - size: 1~9 byte
 - Fury SLI(Small long as int) Encoding:
-  - If long is in `[0, 2147483647]`, encode as 4 bytes int: `| little-endian: 
((int) value) << 1 |`
-  - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |`
+    - If long is in `[0, 2147483647]`, encode as 4 bytes int: `| 
little-endian: ((int) value) << 1 |`
+    - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |`
 - Fury PVL(Progressive Variable-length Long) Encoding:
-  - positive long format: first bit in every byte indicates whether to have 
the next byte. If first bit is set
+    - positive long format: first bit in every byte indicates whether to have 
the next byte. If first bit is set
       i.e. `b & 0x80 == 0x80`, then the next byte should be read until the 
first bit is unset.
 
 #### signed int64
@@ -422,10 +416,10 @@ Notes:
 
 - size: 1~9 byte
 - Fury SLI(Small long as int) Encoding:
-  - If long is in `[-1073741824, 1073741823]`, encode as 4 bytes int: `| 
little-endian: ((int) value) << 1 |`
-  - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |`
+    - If long is in `[-1073741824, 1073741823]`, encode as 4 bytes int: `| 
little-endian: ((int) value) << 1 |`
+    - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |`
 - Fury PVL(Progressive Variable-length Long) Encoding:
-  - First convert the number into positive unsigned long by `(v << 1) ^ (v >> 
63)` ZigZag algorithm to reduce cost of
+    - First convert the number into positive unsigned long by `(v << 1) ^ (v 
>> 63)` ZigZag algorithm to reduce cost of
       small negative numbers, then encoding it as an unsigned long.
 
 #### float32
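
Note for readers of this diff: the signed int64 bullets re-indented above describe two encodings, SLI and PVL. The following is a minimal Python sketch of both, written only from the wording in the spec text. It is not taken from the Fury code base. All function names (`write_sli`, `read_sli`, `zigzag_encode`, `zigzag_decode`, `write_pvl_unsigned`, `read_pvl_unsigned`) are hypothetical, and the handling of the ninth PVL byte (carrying the remaining 8 bits with no continuation flag) is an assumption made so that a 64-bit value fits in the stated `1~9 byte` range.

```python
import struct

MASK64 = (1 << 64) - 1


def write_sli(v: int) -> bytes:
    """SLI: small values as a 4-byte int shifted left by one, else 9 bytes."""
    if -1073741824 <= v <= 1073741823:
        # Low bit of the first byte is 0, which marks the 4-byte form.
        return struct.pack("<i", v << 1)
    # Flag byte 0b1 followed by the little-endian 8-byte long.
    return b"\x01" + struct.pack("<q", v)


def read_sli(data: bytes) -> int:
    if (data[0] & 1) == 0:
        return struct.unpack("<i", data[:4])[0] >> 1
    return struct.unpack("<q", data[1:9])[0]


def zigzag_encode(v: int) -> int:
    """(v << 1) ^ (v >> 63), truncated to an unsigned 64-bit value."""
    return ((v << 1) ^ (v >> 63)) & MASK64


def zigzag_decode(u: int) -> int:
    return (u >> 1) ^ -(u & 1)


def write_pvl_unsigned(u: int) -> bytes:
    """PVL: 7 payload bits per byte, high bit set when another byte follows."""
    out = bytearray()
    for _ in range(8):
        byte = u & 0x7F
        u >>= 7
        if u == 0:
            out.append(byte)
            return bytes(out)
        out.append(byte | 0x80)
    # Assumption: the ninth byte stores the remaining 8 bits without a flag.
    out.append(u & 0xFF)
    return bytes(out)


def read_pvl_unsigned(data: bytes) -> int:
    result = 0
    for i in range(8):
        b = data[i]
        result |= (b & 0x7F) << (7 * i)
        if (b & 0x80) == 0:
            return result
    return result | (data[8] << 56)
```

A signed value would be written under PVL as `write_pvl_unsigned(zigzag_encode(v))` and read back with `zigzag_decode(read_pvl_unsigned(data))`. Note that the spec's `b & 0x80 == 0x80` needs parentheses in Python, since `==` binds tighter than `&`.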

