chaokunyang opened a new issue, #1939:
URL: https://github.com/apache/fury/issues/1939

   ### Feature Request
   
   If schema evolution mode is enabled globally when creating Fury and is also enabled
   for the current type, type meta is written using one of the following modes. The
   mode is chosen when creating Fury.
   
   - Normal mode (meta share not enabled):
     - If the type meta hasn't been written before, add `type def`
         to `captured_type_defs`: `captured_type_defs[type def] = map size`.
     - Get the index of the meta in `captured_type_defs` and write that index as
       `| unsigned varint: index |`.
     - After the object graph has been serialized, Fury writes `captured_type_defs`:
       - First, set `meta start offset` in the Fury header to the current write position.
       - Then write `captured_type_defs` one by one:
   
         ```python
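         from copy import copy

         # Write how many non-stub type defs follow, then each of them.
         buffer.write_var_uint32(
             len(writing_type_defs) - len(schema_consistent_type_def_stubs)
         )
         for type_meta in writing_type_defs:
             if not type_meta.is_stub():
                 type_meta.write_type_def(buffer)
         # Afterwards keep only the schema-consistent stubs.
         writing_type_defs = copy(schema_consistent_type_def_stubs)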
         ```
   
   - Meta share mode: the writing steps are the same as in normal mode, but
     `captured_type_defs` is shared across multiple serializations of different
     objects. For example, suppose we have a batch to serialize:
   
       ```python
       captured_type_defs = {}
       stream = ...
       # Adds `Type1` to `captured_type_defs` and writes `Type1`.
       fury.serialize(stream, [Type1()])
       # Adds `Type2` to `captured_type_defs` and writes `Type2`;
       # `Type1` was already written.
       fury.serialize(stream, [Type1(), Type2()])
       # `Type1` and `Type2` were already written, so no meta needs to be written.
       fury.serialize(stream, [Type1(), Type2()])
       ```
   
   - Streaming mode (streaming mode doesn't support meta share):
     - If type meta hasn't been written before, the data will be written as:
   
         ```
         | unsigned varint: 0b11111111 | type def |
         ```
   
     - If type meta has been written before, the data will be written as:
   
         ```
         | unsigned varint: written index << 1 |
         ```
   
         `written index` is the id in `captured_type_defs`.
     - With this mode, `meta start offset` can be omitted.
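
   The two streaming-mode layouts can be sketched as follows. This is a minimal
   illustration, not Fury's implementation: `write_var_uint`,
   `write_type_meta_streaming`, and the byte-string type defs are hypothetical, and
   the varint is assumed to use 7 data bits per byte with the high bit as a
   continuation flag.

```python
def write_var_uint(buf: bytearray, value: int) -> None:
    """Append `value` as an unsigned varint (assumed: 7 data bits per byte)."""
    while value >= 0x80:
        buf.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    buf.append(value)


def write_type_meta_streaming(buf, captured_type_defs, type_name, type_def):
    """Write type meta inline using the two streaming-mode layouts."""
    index = captured_type_defs.get(type_name)
    if index is None:
        # First occurrence: remember its index, then write marker + type def.
        captured_type_defs[type_name] = len(captured_type_defs)
        write_var_uint(buf, 0b11111111)
        buf += type_def
    else:
        # Already written: only `written index << 1` is emitted.
        write_var_uint(buf, index << 1)


buf = bytearray()
captured = {}
write_type_meta_streaming(buf, captured, "Type1", b"<Type1 def>")
write_type_meta_streaming(buf, captured, "Type2", b"<Type2 def>")
write_type_meta_streaming(buf, captured, "Type2", b"<Type2 def>")
```

   In this sketch `index << 1` always has its lowest bit clear while `0b11111111`
   has it set, so a reader can tell the two layouts apart from the first varint.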
   
   > Normal mode and meta share mode forbid streaming writes, because the writer must
   > look back and update the start offset after the whole object graph has been written
   > and meta collection has finished. Only in this way can we ensure that a
   > deserialization failure in meta share mode doesn't lose shared meta.
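
   To illustrate why these modes need to look back, the sketch below reserves a slot
   for `meta start offset`, writes indices while serializing, then appends the newly
   captured type defs and patches the slot. Everything concrete here is an assumption
   made for illustration: the 4-byte little-endian slot at position 0, the
   `serialize_normal` helper, the single bytes standing in for unsigned varints, and
   the byte-string type defs.

```python
import struct


def serialize_normal(type_names, type_defs, captured_type_defs):
    """Return [4-byte meta start offset][one index per object][count][new type defs]."""
    buf = bytearray(4)  # placeholder for `meta start offset`, patched below
    already_written = len(captured_type_defs)
    for name in type_names:
        if name not in captured_type_defs:
            # captured_type_defs[type def] = map size
            captured_type_defs[name] = len(captured_type_defs)
        buf.append(captured_type_defs[name])  # stand-in for `unsigned varint: index`
    # Look back: record where the type defs start.
    struct.pack_into("<I", buf, 0, len(buf))
    new_names = [n for n, i in captured_type_defs.items() if i >= already_written]
    buf.append(len(new_names))  # stand-in for a var uint count
    for name in new_names:
        buf += type_defs[name]
    return bytes(buf)


type_defs = {"Type1": b"<Type1 def>", "Type2": b"<Type2 def>"}

# Normal mode would pass a fresh map per call; meta share mode reuses one map.
shared = {}
first = serialize_normal(["Type1"], type_defs, shared)
second = serialize_normal(["Type1", "Type2"], type_defs, shared)
third = serialize_normal(["Type1", "Type2"], type_defs, shared)
```

   With the shared map, the second call carries only the def for `Type2` and the
   third call carries none, which mirrors the batch example above.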
   
   #### Type Def
   
   Here we mainly describe the meta layout for schema evolution mode:
   
   ```
   |      8 bytes meta header      |   variable bytes   |  variable bytes   | variable bytes |
   +-------------------------------+--------------------+-------------------+----------------+
   | 7 bytes hash + 1 bytes header |  current type meta |  parent type meta |      ...       |
   ```
   
   Type meta is encoded from parent type to leaf type; only types with serializable
   fields are encoded.
   
   ##### Meta header
   
   Meta header is a 64-bit number encoded in little-endian order.
   
   - The lowest 4 bits, `0b0000~0b1110`, record num classes. `0b1111` is reserved to
     indicate that Fury needs to read more bytes for the length using Fury unsigned int
     encoding. If the current type doesn't have a parent type, or the parent type doesn't
     have fields to serialize, or we're in a context that serializes fields of the current
     type only, num classes will be 1.
   - The 5th bit indicates whether this type needs schema evolution.
   - The other 56 bits store the unique hash of `flags + all layers type meta`.
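
   As a rough sketch of this layout, the functions below pack and unpack the three
   documented parts of the header. The bit positions are assumptions: num classes in
   bits 0-3 (stored as the count itself), the schema evolution flag in bit 4 (reading
   "the 5th bit" as 1-based), and the 56-bit hash in the upper 7 bytes as the layout
   diagram suggests. `pack_meta_header`, `unpack_meta_header`, and the sample hash
   are hypothetical.

```python
import struct

NUM_CLASSES_EXTENDED = 0b1111  # reserved: the real count is read from extra bytes


def pack_meta_header(num_classes: int, schema_evolution: bool, hash56: int) -> bytes:
    """Pack the meta header into 8 little-endian bytes (bit positions assumed)."""
    if not 0 <= hash56 < (1 << 56):
        raise ValueError("hash must fit in 56 bits")
    # Counts of 15 or more would need the extended length, not handled here.
    low4 = min(num_classes, NUM_CLASSES_EXTENDED)
    header = low4 | (int(schema_evolution) << 4) | (hash56 << 8)
    return struct.pack("<Q", header)


def unpack_meta_header(data: bytes):
    """Return (low 4 bits, schema evolution flag, 56-bit hash)."""
    (header,) = struct.unpack("<Q", data[:8])
    return header & 0b1111, bool((header >> 4) & 1), header >> 8


encoded = pack_meta_header(1, True, 0xABCDEF12345678)
decoded = unpack_meta_header(encoded)
```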
   
   ##### Single layer type meta
   
   ```
   | unsigned varint | var uint |  field info: variable bytes   | variable bytes  | ... |
   +-----------------+----------+-------------------------------+-----------------+-----+
   |   num_fields    | type id  | header + type id + field name | next field info | ... |
   ```
   
   - num fields: encode `num fields` as an unsigned varint.
     - If the current type is schema consistent, then num_fields will be `0` to flag it.
     - If the current type isn't schema consistent, then num_fields will be the number of
         compatible fields. For example, users can use a tag id to mark some fields as
         compatible fields in a schema consistent context. In such cases, schema consistent
         fields are serialized first, then compatible fields. At deserialization, Fury uses
         the field info of the fields that aren't annotated with a tag id to deserialize
         schema consistent fields, then uses the field info in the meta to deserialize
         compatible fields.
   - type id: the registered id for the current type, written as an unsigned varint.
   - field info:
     - header (8 bits):
         `3 bits size + 2 bits field name encoding + polymorphism flag + nullability flag + ref tracking flag`.
         Users can use annotations to provide this info.
       - 2 bits field name encoding:
         - encoding: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID`
         - If a tag id is used, i.e. the field name is written as an unsigned varint tag
           id, the 2 bits encoding will be `11`.
       - size of field name:
         - The `3 bits size` values `0~6` indicate length `1~7`; the value `7` indicates
           that more bytes must be read, and `size - 7` is encoded next as a varint.
         - If encoding is `TAG_ID`, then num_bytes of field name will be used to store
           the tag id.
       - ref tracking: when set to 1, ref tracking is enabled for this field.
       - nullability: when set to 1, this field can be null.
       - polymorphism: when set to 1, the actual type of the field will be the declared
         field type even if the type is not `final`.
     - field name: if a tag id is set, the tag id is used instead. Otherwise the meta
         string encoding `[length]` and data are written.
     - type id:
       - For registered type-consistent classes, it will be the registered type id.
       - Otherwise it will be encoded as `OBJECT_ID` if the type isn't `final` and
         `FINAL_OBJECT_ID` if it's `final`. The meta for such types is written
         separately rather than inlined here, to reduce meta space cost when objects of
         this type are serialized multiple times in the current object graph; the field
         value may also be null.
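
   The 8-bit field header can be sketched as below. The bit order is an assumption
   that simply follows the order in which the parts are listed (size in the top 3
   bits, ref tracking in the lowest bit), and only `TAG_ID = 0b11` is stated in the
   text; the other enum values follow the listed order. `FieldNameEncoding`,
   `pack_field_header`, and `unpack_field_header` are hypothetical names, and the
   size is passed as the raw 3-bit value without the extended-length case.

```python
from enum import IntEnum


class FieldNameEncoding(IntEnum):
    # Only TAG_ID = 0b11 is given explicitly; the rest follow the listed order.
    UTF8 = 0b00
    ALL_TO_LOWER_SPECIAL = 0b01
    LOWER_UPPER_DIGIT_SPECIAL = 0b10
    TAG_ID = 0b11


def pack_field_header(size_bits, encoding, polymorphic, nullable, ref_tracking):
    """Pack the 8-bit field header; the bit order (size highest) is assumed."""
    if not 0 <= size_bits <= 0b111:
        raise ValueError("size must fit in 3 bits")
    return (
        (size_bits << 5)
        | (int(encoding) << 3)
        | (int(polymorphic) << 2)
        | (int(nullable) << 1)
        | int(ref_tracking)
    )


def unpack_field_header(header):
    """Split a header byte back into its five parts."""
    return (
        header >> 5,
        FieldNameEncoding((header >> 3) & 0b11),
        bool(header & 0b100),
        bool(header & 0b010),
        bool(header & 0b001),
    )


header = pack_field_header(0b010, FieldNameEncoding.TAG_ID, False, True, True)
```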
   
   Field order is left as an implementation detail and is not exposed in the
   specification; deserialization needs to re-sort fields based on the Fury field
   comparator. In this way, Fury can compute statistics for field names or types and
   use a more compact encoding.
   
   ##### Other layers type meta
   
   Same encoding algorithm as the previous layer.
   
   ### Is your feature request related to a problem? Please describe
   
   _No response_
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   #Meta Enc


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

