[I] [Python] Schema evolution support for type backward/forward compatibility [fury]

via GitHub Thu, 07 Nov 2024 22:53:24 -0800


chaokunyang opened a new issue, #1938:
URL: https://github.com/apache/fury/issues/1938

### Feature Request

If schema evolution mode is enabled globally when creating fury and is also
enabled for the current type, type meta will be written using one of the
following modes. Which mode to use is configured when creating fury.

- Normal mode (meta share not enabled):
  - If type meta hasn't been written before, add `type def` to
    `captured_type_defs`: `captured_type_defs[type def] = map size`.
  - Get the index of the meta in `captured_type_defs` and write that index as
    `| unsigned varint: index |`.
  - After the serialization of the object graph has finished, fury will start
    to write `captured_type_defs`:
    - First, set the `meta start offset` in the fury header to the current
      write position.
    - Then write `captured_type_defs` one by one:

```python
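from copy import copy

# Write the count of non-stub type defs, then each non-stub type def.
buffer.write_var_uint32(
    len(writing_type_defs) - len(schema_consistent_type_def_stubs)
)
for type_meta in writing_type_defs:
    if not type_meta.is_stub():
        type_meta.write_type_def(buffer)
# Start the next serialization with only the schema-consistent stubs.
writing_type_defs = copy(schema_consistent_type_def_stubs)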
```
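The index assignment described in the first two steps of normal mode can be sketched as follows. This is only an illustration, not Fury's implementation: `TypeDefRegistry` is a hypothetical name, type defs are stood in for by plain strings, and the local `write_var_uint32` helper assumes the common 7-data-bits-per-byte varint layout, which may differ from Fury's own unsigned varint encoding.

```python
def write_var_uint32(buf: bytearray, value: int) -> None:
    """Append `value` as an unsigned varint (assumed 7 data bits per byte)."""
    while value >= 0x80:
        buf.append((value & 0x7F) | 0x80)
        value >>= 7
    buf.append(value)


class TypeDefRegistry:
    """Hypothetical stand-in for `captured_type_defs`."""

    def __init__(self):
        self.captured_type_defs = {}  # type def -> index

    def write_type_ref(self, buf: bytearray, type_def) -> int:
        # On first sight the new index is the current map size;
        # on later sights the existing index is reused.
        index = self.captured_type_defs.setdefault(
            type_def, len(self.captured_type_defs)
        )
        write_var_uint32(buf, index)
        return index


registry = TypeDefRegistry()
buf = bytearray()
indices = [registry.write_type_ref(buf, t) for t in ("Foo", "Bar", "Foo")]
```

Here the second occurrence of `"Foo"` reuses the index assigned on its first occurrence, so only the index is written again.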

- Meta share mode: the writing steps are the same as in normal mode, but
`captured_type_defs` will be shared across multiple serializations of
different objects. For example, suppose we have a batch to serialize:

```python
captured_type_defs = {}
stream = ...
# add `Type1` to `captured_type_defs` and write `Type1`
fury.serialize(stream, [Type1()])
# add `Type2` to `captured_type_defs` and write `Type2`; `Type1` was
# written before.
fury.serialize(stream, [Type1(), Type2()])
# `Type1` and `Type2` are written before, no need to write meta.
fury.serialize(stream, [Type1(), Type2()])
```
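To make the bookkeeping in that batch concrete, here is a small sketch that tracks which type metas would need to be written on each call. `MetaShareContext` and `type_defs_to_write` are hypothetical names introduced for illustration, and types are identified by class name only; none of this is part of the fury API.

```python
class MetaShareContext:
    """Hypothetical holder for `captured_type_defs` shared across calls."""

    def __init__(self):
        self.captured_type_defs = {}  # type name -> index


def type_defs_to_write(context, objects):
    """Return the type names whose meta must be written for this call."""
    new_defs = []
    for obj in objects:
        name = type(obj).__name__
        if name not in context.captured_type_defs:
            context.captured_type_defs[name] = len(context.captured_type_defs)
            new_defs.append(name)
    return new_defs


class Type1:
    pass


class Type2:
    pass


context = MetaShareContext()
first = type_defs_to_write(context, [Type1()])            # meta for Type1
second = type_defs_to_write(context, [Type1(), Type2()])  # meta for Type2 only
third = type_defs_to_write(context, [Type1(), Type2()])   # nothing new
```

Because the context outlives a single call, the third serialization writes no meta at all.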

- Streaming mode (streaming mode doesn't support meta share):
  - If type meta hasn't been written before, the data will be written as:

```
| unsigned varint: 0b11111111 | type def |
```

  - If type meta has been written before, the data will be written as:

```
| unsigned varint: written index << 1 |
```

    `written index` is the id in `captured_type_defs`.
  - With this mode, `meta start offset` can be omitted.
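The two streaming layouts can be sketched like this. It is a rough illustration under assumptions: the varint helper uses the common 7-data-bits-per-byte layout, the encoded type def is represented by an opaque `bytes` placeholder, and `write_type_meta_streaming` is a hypothetical name. Note that the marker `0b11111111` is odd while every `written index << 1` is even, so the lowest bit is enough to tell the two cases apart.

```python
NEW_TYPE_DEF_MARKER = 0b11111111  # taken from the layout above


def write_var_uint32(buf: bytearray, value: int) -> None:
    """Append `value` as an unsigned varint (assumed 7 data bits per byte)."""
    while value >= 0x80:
        buf.append((value & 0x7F) | 0x80)
        value >>= 7
    buf.append(value)


def write_type_meta_streaming(buf: bytearray, written: dict, type_def: bytes) -> None:
    """Write a type def inline on first use, or a back-reference afterwards."""
    index = written.get(type_def)
    if index is None:
        written[type_def] = len(written)
        write_var_uint32(buf, NEW_TYPE_DEF_MARKER)
        buf.extend(type_def)  # placeholder for the real encoded type def
    else:
        write_var_uint32(buf, index << 1)


written = {}
buf = bytearray()
write_type_meta_streaming(buf, written, b"A")  # new: marker + def
write_type_meta_streaming(buf, written, b"B")  # new: marker + def
write_type_meta_streaming(buf, written, b"B")  # seen: index 1 << 1
```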

> The normal mode and meta share mode forbid streaming writes, since they
> need to look back to update the start offset after the whole object graph
> has been written and meta collection is finished. Only in this way can we
> ensure that a deserialization failure in meta share mode doesn't lose
> shared meta.

#### Type Def

Here we mainly describe the meta layout for schema evolution mode:

```
+-------------------------------+--------------------+-------------------+----------------+
| 7 bytes hash + 1 bytes header | current type meta  | parent type meta  |      ...       |
```

Type meta is encoded from parent type to leaf type. Only types with
serializable fields will be encoded.

##### Meta header

The meta header is a 64-bit number encoded in little-endian order.

- The lowest 4 bits `0b0000~0b1110` are used to record num classes. `0b1111`
  is reserved to indicate that Fury needs to read more bytes for the length
  using Fury unsigned int encoding. If the current type doesn't have a parent
  type, or the parent type doesn't have fields to serialize, or we're in a
  context which serializes fields of the current type only, num classes will
  be 1.
- The 5th bit is used to indicate whether this type needs schema evolution.
- The other 56 bits are used to store the unique hash of `flags + all layers
  type meta`.
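As a rough sketch of this layout, the header could be packed and unpacked as below. Several details are assumptions not stated above: the 56-bit hash is placed in the upper seven bytes (bits 8 to 63), num classes is stored directly rather than offset by one, and the `0b1111` extended-length case is rejected instead of handled. `pack_meta_header` and `unpack_meta_header` are hypothetical names.

```python
import struct

NUM_CLASSES_MASK = 0b1111      # lowest 4 bits
SCHEMA_EVOLUTION_BIT = 1 << 4  # "the 5th bit"
HASH_SHIFT = 8                 # assumption: hash sits in the upper 56 bits
HASH_MASK = (1 << 56) - 1


def pack_meta_header(num_classes: int, schema_evolution: bool, type_hash: int) -> bytes:
    """Pack the 64-bit meta header in little-endian order."""
    if not 0 <= num_classes <= 0b1110:
        # 0b1111 signals that an extra length follows; not covered here.
        raise ValueError("extended num classes is not handled in this sketch")
    header = num_classes
    if schema_evolution:
        header |= SCHEMA_EVOLUTION_BIT
    header |= (type_hash & HASH_MASK) << HASH_SHIFT
    return struct.pack("<Q", header)


def unpack_meta_header(data: bytes):
    """Return (num_classes, schema_evolution, type_hash)."""
    (header,) = struct.unpack("<Q", data[:8])
    return (
        header & NUM_CLASSES_MASK,
        bool(header & SCHEMA_EVOLUTION_BIT),
        header >> HASH_SHIFT,
    )


data = pack_meta_header(num_classes=1, schema_evolution=True, type_hash=0xABCDEF)
decoded = unpack_meta_header(data)
```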

##### Single layer type meta

```
+-----------------+----------+-------------------------------+-----------------+-----+
| num_fields      | type id  | header + type id + field name | next field info | ... |
```

- num fields: encode `num fields` as unsigned varint.
  - If the current type is schema consistent, then num_fields will be `0` to
    flag it.
  - If the current type isn't schema consistent, then num_fields will be the
    number of compatible fields. For example, users can use a tag id to mark
    some fields as compatible fields in a schema consistent context. In such
    cases, schema consistent fields will be serialized first, then compatible
    fields. At deserialization, Fury will use the field info of the fields
    that aren't annotated with a tag id to deserialize schema consistent
    fields, then use the field info in the meta to deserialize compatible
    fields.
- type id: the registered id for the current type, which will be written as
  an unsigned varint.
- field info:
  - header (8 bits): `3 bits size + 2 bits field name encoding + polymorphism
    flag + nullability flag + ref tracking flag`. Users can use annotations
    to provide this info.
    - 2 bits field name encoding:
      - encoding:
        `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID`
      - If tag id is used, i.e. the field name is written as an unsigned
        varint tag id, the 2 bits encoding will be `11`.
    - size of field name:
      - The `3 bits size: 0~7` will be used to indicate length `1~7`. The
        value `7` indicates that more bytes must be read; the encoding will
        then encode `size - 7` as a varint.
      - If the encoding is `TAG_ID`, then num_bytes of field name will be
        used to store the tag id.
    - ref tracking: when set to 1, ref tracking will be enabled for this
      field.
    - nullability: when set to 1, this field can be null.
    - polymorphism: when set to 1, the actual type of the field will be the
      declared field type even if the type is not `final`.
  - field name: If tag id is set, the tag id will be used instead. Otherwise
    the meta string encoding `[length]` and data will be written.
  - type id:
    - For registered type-consistent classes, it will be the registered type
      id.
    - Otherwise it will be encoded as `OBJECT_ID` if it isn't `final` and
      `FINAL_OBJECT_ID` if it's `final`. The meta for such types is written
      separately instead of being inlined here, which reduces meta space cost
      if an object of this type is serialized multiple times in the current
      object graph; the field value may also be null.
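The 8-bit field header can be illustrated with a small bit-packing sketch. The bit positions are an assumption: the components are laid out from the most significant bit in the order they are listed (size, encoding, polymorphism, nullability, ref tracking). Only `TAG_ID = 0b11` is stated above; the numeric values of the other three encodings are assumed from their listed order, and the function names are hypothetical.

```python
from enum import IntEnum


class FieldNameEncoding(IntEnum):
    # Only TAG_ID = 0b11 is given in the text; the rest follow listing order.
    UTF8 = 0b00
    ALL_TO_LOWER_SPECIAL = 0b01
    LOWER_UPPER_DIGIT_SPECIAL = 0b10
    TAG_ID = 0b11


def pack_field_header(size_bits, encoding, polymorphic, nullable, ref_tracking):
    """Pack the five header components into one byte (assumed bit order)."""
    if not 0 <= size_bits <= 0b111:
        raise ValueError("size must fit in 3 bits")
    return (
        (size_bits << 5)
        | (int(encoding) << 3)
        | (int(polymorphic) << 2)
        | (int(nullable) << 1)
        | int(ref_tracking)
    )


def unpack_field_header(header):
    """Inverse of pack_field_header."""
    return (
        (header >> 5) & 0b111,
        FieldNameEncoding((header >> 3) & 0b11),
        bool(header & 0b100),
        bool(header & 0b010),
        bool(header & 0b001),
    )


header = pack_field_header(3, FieldNameEncoding.TAG_ID, False, True, True)
fields = unpack_field_header(header)
```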

Field order is left as an implementation detail and is not exposed in the
specification; deserialization needs to re-sort fields based on the Fury
field comparator. In this way, fury can compute statistics for field names or
types and use a more compact encoding.

##### Other layers type meta

Same encoding algorithm as the previous layer.

### Is your feature request related to a problem? Please describe

_No response_

### Describe the solution you'd like

_No response_

### Describe alternatives you've considered

_No response_

### Additional context

_No response_

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
