PookieBuns opened a new issue, #449:
URL: https://github.com/apache/avro-rs/issues/449
Hi maintainers! First of all, I would love to thank you for your work
bringing apache avro into the Rust ecosystem. I am truly grateful. I have been
trying to adopt avro more at work, but in its current state, it is quite
brittle, especially with respect to using serde with Unions that contain a null
variant. I plan on contributing to fix these issues, but before then I would
like some input from the maintainers with respect to what we would like to
support wrt this crate.
# TLDR
- There are 3 ways to represent avro nullable types in Rust, some work and
some don't, but I'd like input on which ones we'd like to support
- For combination nullable unions, the rusty way requires
`#[serde(untagged)]` to work but it fails on multiple edge cases. Would like
input on which [plan](#ways-to-address) to address is better direction for this
crate
# The Rust Representation Conundrum
Given an schema of
```
{
"name": "MyUnion",
"type": [
"null",
"int"
]
}
```
There are currently 3 arguable ways to represent this with Rust types
## The current `avro-rs` way
```
enum MyUnionNullable {
Null,
Int(i32),
}
```
### Advantages
This aligns closest to how avro binary encoding works and is simplest to
serialize. Use the variant index from serde to properly encode the branch index
then encode the contained value
### Disadvantages
This is not "rusty", as null or missing values are generally represented as
`Option<T>`, and by doing this you lose the nice semantics of `Option`.
This also isn't the proper avro json encoding if you choose to serialize it
with `serde_json`.
According to the avro specification,
> The value of a union is encoded in JSON as follows:
>
> - if its type is null, then it is encoded as a JSON null;
> - otherwise it is encoded as a JSON object with one name/value pair whose
name is the type’s name and whose value is the recursively encoded value.
For Avro’s named types (record, fixed or enum) the user-specified name is
used, for other types the type name is used.
Which means that the branch should be encoded as `null` and the int branch
should be encoded as `{ "int": 1 }`, but in this representation the null branch
is encoded as `"Null"`.
## The "Rusty" way
```
Option<i32>
```
### Advantages
This is the most intuitive for Rust users
### Disadvantages
This also isn't the proper avro json encoding if you choose to serialize it
with `serde_json` because the int branch gets encoded as `1` without the
discriminator
## The Anchor On Avro Json Encoding And Serde Json Way
```
enum MyUnionAvroJsonEncoding {
#[serde(rename = "int")]
Int(i32),
}
Option<MyUnionAvroJsonEncoding
```
### Advantages
This is technically the most "correct" representation because it serializes
exactly according to the avro json encoding specification
### Disadvantages
This is even worse looking than the `avro-rs` way
# Implications
This begs the question of
**What variations of these Rust representations should we support in our
serializer and deserializer, and is it even possible to support said
representation?**
As of today, 1 and 2 are fully supported. EXCEPT, for nullable enums, thanks
to me causing a regression with #337 (Which is unreleased so I guess not so
bad). Not going to get into the details but basically Enums and Unions are the
same type in Rust, and with serde you can't tell if you're unit variant is
serializing an Enum like type or a Union like type.
IMHO the current plan of only supporting 1 and 2 for 2 variant nullable
unions is probably sufficient, and avro json encoding should not be done by
serde json but instead should be exposed by `apache-avro` with its own
serialization of converting an `apache-avro` value to a `serde_json` value.
# But what about Nullable Combination Unions?
```
{
"name": "MyUnion",
"type": [
"null",
"int",
{
"type": "enum",
"name": "MyEnum",
"symbols": ["A", "B"]
},
{
"type": "record",
"name": "MyRecord",
"fields": [
{"name": "a", "type": "int"}
]
}
]
}
```
There are in my opinion, two ways to represent this in rust
## The current `avro-rs` way
```
enum MyEnum {
A,
B,
}
struct MyRecord {
a: i32,
}
enum MyUnionNullable {
Null,
Int(i32),
MyEnum(MyEnum),
MyRecord(MyRecord),
}
```
## The "Rusty" way
```
enum MyEnum {
A,
B,
}
struct MyRecord {
a: i32,
}
enum MyUnion {
Int(i32),
MyEnum(MyEnum),
MyRecord(MyRecord),
}
Option<MyUnion>
```
# Implications
My #337 PR was actually aimed to solve this problem. Before said PR, the
"rusty" way of representing combination unions actually didn't work at all
since it would immediately pick the first variant of of the schema and try to
select it.
HOWEVER, my PR was an mediocre fix at best because in order for it to be
supported, `#[serde(untagged)]` was required to be decorated on the type, and
it would break in 2 edge cases.
1. Enums didn't work. This is because of the same problem that I mentioned
regarding the regression that I created
2. Because of untagged, deserialization would deserialize to the incorrect
variant if their fields were identical, or if an earlier variant could "serde"
deserialize a later variant with its Option semantics, example shown below
```
const NULLABLE_RECORD_SCHEMA: &str = r#"
{
"name": "MyUnion",
"type": [
"null",
{
"type": "record",
"name": "MyRecordA",
"fields": [
{"name": "a", "type": "int"}
]
},
{
"type": "record",
"name": "MyRecordB",
"fields": [
{"name": "a", "type": "int"}
]
}
]
}
"#;
#[derive(Debug, Serialize, Deserialize, PartialEq)]
struct MyRecordA {
a: i32,
}
#[derive(Debug, Serialize, Deserialize, PartialEq)]
struct MyRecordB {
a: i32,
}
#[derive(Debug, Serialize, Deserialize, PartialEq)]
#[serde(untagged)]
enum MyUnionUntagged {
MyRecordA(MyRecordA),
MyRecordB(MyRecordB),
}
```
In the above example, an avro of variant `MyRecordB` will always deserialize
to `MyRecordA` because untagged means there's no discriminant and the best
effort deserialization chooses MyRecordA
## Ways to address
The goals of my 2 proosals here are to get rid of the reliance on
#[serde(untagged)] for serialization. I haven't dove into the deserialization
yet, but rest assured, these plans will not become a PR unless a roundtrip can
be successful.
1. Lookup Union variants by name rather than index. serde
serialize_x_variant provides the name of the variant branch you are in. Instead
of lookup by variant_index, lookup by variant_name to match the schema. Unnamed
types like primitves in avro will just follow the name as provided by json
encoding. This change would allow enums to be in whatever order they want, but
add a restriction that names must strictly follow their primitive name i.e. Int
String Long. This would be a breaking change since previous users would now
need to rename their variants to align with the naming convention. Also there
may be a potential performance hit because we need to scan our union schemas to
match, although I'm not sure how relevant that is. We could also technically
only do Lookup by Name if we use the Option<Enum> representation, since we
would be notified through `serialize_some`
2. Still use index based lookup, BUT make the assumption that if you use
Option<Enum>, The null variant must be the 0th branch, and properly bump index
by 1 when you arrive at "serialize_x_variant." However, this would mean that if
you use Option<Enum>, you are restricted to having a certain schema shape, but
in the vast majority of cases, null is always the first variant (I recall its
mentioned in the spec that this is standard practice)
# Final thoughts
`serde` and `avro` don't play the nicest together, but its what we have. My
goal is for serialization and deserialization to work properly for everyone
trying to use the crate while trying to follow idiomatic rust as possible.
Sorry for the long post, but thanks for your attention.
# Appendix
I attached a comprehensive minimal test code that documents the different
successes and failures of representations of unions with null variant in rust.
(GH doesn't let me upload an rs file so its in txt)
[test_nullable_union.txt](https://github.com/user-attachments/files/24978848/test_nullable_union.txt)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]