Kriskras99 commented on issue #365:
URL: https://github.com/apache/avro-rs/issues/365#issuecomment-3730627801

   Over the past month I've done a lot of thinking about how to improve the 
enum support in Avro and I think I finally have something that can work. It's 
something that's workable in both the derive code and the general code, and 
remains compatible with current schemas.
   
   # Current state of Rust enum support in Avro
   
   ## Plain enums
   Plain enums are serialized as Avro enums (the only kind of enum currently supported by
   `#[derive(AvroSchema)]`):
   ```rust
   pub enum Foo {
       A,
       B,
       C,
   }
   ```
   ```json
   {
     "name": "Foo",
     "type": "enum",
     "symbols": ["A", "B", "C"]
   }
   ```
   
   ## Data enums
   Data enums are serialized as a record with a discriminator field (Avro enum) 
and a value field (Avro union):
   ```rust
   pub struct Bar {
       integer: i32,
   }
   pub enum Foo {
       A {
           field: String,
       },
       B(Bar)
   }
   ```
   ```json
   {
     "name": "Foo",
     "type": "record",
     "fields": [
       {
         "name": "type",
         "type": {
           "type": "enum",
           "symbols": ["A", "B"]
         }
       },
       {
         "name": "value",
         "type": [
           {
             "type": "record",
             "fields": [
               {
                 "name": "field",
                 "type": "string"
               }
             ]
           },
           {
             "name": "Bar",
             "type": "record",
             "fields": [
               {
                 "name": "integer",
                 "type": "int"
               }
             ]
           }
         ]
       }
     ]
   }
   ```
   The advantage of this approach is that it works for enums where multiple variants have the
   same type. It does not currently work with mixed enums, as a unit variant will always be
   encoded as an Avro enum.
   
   ## Options
   Options have special support in the encoding logic to always produce a bare 
union.
   ```rust
   type Foo = Option<String>;
   ```
   ```json
   [
     "null",
     "string"
   ]
   ```
   
   # Alternative representations
   These alternative representations support mixed enums.
   
   ## Bare union
   If all types are unique, then a regular union can be used:
   ```rust
   pub enum Foo {
       A {
           field: String,
       },
       B(Bar),
       C,
   }
   ```
   ```json
   [
     {
       "type": "record",
       "name": "A",
       "fields": [
         {
           "name": "field",
           "type": "string"
         }
       ]
     },
     {
       "name": "Bar",
       "type": "record",
       "fields": [
         {
           "name": "integer",
           "type": "int"
         }
       ]
     },
     "null"
   ]
   ```
   Using the variant name as the namespace can prevent collisions for named schema types.
   However, this doesn't work for unnamed types.
   
   ## Union with a record for every variant
   ```rust
   pub enum Foo {
       A {
           field: String,
       },
       B(Bar),
       C,
   }
   ```
   ```json
   [
     {
       "type": "record",
       "name": "A",
       "fields": [
         {
           "name": "field",
           "type": "string"
         }
       ]
     },
     {
       "type": "record",
       "name": "B",
       "fields": [
         {
           "name": "inner",
           "type": {
             "name": "Bar",
             "type": "record",
             "fields": [
               {
                 "name": "integer",
                 "type": "int"
               }
             ]
           }
         }
       ]
     },
     {
       "type": "record",
       "name": "C",
       "fields": [
         {
           "name": "inner",
           "type": "null"
         }
       ]
     }
   ]
   ```
   This representation will always work and is just as efficient as the bare union when binary
   encoded. However, the schema definition becomes larger, which inflates both the JSON and
   the in-memory representation.
   
   # Proposal
   
   ## Deriving
   It would be good to broaden the range of enums we can (de)serialize. I suggest the
   following schema derive strategy:
   1. If it is a plain enum, emit an Avro enum
   2. If it is a mixed/data enum, try to emit an Avro union
   3. If that fails, emit an Avro record with a `type` and `value` field
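
The three steps above can be sketched as a small decision function. This is only an illustration under assumptions: `VariantShape`, `Strategy` and `choose_strategy` are hypothetical names, not part of the crate, and "try to emit an Avro union" is approximated by checking that every branch has a distinct key (the full name for named types, the type name for unnamed ones, `null` for unit variants).

```rust
use std::collections::HashSet;

/// Hypothetical, simplified description of one enum variant's payload.
#[derive(Debug, PartialEq)]
enum VariantShape {
    /// Unit variant such as `C`; it would become a `null` union branch.
    Unit,
    /// Variant carrying data. The string is the key that distinguishes union
    /// branches: the name for named types, the type name for unnamed ones.
    Data(String),
}

/// Which representation the derive would pick (hypothetical).
#[derive(Debug, PartialEq)]
enum Strategy {
    AvroEnum,
    BareUnion,
    TaggedRecord,
}

fn choose_strategy(variants: &[VariantShape]) -> Strategy {
    // Step 1: a plain enum maps to an Avro enum.
    if variants.iter().all(|v| *v == VariantShape::Unit) {
        return Strategy::AvroEnum;
    }
    // Step 2: a union is only usable if every branch is distinguishable.
    let mut seen = HashSet::new();
    let unique = variants.iter().all(|v| {
        let key = match v {
            VariantShape::Unit => "null",
            VariantShape::Data(key) => key.as_str(),
        };
        // `insert` returns false when the key was already present.
        seen.insert(key)
    });
    // Step 3: otherwise fall back to the record with `type` and `value` fields.
    if unique {
        Strategy::BareUnion
    } else {
        Strategy::TaggedRecord
    }
}

fn main() {
    // The mixed `Foo` enum from the examples: A { field }, B(Bar), C.
    let mixed = [
        VariantShape::Data("A".into()),
        VariantShape::Data("Bar".into()),
        VariantShape::Unit,
    ];
    println!("{:?}", choose_strategy(&mixed));
}
```

In this sketch, two variants that both wrap a `String`, or two unit variants in a mixed enum, produce duplicate keys and fall through to the tagged record.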
   
   ## Encoding/decoding
   When encoding or decoding we would just look at the schema to see what is expected. This
   does mean changing the signature of `to_value(value: S) -> Result<Value, Error>` to
   `to_value(value: S, schema: &Schema) -> Result<Value, Error>`. We could also put a bound on
   `S` that it has to implement `AvroSchema`, but that's not good performance-wise as the
   schema cannot be cached.
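
To illustrate what "look at the schema to see what is expected" could mean, here is a minimal sketch with toy types. The `Schema` and `Value` enums and `encode_unit_variant` below are stand-ins invented for this example, not the crate's real types or API. It shows the same unit variant being encoded differently depending on the schema passed in.

```rust
/// Toy stand-in for a schema; not the crate's real `Schema` type.
#[derive(Debug, PartialEq)]
enum Schema {
    /// Avro enum with its symbols.
    Enum(Vec<&'static str>),
    /// Avro union; each branch is identified by name, "null" for a unit variant.
    Union(Vec<&'static str>),
}

/// Toy stand-in for an encoded value.
#[derive(Debug, PartialEq)]
enum Value {
    /// Symbol index and symbol name.
    Enum(usize, &'static str),
    /// Branch index and the value of that branch.
    Union(usize, Box<Value>),
    Null,
}

/// Encode a unit variant by inspecting what the schema expects.
fn encode_unit_variant(name: &'static str, schema: &Schema) -> Result<Value, String> {
    match schema {
        // Plain enum schema: the variant becomes a symbol.
        Schema::Enum(symbols) => symbols
            .iter()
            .position(|s| *s == name)
            .map(|i| Value::Enum(i, name))
            .ok_or_else(|| format!("symbol {name} not in enum")),
        // Union schema: the variant becomes the `null` branch.
        Schema::Union(branches) => branches
            .iter()
            .position(|b| *b == "null")
            .map(|i| Value::Union(i, Box::new(Value::Null)))
            .ok_or_else(|| format!("no null branch for unit variant {name}")),
    }
}

fn main() {
    let as_enum = Schema::Enum(vec!["A", "B", "C"]);
    let as_union = Schema::Union(vec!["A", "Bar", "null"]);
    println!("{:?}", encode_unit_variant("C", &as_enum));
    println!("{:?}", encode_unit_variant("C", &as_union));
}
```

Because the schema is passed by reference, the caller decides where it comes from and can build it once and reuse it.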
   
   P.S. While thinking about this problem, it occurred to me that it's possible 
that the `AvroSchema` derive can produce
   invalid schemas (fields with the same name because of a `rename`, 
`Option<T>` where `T`'s schema is `null`). Users who
   are creating a schema by hand (either completely or generating schemas in 
code) can of course also have this issue. I
   think it would be a good idea to add a `Schema::validate(&self) -> 
Result<(), Error>` function so we can validate the
   generated schema in the derive, and users can check their own schemas.
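
A rough sketch of what such a validation could check, covering only the two problems mentioned above (duplicate field names and a union with more than one `null`). The `Schema` enum here is a toy type defined for the example and `validate` is a free function, so none of this reflects the crate's actual types or error handling.

```rust
use std::collections::HashSet;

/// Toy schema type for illustration; not the crate's real `Schema`.
#[derive(Debug)]
enum Schema {
    Null,
    String,
    Record { name: String, fields: Vec<(String, Schema)> },
    Union(Vec<Schema>),
}

/// Recursively check a schema for the two problems described above.
fn validate(schema: &Schema) -> Result<(), String> {
    match schema {
        Schema::Null | Schema::String => Ok(()),
        Schema::Record { name, fields } => {
            let mut seen = HashSet::new();
            for (field_name, field_schema) in fields {
                // Two fields ending up with the same name, e.g. after a `rename`.
                if !seen.insert(field_name.as_str()) {
                    return Err(format!("record {name}: duplicate field {field_name}"));
                }
                validate(field_schema)?;
            }
            Ok(())
        }
        Schema::Union(branches) => {
            // `Option<T>` where `T`'s schema is `null` yields ["null", "null"].
            let nulls = branches.iter().filter(|b| matches!(b, Schema::Null)).count();
            if nulls > 1 {
                return Err("union contains more than one null branch".to_string());
            }
            branches.iter().try_for_each(validate)
        }
    }
}

fn main() {
    let option_of_null = Schema::Union(vec![Schema::Null, Schema::Null]);
    println!("{:?}", validate(&option_of_null));
}
```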
   


