GitHub user spetz created a discussion: Iggy improvements for current & future 
use-cases

While working on Iggy, as well as using it in existing projects, I thought of 
the following improvements (some of them will be breaking changes), related to 
both the SDK and the server.

### Optional message key for partitioning

Currently, to distribute messages across multiple partitions while ensuring 
that records sharing the same arbitrary key (e.g. `user_id`, `order_id` etc.) 
are always appended to the same partition (which guarantees the ordering of 
the related messages), the user has to choose `MessagesKey`. The key is an 
arbitrary array of bytes, which is hashed on the server side, and the 
destination partition is simply calculated as `hash(key) % partitions_count`:
```rust
pub enum PartitioningKind {
    /// The partition ID is calculated by the server using the round-robin algorithm.
    #[default]
    Balanced,
    /// The partition ID is provided by the client.
    PartitionId,
    /// The partition ID is calculated by the server using the hash of the provided messages key.
    MessagesKey,
}
```

However, the main downside is that the key is applied to the whole batch of 
messages, so all the records (even if there are 1000 of them in a single 
batch) are always appended to the same partition. This design makes it 
unusable for a scenario in which a single batch contains records that are 
totally unrelated to each other and should be distributed across many 
partitions, while still preserving the ordering of the records that share a 
key.

In other solutions like Kafka, the key can be assigned to each message 
separately, and this should also be possible for Iggy. I think that we would 
need to extend the existing message builder with the mentioned optional key, 
and then figure out how to include the "keyed messages" in `IggyMessagesBatch` 
to ensure maximum performance & efficiency with the existing zero-copy 
(de)serialization.

As for the expected behavior, the server should accept any batch of the 
messages, where some of them could have arbitrary keys assigned, while the 
other ones don't need to - in such a case, the remaining messages (without the 
key) would probably use the existing `balanced` mode (round-robin assignment of 
partition ID on the server-side) as the default fallback.
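To make the expected behavior concrete, here is a minimal sketch of per-message partition assignment. The `KeyedMessage` and `PartitionAssigner` types are hypothetical and not part of the existing Iggy code base, the partition index is 0-based purely for illustration, and the standard library's `DefaultHasher` only stands in for whatever hash function the server actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical message shape: the optional per-message key is the proposed addition.
pub struct KeyedMessage {
    pub key: Option<Vec<u8>>,
    pub payload: Vec<u8>,
}

/// Hypothetical server-side helper that picks a partition for each message.
pub struct PartitionAssigner {
    partitions_count: u32,
    next_round_robin: u32,
}

impl PartitionAssigner {
    pub fn new(partitions_count: u32) -> Self {
        assert!(partitions_count > 0, "at least one partition is required");
        Self {
            partitions_count,
            next_round_robin: 0,
        }
    }

    /// Returns a 0-based partition index for a single message.
    pub fn assign(&mut self, message: &KeyedMessage) -> u32 {
        match &message.key {
            // Keyed message: hash(key) % partitions_count, so equal keys
            // always land in the same partition.
            Some(key) => {
                // Stand-in hash; a real server would need an algorithm that
                // is stable across versions and platforms.
                let mut hasher = DefaultHasher::new();
                key.hash(&mut hasher);
                (hasher.finish() % u64::from(self.partitions_count)) as u32
            }
            // No key: fall back to round-robin, as in the `balanced` mode.
            None => {
                let index = self.next_round_robin;
                self.next_round_robin = (index + 1) % self.partitions_count;
                index
            }
        }
    }
}
```

In this sketch the keyed messages do not advance the round-robin counter, so mixing keyed and unkeyed messages in one batch keeps the unkeyed ones evenly spread. Whether the server should behave that way is an open design question.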

### Additional flags in message header

I guess it makes sense to reserve a few more bytes in the `IggyMessageHeader` 
and, for starters, include a single field called `format` of `u8` size. It 
could carry a numeric variant mapped to the (de)serialization format in use, 
e.g. `custom=0`, `json=1`, `avro=2` etc. This could be helpful in the future 
when implementing the schema registry or data transformations, as the server 
would not have to rely on some custom, hardcoded header key to interpret the 
message format appropriately.

```rust
pub struct IggyMessageHeader {
    pub checksum: u64,
    pub id: u128,
    pub offset: u64,
    pub timestamp: u64,
    pub origin_timestamp: u64,
    pub user_headers_length: u32,
    pub payload_length: u32,
    pub format: u8, // Optional serialization format
}
```
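As a rough illustration of how the `format` byte could be interpreted, the sketch below maps the numeric values from the example above to an enum. The `MessageFormat` name and the choice to return the unknown byte as the error are assumptions made for this example only.

```rust
use std::convert::TryFrom;

/// Hypothetical enum; the numeric values follow the `custom=0`, `json=1`,
/// `avro=2` example from the proposal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum MessageFormat {
    Custom = 0,
    Json = 1,
    Avro = 2,
}

impl TryFrom<u8> for MessageFormat {
    /// The unrecognized byte is handed back to the caller.
    type Error = u8;

    fn try_from(value: u8) -> Result<Self, Self::Error> {
        match value {
            0 => Ok(Self::Custom),
            1 => Ok(Self::Json),
            2 => Ok(Self::Avro),
            other => Err(other),
        }
    }
}

impl From<MessageFormat> for u8 {
    fn from(format: MessageFormat) -> u8 {
        // Safe because the enum is `#[repr(u8)]` with explicit discriminants.
        format as u8
    }
}
```

Returning the raw byte for unknown values would let an older server or client pass through formats it does not recognize instead of rejecting the message.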

### Allow non-string header key

It's always been possible to attach custom metadata to a message (similar to 
HTTP headers), which can be used for any internal purposes by the client apps 
(and maybe also in the future by the server itself):
```rust
user_headers: Option<HashMap<HeaderKey, HeaderValue>>
```

```rust
pub struct HeaderValue {
    /// The kind of the header value.
    pub kind: HeaderKind,
    /// The binary value of the header payload.
    #[serde_as(as = "Base64")]
    pub value: Bytes,
}
```

```rust
/// Current design
pub struct HeaderKey(String);
```

While `HeaderValue` allows specifying one of the available types (e.g. raw 
binary, string, u8 etc.), the `HeaderKey` must be a string. I think that for 
the overall flexibility of the solution + micro performance optimizations, it 
should be possible to use the same struct for the key as for its value and e.g. 
send a header `1=2` where both types are `u8`.

```rust
/// New design
pub struct HeaderKey {
    /// The kind of the header key.
    pub kind: HeaderKind,
    /// The binary value of the header key.
    #[serde_as(as = "Base64")]
    pub value: Bytes,
}
```
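For illustration, the `1=2` header could look as follows. This is a simplified sketch: `HeaderField` is a hypothetical shared struct, `Vec<u8>` replaces `Bytes`, the serde attributes are omitted, and `HeaderKind` lists only three variants. Note that the key type has to implement `Hash` and `Eq` to remain usable in a `HashMap`.

```rust
use std::collections::HashMap;

/// Simplified subset of header kinds, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum HeaderKind {
    Raw,
    String,
    Uint8,
}

/// Hypothetical struct shared by keys and values.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct HeaderField {
    pub kind: HeaderKind,
    pub value: Vec<u8>,
}

impl HeaderField {
    pub fn uint8(value: u8) -> Self {
        Self {
            kind: HeaderKind::Uint8,
            value: vec![value],
        }
    }

    pub fn string(value: &str) -> Self {
        Self {
            kind: HeaderKind::String,
            value: value.as_bytes().to_vec(),
        }
    }
}

pub type HeaderKey = HeaderField;
pub type HeaderValue = HeaderField;

/// Builds the `1=2` header from the text, where both sides are `u8`.
pub fn example_headers() -> HashMap<HeaderKey, HeaderValue> {
    let mut headers = HashMap::new();
    headers.insert(HeaderField::uint8(1), HeaderField::uint8(2));
    headers
}
```

Because the kind is part of the key, the `u8` key `1` and the string key `"1"` are distinct entries in this sketch, which is something the final design would need to decide on explicitly.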

### Resource metadata

Currently, the server's resources such as stream, topic, partition, segment, 
user have a strict schema, without any way to include some additional tags or 
metadata. It could be useful for future purposes (e.g. related to some custom 
extensions or plugins), or as a way to additionally tag (and maybe filter on, 
etc.) the particular resource, where just a name or ID isn't enough.

Let's consider the `stream` example:
```rust
#[derive(Debug, Serialize, Deserialize)]
pub struct Stream {
    /// The unique identifier (numeric) of the stream.
    pub id: u32,
    /// The timestamp when the stream was created.
    pub created_at: IggyTimestamp,
    /// The unique name of the stream.
    pub name: String,
    /// The total size of the stream in bytes.
    pub size: IggyByteSize,
    /// The total number of messages in the stream.
    pub messages_count: u64,
    /// The total number of topics in the stream.
    pub topics_count: u32,
    /// Optional metadata in the form of arbitrary headers.
    pub metadata: Option<HashMap<HeaderKey, HeaderValue>>,
}
```

The additional `metadata` could follow the same design as the custom user 
headers that can be attached to the message - just a bunch of arbitrary headers 
(`key=value`) which could be used for the user's purposes (e.g. tagging the 
topic with some specific values which are then interpreted by the client 
apps).
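As a sketch of the tagging and filtering idea, a client could select streams by a metadata tag as shown below. `StreamInfo` and `filter_by_tag` are hypothetical, and plain `String` keys and values are used instead of `HeaderKey` and `HeaderValue` to keep the example short.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down view of a stream with optional metadata.
pub struct StreamInfo {
    pub id: u32,
    pub name: String,
    pub metadata: Option<HashMap<String, String>>,
}

/// Returns the streams whose metadata contains the given `key=value` tag.
/// Streams without any metadata never match.
pub fn filter_by_tag<'a>(
    streams: &'a [StreamInfo],
    key: &str,
    value: &str,
) -> Vec<&'a StreamInfo> {
    streams
        .iter()
        .filter(|stream| {
            stream
                .metadata
                .as_ref()
                .and_then(|metadata| metadata.get(key))
                .map_or(false, |found| found.as_str() == value)
        })
        .collect()
}
```

The same lookup could run on the server side if filtering by metadata became part of the API, which is not something the current proposal specifies.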

Please share your opinions and ideas about what would make sense to include in 
the upcoming release(s).

GitHub link: https://github.com/apache/iggy/discussions/2554
