GitHub user chiradip added a comment to the discussion: Iggy Connector
Architecture Involving Data/Stream Processors like Flink
# Connector Library Placement: Where Should Common Code Live?
**Author**: Chiradip
**Date**: 2025-10-16
**Context**: While architecting the Flink connector for Iggy, I realized we'd
have substantial reusable code (~88%) for future processors like Spark. This
led to a fundamental question about where this common connector code should
live.
---
## The Question
While working on `option_x_1.md`, I initially proposed:
```
external-processors/
└── iggy-connector-flink/
    └── iggy-connector-library/          # Contains common + Flink code
        └── org/apache/iggy/connector/
            ├── config/                  # Framework-agnostic
            ├── serialization/           # Framework-agnostic
            ├── offset/                  # Framework-agnostic
            └── flink/                   # Flink-specific
```
But then I thought: **What if the common connector code
(`org.apache.iggy.connector.*`) actually belongs in the Iggy java-sdk itself?**
This isn't just a file organization question - it's a philosophical one about
what "SDK" means for Iggy.
---
## Why I'm Even Considering This
### 1. These Aren't Really "Flink Patterns" - They're "Iggy Patterns"
When I look at what's in the common code:
- **Offset tracking** - This is about Iggy's consumer offset model, not Flink's
- **Partition discovery** - This is about discovering Iggy partitions
- **Connection pooling** - Managing connections to Iggy servers
- **Serialization schemas** - Converting Iggy message bytes to/from objects
All of these are **Iggy integration patterns**, not stream processor patterns.
They're reusable precisely because they encode knowledge about Iggy's
semantics, not Flink's or Spark's.
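To make the first bullet concrete: an offset tracker can be written entirely in terms of Iggy's partition/offset model, with no processor types in sight. A minimal sketch (the class and method names here are hypothetical, not the actual connector API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Tracks the highest committed offset per Iggy partition. Nothing here
// references Flink or Spark: it encodes Iggy's consumer offset semantics only.
final class InMemoryOffsetTracker {
    private final Map<Integer, Long> committed = new ConcurrentHashMap<>();

    void commit(int partitionId, long offset) {
        // Keep only the highest offset seen for each partition,
        // so out-of-order commits cannot move the offset backwards.
        committed.merge(partitionId, offset, Math::max);
    }

    long committedOffset(int partitionId) {
        // -1 signals "nothing committed yet" for this partition.
        return committed.getOrDefault(partitionId, -1L);
    }
}
```

A Flink source would call `commit` from its checkpoint callback and a Spark source from its commit hook, but the tracker itself stays framework-neutral.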
### 2. The Kafka Precedent
I looked at how Kafka does this:
```
kafka-clients/
├── org/apache/kafka/clients/
│   ├── consumer/        # Consumer client
│   ├── producer/        # Producer client
│   └── admin/           # Admin operations
└── org/apache/kafka/common/
    ├── serialization/   # SerDes (Serializer/Deserializer)
    ├── config/          # Configuration abstractions
    └── errors/          # Common exceptions

kafka-streams/           # Higher-level stream processing API
flink-connector-kafka/   # Lives in Flink ecosystem, depends on kafka-clients
spark-connector-kafka/   # Lives in Spark ecosystem, depends on kafka-clients
```
Kafka includes serialization, configuration patterns, and common utilities **in
the core client library**. Flink and Spark connectors then depend on
`kafka-clients` and wrap it in their respective APIs.
This suggests there's precedent for "integration patterns" living in the SDK.
### 3. User Experience Simplification
**Current approach** (separate library):
```kotlin
dependencies {
    implementation("org.apache.iggy:iggy:0.5.0")                    // SDK
    implementation("org.apache.iggy:iggy-connector-library:0.5.0")  // Common connector code
    compileOnly("org.apache.flink:flink-streaming-java:1.18.0")     // Flink
}
```
**If in SDK**:
```kotlin
dependencies {
    implementation("org.apache.iggy:iggy:0.5.0")                 // SDK + common patterns
    compileOnly("org.apache.flink:flink-streaming-java:1.18.0")  // Flink
}
```
One less dependency to manage. One less version to coordinate.
---
## Why I'm Hesitant
### 1. SDK Bloat
The Iggy SDK is currently lean and focused: it's a client for talking to Iggy
servers. Adding ~1500 LOC of connector abstractions changes its character.
Not everyone who uses the SDK needs:
- Offset tracking strategies
- Partition discovery algorithms
- Sophisticated retry policies
- Multiple serialization formats (JSON, Avro, etc.)
A microservice that just publishes events to Iggy doesn't need this extra
weight.
### 2. Dependency Contamination
The connector abstractions would need dependencies:
```kotlin
dependencies {
    // Core Iggy SDK needs these
    implementation("org.apache.httpcomponents.client5:httpclient5:5.4.3")
    implementation("com.fasterxml.jackson.core:jackson-databind:2.18.0")

    // NEW: Connector patterns would add these
    implementation("org.apache.avro:avro:1.11.3")           // For Avro serialization
    implementation("io.micrometer:micrometer-core:1.12.0")  // For metrics collection
    // ... more as serialization formats grow
}
```
Should every SDK user pull in Avro dependencies even if they never serialize to
Avro?
**Counter-argument**: We could use `compileOnly` or make them optional
dependencies, but that adds complexity.
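One common way to make such a dependency truly optional, sketched below: probe the classpath at runtime before enabling a format, so Avro can stay `compileOnly` and only users who add it themselves get Avro support. (The Avro class name probed is real; the guard class itself is a hypothetical illustration.)

```java
// Guards an optional serialization format behind a classpath check, so the
// SDK never needs a hard Avro dependency.
final class SerializationFormats {
    static boolean avroAvailable() {
        try {
            // Present only if the user added org.apache.avro:avro themselves.
            Class.forName("org.apache.avro.Schema");
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }
}
```

The cost the counter-argument mentions is visible here: every optional format needs a probe, a fallback, and documentation explaining which dependency unlocks it.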
### 3. Release Coupling
If the connector patterns need a breaking change (e.g., we discover offset
tracking needs a different interface), it forces the entire SDK to bump major
version.
**Example scenario**:
- SDK is stable at v0.6.0
- We discover offset tracking needs API changes for Spark integration
- Must release SDK v0.7.0 even though client code didn't change
- All users must upgrade, even those not using connectors
With separate artifacts:
- SDK stays at v0.6.0
- `iggy-connector-common` bumps to v0.4.0
- Only connector users are affected
### 4. Conceptual Boundary
I think there's value in maintaining a clear line:
**SDK = "How to communicate with Iggy"**
- Connect to server
- Send messages
- Receive messages
- Manage streams/topics
- Handle consumer groups
**Connector Library = "How to integrate Iggy with data processing frameworks"**
- Offset management strategies
- Partition assignment logic
- Serialization format handling
- Retry and error handling patterns
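To make the last bullet concrete, a retry policy of this kind is framework-agnostic by construction. A rough sketch (hypothetical class, not the planned API): exponential backoff with a cap, usable unchanged from a Flink source or a Spark reader.

```java
// Exponential backoff with an upper bound: attempt 0 waits baseMillis,
// each further attempt doubles the delay, capped at maxMillis.
final class ExponentialBackoff {
    private final long baseMillis;
    private final long maxMillis;

    ExponentialBackoff(long baseMillis, long maxMillis) {
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    long delayFor(int attempt) {
        // Clamp the shift so very high attempt counts cannot overflow.
        long delay = baseMillis << Math.min(attempt, 30);
        return Math.min(delay, maxMillis);
    }
}
```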
Different audiences:
- **SDK users**: Application developers building services
- **Connector users**: Data engineers building pipelines
Mixing these might confuse the story.
### 5. Portability Implications
If we eventually contribute the Flink connector to Apache Flink (or it moves to
a separate repo), having it depend on a lightweight `iggy-connector-common` is
cleaner than depending on the full Iggy SDK.
**Flink connector depending on Iggy SDK:**
- Pulls in HTTP client, full message models, admin APIs
- Heavier dependency footprint
**Flink connector depending on connector-common:**
- Just the abstractions needed for integration
- Lighter, more focused
---
## The Middle Ground I'm Considering
### Option 1: SDK with Modular Structure
Keep it in SDK but make it clearly optional:
```
java-sdk/
└── src/
    └── main/java/org/apache/iggy/
        ├── client/      # Core client (required)
        ├── messages/    # Message models (required)
        ├── streams/     # Stream operations (required)
        └── connector/   # Integration patterns (optional)
            ├── config/
            ├── serialization/
            ├── offset/
            └── partition/
```
Publish single artifact but document that `connector.*` packages are optional
utilities for integration scenarios.
**Pros**:
- One artifact, simpler for users
- Natural discoverability
- Connector patterns get SDK-level stability guarantees
**Cons**:
- Still has bloat/dependency issues
- Can't version independently
- Mixed concerns in one artifact
### Option 2: Separate Module, Co-Located in java-sdk Project
```
foreign/java/
├── java-sdk/                  # Core client
│   └── build.gradle.kts
├── java-connector-common/     # NEW: Connector abstractions
│   └── build.gradle.kts
│       dependencies {
│           api(project(":java-sdk"))   # Depends on SDK
│       }
└── external-processors/
    └── iggy-connector-flink/
        └── build.gradle.kts
            dependencies {
                api("org.apache.iggy:iggy-connector-common:0.5.0")
            }
```
**Published artifacts**:
- `org.apache.iggy:iggy` - Core SDK
- `org.apache.iggy:iggy-connector-common` - Connector abstractions
- Flink/Spark connectors depend on `iggy-connector-common`
**Pros**:
- Clean separation of concerns
- Independent versioning
- SDK stays lean
- Connector patterns live at Iggy level (not scattered in processor projects)
- Easy to find: still in `foreign/java/`
**Cons**:
- Two artifacts instead of one
- Users need both dependencies
- More complex dependency graph
### Option 3: Stay in iggy-connector-flink, Extract When Spark Arrives
The current plan in `option_x_1.md`:
- Start with `iggy-connector-library` inside `iggy-connector-flink/`
- Contains both common and Flink code
- When Spark is added, evaluate and extract common to `iggy-connector-common/`
if reuse >70%
**Pros**:
- Pragmatic: build first, extract later
- Allows patterns to stabilize before promoting
- YAGNI principle: don't create abstractions until second use case exists
- Flink-first approach keeps things simple initially
**Cons**:
- Requires refactoring when Spark arrives
- Flink connector temporarily "owns" common code
- Users might start depending on internal packages
---
## My Current Thinking
### Decision Framework
I'm leaning towards making this decision based on **pattern maturity**:
```
┌─────────────────────────────────────────────────────────────┐
│ Pattern Lifecycle │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. Experimental (in processor-specific package) │
│ └─> Patterns are unproven, rapidly changing │
│ │
│ 2. Stable (extract to java-connector-common) │
│ └─> Battle-tested by 2+ processors, API stable │
│ │
│ 3. Foundational (promote to java-sdk) │
│ └─> Universal patterns, used by majority of users │
│ │
└─────────────────────────────────────────────────────────────┘
```
### Phase 1: Start in iggy-connector-flink (Now)
Keep the current `option_x_1.md` structure:
```
external-processors/
└── iggy-connector-flink/
    └── iggy-connector-library/
        └── org/apache/iggy/connector/
            ├── config/           # Common abstractions
            ├── serialization/    # Common abstractions
            ├── offset/           # Common abstractions
            └── flink/            # Flink-specific
```
**Why**:
- Patterns are still **experimental** - I haven't actually built them yet!
- Allows rapid iteration without breaking SDK compatibility
- No commitment until we see real-world usage
- YAGNI: don't extract until proven valuable by second consumer (Spark)
**Design discipline**:
- Keep `org.apache.iggy.connector.*` (non-flink) packages framework-agnostic
- No Flink imports in common code
- Design as if they'll be extracted later
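The "no Flink imports in common code" rule can be enforced with an interface-based design like the following sketch (names hypothetical): the common package defines the contract purely in terms of Iggy bytes, and a Flink-specific adapter in `connector.flink` would later wrap it in Flink's own types.

```java
import java.nio.charset.StandardCharsets;

// Lives in org.apache.iggy.connector.serialization: no framework types,
// just a mapping from Iggy message bytes to a user object.
interface IggyDeserializationSchema<T> {
    T deserialize(byte[] messageBytes);
}

// A trivial implementation, also framework-free.
final class StringDeserializationSchema implements IggyDeserializationSchema<String> {
    @Override
    public String deserialize(byte[] messageBytes) {
        return new String(messageBytes, StandardCharsets.UTF_8);
    }
}
```

Extraction later then amounts to moving these files; only the Flink adapter, which imports Flink, stays behind.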
### Phase 2: Extract to java-connector-common (When Spark Arrives)
After implementing Spark connector and validating >70% reuse:
```
foreign/java/
├── java-sdk/                      # Lean core
├── java-connector-common/         # NEW: Proven patterns
│   └── org/apache/iggy/connector/
│       ├── config/
│       ├── serialization/
│       └── offset/
└── external-processors/
    ├── iggy-connector-flink/
    │   └── iggy-flink-connector/  # Flink-specific only
    └── iggy-connector-spark/
        └── iggy-spark-connector/  # Spark-specific only
```
**Why**:
- Patterns proven stable by 2 implementations
- Clear reuse case justifies extraction
- Clean separation achieved
- Independent versioning established
**Migration**:
1. Create `java-connector-common/` module
2. Move `org.apache.iggy.connector.*` packages from Flink library
3. Both Flink and Spark depend on `connector-common`
4. Publish `org.apache.iggy:iggy-connector-common`
### Phase 3: Consider SDK Promotion (After 1+ Year Stability)
After `connector-common` has been stable for 1+ year and multiple processors
(Flink, Spark, maybe Beam):
**Evaluate**:
- Are these patterns used by >50% of Iggy users (not just processor
integrations)?
- Have APIs been stable (no breaking changes in a year)?
- Are dependencies minimal (no heavy libs like Avro required in common)?
- Do non-processor use cases benefit (e.g., custom consumers using offset
tracking)?
**If YES → Merge into SDK**:
```
java-sdk/
└── org/apache/iggy/
    ├── client/      # Core client
    ├── connector/   # Promoted patterns
    └── ...
```
**If NO → Keep separate**:
Connector patterns remain at integration level, not SDK level.
---
## Kafka vs Iggy: Key Difference
One thing I realized: **Kafka's client already includes SerDes because that's
fundamental to Kafka's usage model.**
Every Kafka producer/consumer configures serializers:
```java
Properties props = new Properties();
props.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
```
Serialization is **core to Kafka's API**. So `kafka-clients` includes SerDes.
**Iggy is different**: The SDK sends/receives byte arrays. Serialization is
application-level concern:
```java
IggyClient client = IggyClient.create(config);
// Serialization happens in application code: the SDK only sees bytes.
byte[] messageBytes = objectMapper.writeValueAsBytes(user);
client.sendMessage(streamId, topicId, messageBytes);
```
So for Iggy, serialization schemas aren't SDK-level - they're
**integration-level**. This supports keeping them separate.
---
## My Decision (For Now)
### Stick with option_x_1.md Structure
I'm going to:
1. **Keep `iggy-connector-library` in `iggy-connector-flink/`** as designed in
option_x_1.md
2. **Use clear package separation**: `connector.*` (common) vs
`connector.flink.*` (Flink-specific)
3. **Design for extraction**: Write common code as if it will be pulled out
later
4. **Document the evolution path**: Clearly state in README that common code
may be extracted when Spark is added
### Rationale
- **Pragmatic**: Build and prove the patterns first
- **Flexible**: Can extract to `connector-common` or even SDK later based on
real data
- **Low risk**: If extraction never happens (e.g., Spark reuse is <40%), we
haven't pre-optimized
- **Reversible**: Clear package boundaries make future refactoring
straightforward
### Design Principles to Follow
To make future extraction easy:
1. ✅ **No Flink imports in `org.apache.iggy.connector.*`** (only in
`connector.flink.*`)
2. ✅ **Minimal dependencies** in common code (avoid heavy libs)
3. ✅ **Interface-based design** (easy to swap implementations)
4. ✅ **Document extraction intent** in package-info.java
5. ✅ **Treat as "internal"** initially (no stability guarantees to external
users)
### When to Revisit
Revisit this decision when:
- Spark connector is being implemented (evaluate actual reuse %)
- Common patterns have been stable for 6+ months
- External users ask to use connector patterns standalone
- We discover non-processor use cases for these utilities
---
## Questions I'm Still Pondering
### 1. Should Serialization Live in SDK?
Even if offset tracking and partition discovery stay in connectors, should
serialization schemas be SDK-level?
**Argument for SDK**:
- Every Iggy user needs to serialize/deserialize
- Basic JSON/Avro helpers would benefit all users
- Natural place for `IggyMessage<T>` type-safe wrappers
**Argument against**:
- SDK is currently byte-array focused (clean, simple)
- Serialization format is application choice
- Adding Jackson/Avro deps to SDK is heavy
**Leaning**: Keep in connector library for now, but this is a strong candidate
for eventual SDK inclusion.
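For the record, the `IggyMessage<T>` wrapper mentioned above could look roughly like this sketch (an entirely hypothetical API, not something the SDK ships):

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Type-safe wrapper over Iggy's byte[] payloads: the deserializer runs only
// when value() is called, so raw bytes can still pass through untouched.
final class IggyMessage<T> {
    private final byte[] raw;
    private final Function<byte[], T> deserializer;

    IggyMessage(byte[] raw, Function<byte[], T> deserializer) {
        this.raw = raw;
        this.deserializer = deserializer;
    }

    byte[] rawBytes() {
        return raw;
    }

    T value() {
        return deserializer.apply(raw);
    }
}
```

This is the kind of convenience that would pull serialization toward the SDK: it is useful to every user, yet it forces the SDK to take a position on deserialization.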
### 2. What About Client Pooling?
Connection pooling in `org.apache.iggy.connector.util.IggyClientPool` - is this
connector-level or SDK-level?
Many non-connector users would benefit from this. Maybe this should be in SDK
sooner?
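A pool along those lines is small enough to sketch here. The sketch below is generic over the client type; `SimplePool` and its methods are illustrative, not the actual `IggyClientPool` API, and a real pool would block or grow instead of failing fast.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// A fixed-size pool of pre-built clients. acquire() is non-blocking for
// brevity; callers must release() each client when done with it.
final class SimplePool<C> {
    private final BlockingQueue<C> idle;

    SimplePool(int size, Supplier<C> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    C acquire() {
        C client = idle.poll();
        if (client == null) {
            throw new IllegalStateException("pool exhausted");
        }
        return client;
    }

    void release(C client) {
        idle.offer(client);
    }
}
```

Nothing in it depends on connector abstractions, which is exactly why it might belong in the SDK sooner than the rest.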
### 3. Naming: "Connector" vs "Integration"?
`org.apache.iggy.connector.*` implies these are connector-specific patterns.
`org.apache.iggy.integration.*` might better signal "reusable integration
utilities."
But "connector" is more specific and matches industry terminology (Kafka uses
"connector").
**Decision**: Stick with "connector" for now.
---
## Conclusion
I'm documenting this deliberation because I think it reflects an important
architectural tension: **SDK minimalism vs. integration convenience**.
Kafka chose convenience (include SerDes in client). That works for them because
their usage model demands it.
Iggy's SDK is cleaner without these concerns. But as we build the integration
ecosystem (Flink, Spark, Beam), we need shared patterns.
The three-phase evolution (experimental → extracted → promoted) gives us time
to learn what belongs where without premature commitment.
**For option_x_1.md**: The current design stands. No changes needed. This
deliberation informs future decisions, not present ones.
---
**Status**: Decision made - stick with current structure, document evolution
path
**Next review**: When Spark connector implementation begins
**Owner**: Chiradip
**Stakeholders**: Iggy maintainers, future Flink/Spark connector contributors
GitHub link:
https://github.com/apache/iggy/discussions/2236#discussioncomment-14702253