https://github.com/apache/pulsar/pull/25383

------

# PIP-465: Split IO Connectors into Separate Repository

# Background Knowledge

Apache Pulsar ships ~30 IO connectors (Kafka, Kinesis, Cassandra, Elasticsearch, JDBC, Debezium,
etc.) as part of its main repository. These connectors are packaged as NAR files and bundled into
a `pulsar-all` Docker image alongside the core broker, client, and functions runtime.

Each connector brings its own dependency tree, often large and conflicting with other connectors
or with Pulsar's core dependencies. The connectors interact with Pulsar exclusively through the
stable `pulsar-io-core` API, making them natural candidates for independent development and release.

# Motivation

The primary goal of this PIP is to **make development of Pulsar easier** by shrinking the core
codebase. Removing ~30 connectors and their dependency trees from the main repository should
substantially reduce compile time, test execution time, and CI resource consumption, and improve
CI stability.

**Build and CI impact.** Compiling and packaging ~30 connector NARs adds significant time to
every CI run and local build, even when a developer is only working on the broker or client.
The connectors collectively bring hundreds of transitive dependencies into the build graph,
which slows down dependency resolution, inflates vulnerability reports (OWASP checks must scan
connector dependencies), and creates version conflicts that require careful management in the
main repository's BOM. Removing them dramatically reduces the surface area of the build.

**Release coupling.** Connectors are tied to the Pulsar release cycle. A bug fix in a single
connector (e.g., updating the Elasticsearch client) requires waiting for the next Pulsar release.
Conversely, a Pulsar patch release must rebuild all connectors even when none of them changed.
After the split, connectors follow a release cadence independent from Pulsar releases, similar
to what we already do for client SDKs (Go, Python, Node.js).

**Low integration risk.** The `pulsar-io-core` API that connectors depend on has been very
stable for a long time. There have been no breaking changes to the connector API in years,
so the split is expected to carry very little risk of integration pain.

**Docker image bloat.** The `pulsar-all` image bundles every connector NAR, weighing in at
~2.9 GB, a very large image that most deployments don't need. Users typically deploy only one
or two connectors but pay the image pull cost for all of them. The main reason users chose
`pulsar-all` over `pulsar` was to get the tiered-storage offloaders; this PIP addresses that by
packaging the offloader NARs directly into the `pulsar` image. Users who need specific connectors
can still build tailored images by adding just the connector NARs they need on top of
`apachepulsar/pulsar`.

**Independent velocity.** Connector maintainers should be able to release new connector versions
against a stable Pulsar API without coordinating with the core release train.

# Goals

## In Scope

- **Create `apache/pulsar-connectors` repository** containing all IO connector modules, with
  their own Gradle build, version catalog, and CI pipeline. The repository is forked from the
  main Pulsar repository to preserve full git history.

- **Remove connector modules from the main Pulsar repository.** Of the connector-related code, retain only:
  - `pulsar-io-core` (the connector API)
  - `pulsar-io-data-generator` (minimal connector used in integration tests)
  - The functions runtime and worker that load connectors at runtime

- **Remove the `pulsar-all` Docker image.** The image is too large and most users don't need
  all connectors in a single image. The `pulsar` image becomes the single official image.
  Tiered-storage offloader NARs (the main reason users chose `pulsar-all`) are included
  directly in the `pulsar` image.

- **Independent connector releases.** The `pulsar-connectors` repository has its own versioning
  and release cadence, independent from Pulsar releases, similar to what we already do for
  client SDKs. It can release new connector versions against any compatible Pulsar release.

- **Connector distribution packaging.** The connectors repository produces a single release
  containing all connector NARs, as a distribution tarball that users can deploy into an
  existing Pulsar installation.

## Out of Scope

- Changing the connector API (`pulsar-io-core`)
- Changing how the functions worker discovers and loads connector NARs
- A connector marketplace or registry (future enhancement)
- Splitting out tiered-storage offloaders into their own repository

# High Level Design

The split creates two repositories from what is currently one:

```
apache/pulsar (main repo)
├── pulsar-io/core/            # Connector API (retained)
├── pulsar-io/data-generator/  # Test connector (retained)
├── pulsar-functions/          # Runtime + worker (retained)
├── docker/pulsar/             # Single Docker image
└── (broker, client, etc.)

apache/pulsar-connectors (new repo)
├── aerospike/
├── aws/
├── cassandra/
├── debezium/
│   ├── core/
│   ├── mysql/
│   ├── postgres/
│   └── ...
├── elastic-search/
├── jdbc/
│   ├── core/
│   ├── postgres/
│   └── ...
├── kafka/
├── kafka-connect-adaptor/
├── kinesis/
├── rabbitmq/
├── ... (all other connectors)
├── distribution/io/         # Distribution packaging
└── docs/                    # Connector docs generation
```

The connectors repository consumes Pulsar artifacts (`pulsar-io-core`, `pulsar-client`, etc.)
as external Maven dependencies, not as source dependencies. This ensures connectors build against
the published API and don't accidentally depend on internals.

# Detailed Design

## Repository Structure

The new `pulsar-connectors` repository is forked from the main Pulsar repository to preserve
git history, then trimmed to contain only connector-related modules. Connectors are promoted
from nested `pulsar-io/<name>` paths to top-level `<name>/` directories for a flatter structure.

## Build Configuration

The connectors repository has its own:
- `settings.gradle.kts` with all connector modules
- `gradle/libs.versions.toml` with connector-specific dependency versions
- `pulsar-dependencies/` platform module pinning Pulsar artifact versions
- `build.gradle.kts` root build with shared configuration
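
As a rough illustration of the platform module listed above, a `pulsar-dependencies/build.gradle.kts` could pin the Pulsar artifacts through Gradle's `java-platform` plugin. This is only a sketch: the `pulsarVersion` property name and the selection of artifacts are assumptions, not the contents of the actual repository.

```kotlin
// Hypothetical pulsar-dependencies/build.gradle.kts
plugins {
    `java-platform`
}

// Assumption: the Pulsar version is supplied as a Gradle property,
// e.g. a "pulsarVersion" entry in gradle.properties.
val pulsarVersion: String = providers.gradleProperty("pulsarVersion").get()

dependencies {
    constraints {
        // Pin the Pulsar artifacts that connectors consume to a single version.
        api("org.apache.pulsar:pulsar-io-core:$pulsarVersion")
        api("org.apache.pulsar:pulsar-client:$pulsarVersion")
    }
}
```

Connector modules that import such a platform would pick up the pinned versions without repeating them.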

Pulsar core artifacts are declared as dependencies with a configurable version:
```kotlin
implementation("org.apache.pulsar:pulsar-io-core:${pulsarVersion}")
```
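
If the version is pinned centrally instead, a connector module could import the `pulsar-dependencies` platform and leave the version off the dependency itself. A minimal sketch, assuming the platform is registered as the `:pulsar-dependencies` project and using a hypothetical `kafka/build.gradle.kts`:

```kotlin
// Hypothetical kafka/build.gradle.kts
plugins {
    `java-library`
}

dependencies {
    // Import version constraints from the platform module.
    // The project path ":pulsar-dependencies" is an assumption.
    implementation(platform(project(":pulsar-dependencies")))

    // No explicit version here: it is resolved from the platform's constraints.
    implementation("org.apache.pulsar:pulsar-io-core")
}
```

With this arrangement, moving all connectors to a different Pulsar release would be a one-line change in the platform module.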

## Versioning Strategy

The initial release of `pulsar-connectors` will use the same version as the next Pulsar
release (whether that is 4.3 or 5.0), to make the transition clear. After that, the
connectors repository follows its own independent release cadence.
All connectors are released together as a single release (not individually), and each
release specifies which Pulsar versions it is compatible with.

## Docker Image Changes

The `pulsar-all` image is removed. It bundled all connector NARs alongside the broker,
producing a very large image that most deployments didn't need. The main reason users chose
`pulsar-all` over `pulsar` was to get the tiered-storage offloaders. With this change:

- Tiered-storage offloader NARs move into the `pulsar` image, eliminating the primary reason
  for `pulsar-all` to exist
- The `pulsar` Docker image becomes the single official image, containing the broker, functions
  runtime, and tiered-storage offloader NARs
- Users who need specific connectors can build tailored images by adding just the connector
  NARs they need on top of `apachepulsar/pulsar`, or mount the NARs into the container as volumes

## CI and Testing

- The main Pulsar repository's CI no longer builds or tests connectors
- The connectors repository has its own CI that builds and tests all connectors
- Integration tests that exercise specific connectors (e.g., Cassandra sink, Kafka source)
  move to the connectors repository
- The main repository retains integration tests using `data-generator` for testing the
  connector loading and runtime machinery

## Migration for Users

Users who currently use the `pulsar-all` Docker image:
1. Switch to the `pulsar` Docker image
2. Download the needed connector NARs from the connectors release
3. Mount the NARs into the container (e.g., via a volume mount to `/pulsar/connectors/`)

Users who build from source:
1. Build the main Pulsar repository as before (faster, since connectors are gone)
2. Build the connectors repository separately if needed

## Public-facing Changes

### Docker Images

| Before | After |
|--------|-------|
| `pulsar` — core only | `pulsar` — core + tiered-storage offloaders |
| `pulsar-all` — core + all connectors + offloaders | *(removed)* |

### Artifacts

- All connector NARs move from the main Pulsar release to a single unified release from
  the `pulsar-connectors` repository
- All other Pulsar artifacts remain unchanged

### Configuration

No changes to broker, client, or functions worker configuration.

# Backward & Forward Compatibility

## Backward Compatibility

The connector API (`pulsar-io-core`) does not change. Existing connector NARs continue
to work with the functions worker without modification.

The `pulsar-io-core` API has been very stable for years with no breaking changes, so connectors
built against older API versions are expected to keep working with newer Pulsar releases, and
vice versa.

## Forward Compatibility

New connector releases can target older Pulsar versions, as long as the `pulsar-io-core`
API they depend on is compatible. Given the long track record of API stability, this is
expected to work seamlessly across Pulsar 4.x releases.

# Security Considerations

The split does not change the security model: connectors continue to be loaded through the same
NAR classloader isolation mechanism.

Separating connector dependencies from the main repository also improves the security posture
by reducing the attack surface of the core Pulsar build and making connector dependency
updates independently releasable.

--
Matteo Merli