https://github.com/apache/pulsar/pull/25383

# PIP-465: Split IO Connectors into Separate Repository

# Background Knowledge

Apache Pulsar ships ~30 IO connectors (Kafka, Kinesis, Cassandra,
Elasticsearch, JDBC, Debezium,
etc.) as part of its main repository. These connectors are packaged as
NAR files and bundled into
a `pulsar-all` Docker image alongside the core broker, client, and
functions runtime.

Each connector brings its own dependency tree — often large and
conflicting with other connectors
or with Pulsar's core dependencies. The connectors interact with
Pulsar exclusively through the
stable `pulsar-io-core` API, making them natural candidates for
independent development and release.

# Motivation

The primary goal of this PIP is to **make development of Pulsar easier** by shrinking the core
codebase. Removing ~30 connectors and their dependency trees from the main repository is
expected to substantially reduce compile time, test execution time, and CI resource consumption,
and to improve CI stability.

**Build and CI impact.** Compiling and packaging 30+ connector NARs
adds significant time to
every CI run and local build, even when a developer is only working on
the broker or client.
The connectors collectively bring hundreds of transitive dependencies
into the build graph,
which slows down dependency resolution, inflates vulnerability reports
(OWASP checks must scan
connector dependencies), and creates version conflicts that require
careful management in the
main repository's BOM. Removing them dramatically reduces the surface
area of the build.

**Release coupling.** Connectors are tied to the Pulsar release cycle.
A bug fix in a single
connector (e.g., updating the Elasticsearch client) requires waiting
for the next Pulsar release.
Conversely, a Pulsar patch release must rebuild all connectors even
when none of them changed.
The release cadence for connectors will be independent from Pulsar
releases, similar to what
we already do for client SDKs (Go, Python, Node.js).

**Low integration risk.** The `pulsar-io-core` API that connectors depend on has been very
stable for a long time. There have been no breaking changes to the connector API in years,
so the risk of integration problems from this split is low.

**Docker image bloat.** The `pulsar-all` image bundles every connector NAR and weighs in at
~2.9 GB, far more than most deployments need. Users typically deploy only one or two connectors
but pay the image pull cost for all of them. The main reason users choose `pulsar-all` over
`pulsar` is to get the tiered-storage offloaders; this PIP addresses that by packaging the
offloader NARs directly into the `pulsar` image. Users who need specific connectors can still
build tailored images by adding just the connector NARs they need on top of `apachepulsar/pulsar`.

**Independent velocity.** Connector maintainers should be able to
release new connector versions
against a stable Pulsar API without coordinating with the core release train.

# Goals

## In Scope

- **Create `apache/pulsar-connectors` repository** containing all IO
connector modules, with
  their own Gradle build, version catalog, and CI pipeline. The
repository is forked from the
  main Pulsar repository to preserve full git history.

- **Remove connector modules from the main Pulsar repository.** Retain only:
  - `pulsar-io-core` (the connector API)
  - `pulsar-io-data-generator` (minimal connector used in integration tests)
  - The functions runtime and worker that load connectors at runtime

- **Remove the `pulsar-all` Docker image.** The image is too large and
most users don't need
  all connectors in a single image. The `pulsar` image becomes the
single official image.
  Tiered-storage offloader NARs — the main reason users chose
`pulsar-all` — are included
  directly in the `pulsar` image.

- **Independent connector releases.** The `pulsar-connectors`
repository has its own versioning
  and release cadence, independent from Pulsar releases — similar to
what we already do for
  client SDKs. It can release new connector versions against any
compatible Pulsar release.

- **Connector distribution packaging.** Each release of the connectors repository contains
  all connector NARs, packaged as a single distribution tarball that users can deploy into
  an existing Pulsar installation.

## Out of Scope

- Changing the connector API (`pulsar-io-core`)
- Changing how the functions worker discovers and loads connector NARs
- A connector marketplace or registry (future enhancement)
- Splitting out tiered-storage offloaders into their own repository

# High Level Design

The split creates two repositories from what is currently one:

```
apache/pulsar (main repo)
├── pulsar-io/core/           # Connector API (retained)
├── pulsar-io/data-generator/ # Test connector (retained)
├── pulsar-functions/         # Runtime + worker (retained)
├── docker/pulsar/            # Single Docker image
└── (broker, client, etc.)

apache/pulsar-connectors (new repo)
├── aerospike/
├── aws/
├── cassandra/
├── debezium/
│   ├── core/
│   ├── mysql/
│   ├── postgres/
│   └── ...
├── elastic-search/
├── jdbc/
│   ├── core/
│   ├── postgres/
│   └── ...
├── kafka/
├── kafka-connect-adaptor/
├── kinesis/
├── rabbitmq/
├── ... (all other connectors)
├── distribution/io/         # Distribution packaging
└── docs/                    # Connector docs generation
```

The connectors repository consumes Pulsar artifacts (`pulsar-io-core`,
`pulsar-client`, etc.)
as external Maven dependencies, not as source dependencies. This
ensures connectors build against
the published API and don't accidentally depend on internals.

# Detailed Design

## Repository Structure

The new `pulsar-connectors` repository is forked from the main Pulsar
repository to preserve
git history, then trimmed to contain only connector-related modules.
Connectors are promoted
from nested `pulsar-io/<name>` paths to top-level `<name>/`
directories for a flatter structure.
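
As a rough illustration of the flattened layout, the root `settings.gradle.kts` might look
something like the sketch below. The module list is only a subset taken from the tree in the
High Level Design section, and the `rootProject.name` value is an assumption; the actual
settings file in the new repository may be organized differently.

```kotlin
// settings.gradle.kts (illustrative sketch, not the actual file)

// Assumption: the root project is named after the repository.
rootProject.name = "pulsar-connectors"

// Connectors promoted from pulsar-io/<name> to top-level <name>/ directories.
include("aerospike", "cassandra", "kafka", "kinesis", "rabbitmq")

// Connector families keep their sub-modules under a shared parent directory;
// "debezium:mysql" maps to the debezium/mysql/ directory by Gradle convention.
include("debezium:core", "debezium:mysql", "debezium:postgres")
include("jdbc:core", "jdbc:postgres")

// Distribution packaging module from the proposed layout (distribution/io/).
include("distribution:io")
```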

## Build Configuration

The connectors repository has its own:
- `settings.gradle.kts` with all connector modules
- `gradle/libs.versions.toml` with connector-specific dependency versions
- `pulsar-dependencies/` platform module pinning Pulsar artifact versions
- `build.gradle.kts` root build with shared configuration

Pulsar core artifacts are declared as dependencies with a configurable version:
```kotlin
implementation("org.apache.pulsar:pulsar-io-core:${pulsarVersion}")
```
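
One way the `pulsar-dependencies/` platform module could pin Pulsar artifact versions is with
Gradle's `java-platform` plugin, as in the sketch below. This is an assumption about how the
module might be written, not a description of the actual build: in particular, reading
`pulsarVersion` from a Gradle property is a hypothetical choice made for illustration.

```kotlin
// pulsar-dependencies/build.gradle.kts (illustrative sketch)
plugins {
    `java-platform`
}

// Assumption: the Pulsar version is supplied as a Gradle property, for example
// an entry in gradle.properties or -PpulsarVersion=<version> on the command line.
val pulsarVersion: String = providers.gradleProperty("pulsarVersion").get()

dependencies {
    constraints {
        // Pin the Pulsar artifacts that connectors consume as external dependencies.
        api("org.apache.pulsar:pulsar-io-core:$pulsarVersion")
        api("org.apache.pulsar:pulsar-client:$pulsarVersion")
    }
}
```

With a platform like this, a connector module could declare
`implementation(platform(project(":pulsar-dependencies")))` and then reference
`org.apache.pulsar:pulsar-io-core` without repeating the version in every module.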

## Versioning Strategy

The connectors repository uses its own version scheme, independent of
Pulsar's version.
All connectors are released together as a single release (not
individually), and each
release specifies which Pulsar versions it is compatible with (e.g.,
"connectors 1.0.0
is compatible with Pulsar 4.x").

## Docker Image Changes

The `pulsar-all` image is removed. It bundled all connector NARs
alongside the broker,
producing a very large image that most deployments didn't need. The
main reason users chose
`pulsar-all` over `pulsar` was to get the tiered-storage offloaders.
With this change:

- Tiered-storage offloader NARs move into the `pulsar` image,
eliminating the primary reason
  for `pulsar-all` to exist
- The `pulsar` Docker image becomes the single official image,
containing the broker, functions
  runtime, and tiered-storage offloader NARs
- Users who need specific connectors can build tailored images by
adding just the connector
  NARs they need on top of `apachepulsar/pulsar`, or mount them via
volume mounts

## CI and Testing

- The main Pulsar repository's CI no longer builds or tests connectors
- The connectors repository has its own CI that builds and tests all connectors
- Integration tests that exercise specific connectors (e.g., Cassandra
sink, Kafka source)
  move to the connectors repository
- The main repository retains integration tests using `data-generator`
for testing the
  connector loading and runtime machinery

## Migration for Users

Users who currently use the `pulsar-all` Docker image:
1. Switch to the `pulsar` Docker image
2. Download needed connector NARs from the connectors release
3. Mount NARs into the container (e.g., via volume mount to
`/pulsar/connectors/`)

Users who build from source:
1. Build the main Pulsar repository as before (faster, since
connectors are gone)
2. Build the connectors repository separately if needed

## Public-facing Changes

### Docker Images

| Before | After |
|--------|-------|
| `pulsar` — core only | `pulsar` — core + tiered-storage offloaders |
| `pulsar-all` — core + all connectors + offloaders | *(removed)* |

### Artifacts

- All connector NARs move from the main Pulsar release to a single
unified release from
  the `pulsar-connectors` repository
- All other Pulsar artifacts remain unchanged

### Configuration

No changes to broker, client, or functions worker configuration.

# Backward & Forward Compatibility

## Backward Compatibility

The connector API (`pulsar-io-core`) does not change. Existing connector NARs continue
to work with the functions worker without modification.

Because the `pulsar-io-core` API has had no breaking changes for years, connectors built
against older API versions are expected to keep working with newer Pulsar releases, and
vice versa.

## Forward Compatibility

New connector releases can target older Pulsar versions, as long as
the `pulsar-io-core`
API they depend on is compatible. Given the long track record of API
stability, this is
expected to work seamlessly across Pulsar 4.x releases.

# Security Considerations

No negative security implications. Connectors continue to be loaded through the same NAR
classloader isolation mechanism, and the split does not change the security model.

Separating connector dependencies from the main repository improves the security posture of
the core build: fewer third-party dependencies need to be scanned and patched there, and
connector dependency updates can be released independently.



--
Matteo Merli
<[email protected]>
