+1 (binding)

-Lari

Sidenote: One small detail to take into account while moving the Pulsar IO 
connectors is to retain git history so that "git blame" can be used to 
understand the reasons behind the code. That's possible by rewriting the git 
history and filtering out all directories other than the ones to retain. The 
process used for apache/pulsar-sql is documented here: 
https://github.com/apache/pulsar-sql#extracting-the-pulsar-sql-modules-with-preserved-commit-history
A similar solution could be used for apache/pulsar-connectors to retain git 
history. I can volunteer to perform this step.

On 2026/03/24 20:12:19 Matteo Merli wrote:
> https://github.com/apache/pulsar/pull/25383
> 
> ------
> 
> # PIP-465: Split IO Connectors into Separate Repository
> 
> # Background Knowledge
> 
> Apache Pulsar ships ~30 IO connectors (Kafka, Kinesis, Cassandra,
> Elasticsearch, JDBC, Debezium,
> etc.) as part of its main repository. These connectors are packaged as NAR
> files and bundled into
> a `pulsar-all` Docker image alongside the core broker, client, and
> functions runtime.
> 
> Each connector brings its own dependency tree — often large and conflicting
> with other connectors
> or with Pulsar's core dependencies. The connectors interact with Pulsar
> exclusively through the
> stable `pulsar-io-core` API, making them natural candidates for independent
> development and release.
> 
> # Motivation
> 
> The primary goal of this PIP is to **make development of Pulsar easier** by
> shrinking the core
> codebase. Removing ~30 connectors and their dependency trees from the main
> repository will
> massively improve compile time, test execution time, CI resource
> consumption, and CI stability.
> 
> **Build and CI impact.** Compiling and packaging 30+ connector NARs adds
> significant time to
> every CI run and local build, even when a developer is only working on the
> broker or client.
> The connectors collectively bring hundreds of transitive dependencies into
> the build graph,
> which slows down dependency resolution, inflates vulnerability reports
> (OWASP checks must scan
> connector dependencies), and creates version conflicts that require careful
> management in the
> main repository's BOM. Removing them dramatically reduces the surface area
> of the build.
> 
> **Release coupling.** Connectors are tied to the Pulsar release cycle. A
> bug fix in a single
> connector (e.g., updating the Elasticsearch client) requires waiting for
> the next Pulsar release.
> Conversely, a Pulsar patch release must rebuild all connectors even when
> none of them changed.
> The release cadence for connectors will be independent from Pulsar
> releases, similar to what
> we already do for client SDKs (Go, Python, Node.js).
> 
> **Low integration risk.** The `pulsar-io-core` API that connectors depend
> on has been very
> stable for a long time. There have been no breaking changes to the
> connector API in years,
> so there is essentially no risk of integration pain from this split.
> 
> **Docker image bloat.** The `pulsar-all` image bundles every connector NAR,
> weighing in at
> ~2.9 GB — a very large image that most deployments don't need. Users
> typically deploy only
> 1-2 connectors but pay the image pull cost for all of them. The main reason
> users chose
> `pulsar-all` over
> `pulsar` was to get the tiered-storage offloaders — this PIP addresses that
> by packaging the
> offloader NARs directly into the `pulsar` image. Users who need specific
> connectors can still
> build tailored images by adding just the connector NARs they need on top of
> `apachepulsar/pulsar`.
> 
> **Independent velocity.** Connector maintainers should be able to release
> new connector versions
> against a stable Pulsar API without coordinating with the core release
> train.
> 
> # Goals
> 
> ## In Scope
> 
> - **Create `apache/pulsar-connectors` repository** containing all IO
> connector modules, with
>   their own Gradle build, version catalog, and CI pipeline. The repository
> is forked from the
>   main Pulsar repository to preserve full git history.
> 
> - **Remove connector modules from the main Pulsar repository.** Retain only:
>   - `pulsar-io-core` (the connector API)
>   - `pulsar-io-data-generator` (minimal connector used in integration tests)
>   - The functions runtime and worker that load connectors at runtime
> 
> - **Remove the `pulsar-all` Docker image.** The image is too large and most
> users don't need
>   all connectors in a single image. The `pulsar` image becomes the single
> official image.
>   Tiered-storage offloader NARs — the main reason users chose `pulsar-all`
> — are included
>   directly in the `pulsar` image.
> 
> - **Independent connector releases.** The `pulsar-connectors` repository
> has its own versioning
>   and release cadence, independent from Pulsar releases — similar to what
> we already do for
>   client SDKs. It can release new connector versions against any compatible
> Pulsar release.
> 
> - **Connector distribution packaging.** The connectors repository produces
> a single release
>   containing all connector NARs, as a distribution tarball that users can
> deploy into an
>   existing Pulsar installation.
> 
> ## Out of Scope
> 
> - Changing the connector API (`pulsar-io-core`)
> - Changing how the functions worker discovers and loads connector NARs
> - A connector marketplace or registry (future enhancement)
> - Splitting out tiered-storage offloaders into their own repository
> 
> # High Level Design
> 
> The split creates two repositories from what is currently one:
> 
> ```
> apache/pulsar (main repo)
> ├── pulsar-io/core/          # Connector API (retained)
> ├── pulsar-io/data-generator/ # Test connector (retained)
> ├── pulsar-functions/        # Runtime + worker (retained)
> ├── docker/pulsar/           # Single Docker image
> └── (broker, client, etc.)
> 
> apache/pulsar-connectors (new repo)
> ├── aerospike/
> ├── aws/
> ├── cassandra/
> ├── debezium/
> │   ├── core/
> │   ├── mysql/
> │   ├── postgres/
> │   └── ...
> ├── elastic-search/
> ├── jdbc/
> │   ├── core/
> │   ├── postgres/
> │   └── ...
> ├── kafka/
> ├── kafka-connect-adaptor/
> ├── kinesis/
> ├── rabbitmq/
> ├── ... (all other connectors)
> ├── distribution/io/         # Distribution packaging
> └── docs/                    # Connector docs generation
> ```
> 
> The connectors repository consumes Pulsar artifacts (`pulsar-io-core`,
> `pulsar-client`, etc.)
> as external Maven dependencies, not as source dependencies. This ensures
> connectors build against
> the published API and don't accidentally depend on internals.
> 
> # Detailed Design
> 
> ## Repository Structure
> 
> The new `pulsar-connectors` repository is forked from the main Pulsar
> repository to preserve
> git history, then trimmed to contain only connector-related modules.
> Connectors are promoted
> from nested `pulsar-io/<name>` paths to top-level `<name>/` directories for
> a flatter structure.
> 
> ## Build Configuration
> 
> The connectors repository has its own:
> - `settings.gradle.kts` with all connector modules
> - `gradle/libs.versions.toml` with connector-specific dependency versions
> - `pulsar-dependencies/` platform module pinning Pulsar artifact versions
> - `build.gradle.kts` root build with shared configuration
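> 
> As a rough sketch only (not taken from the actual repository), a
> `settings.gradle.kts` for the flattened layout could look like the following.
> The root project name and the subset of modules listed are assumptions based
> on the directory tree in the High Level Design section:
> ```kotlin
> // settings.gradle.kts (hypothetical sketch)
> rootProject.name = "pulsar-connectors" // assumed name
> 
> // Top-level connector modules, promoted from pulsar-io/<name> to <name>/
> include("aerospike", "cassandra", "kafka", "kinesis", "rabbitmq")
> 
> // Nested connector families keep their sub-modules
> include("debezium:core", "debezium:mysql", "debezium:postgres")
> include("jdbc:core", "jdbc:postgres")
> 
> // Platform module that pins Pulsar artifact versions
> include("pulsar-dependencies")
> ```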
> 
> Pulsar core artifacts are declared as dependencies with a configurable
> version:
> ```kotlin
> implementation("org.apache.pulsar:pulsar-io-core:${pulsarVersion}")
> ```
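> 
> One possible way to make the version configurable, shown as a sketch under
> assumptions: the `pulsarVersion` Gradle property name and the `4.0.0` default
> are placeholders, and the `pulsar-dependencies` wiring is a guess at how the
> platform module might be consumed, not the actual build:
> ```kotlin
> // <connector>/build.gradle.kts (hypothetical sketch)
> plugins {
>     `java-library`
> }
> 
> // Read -PpulsarVersion=... from the command line or gradle.properties;
> // the fallback value is a placeholder, not a recommendation.
> val pulsarVersion: String =
>     providers.gradleProperty("pulsarVersion").getOrElse("4.0.0")
> 
> dependencies {
>     // Align Pulsar artifacts through the platform module (assumed wiring)
>     implementation(platform(project(":pulsar-dependencies")))
>     // Consumed as a published Maven artifact, not a source dependency
>     implementation("org.apache.pulsar:pulsar-io-core:$pulsarVersion")
> }
> ```
> With this shape, building against another Pulsar release would be
> `./gradlew build -PpulsarVersion=<version>`.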
> 
> ## Versioning Strategy
> 
> The initial release of `pulsar-connectors` will use the same version as the
> next Pulsar
> release (whether that is 4.3 or 5.0), to make the transition clear. After
> that, the
> connectors repository follows its own independent release cadence.
> All connectors are released together as a single release (not
> individually), and each
> release specifies which Pulsar versions it is compatible with.
> 
> ## Docker Image Changes
> 
> The `pulsar-all` image is removed. It bundled all connector NARs alongside
> the broker,
> producing a very large image that most deployments didn't need. The main
> reason users chose
> `pulsar-all` over `pulsar` was to get the tiered-storage offloaders. With
> this change:
> 
> - Tiered-storage offloader NARs move into the `pulsar` image, eliminating
> the primary reason
>   for `pulsar-all` to exist
> - The `pulsar` Docker image becomes the single official image, containing
> the broker, functions
>   runtime, and tiered-storage offloader NARs
> - Users who need specific connectors can build tailored images by adding
> just the connector
>   NARs they need on top of `apachepulsar/pulsar`, or mount them via volume
> mounts
> 
> ## CI and Testing
> 
> - The main Pulsar repository's CI no longer builds or tests connectors
> - The connectors repository has its own CI that builds and tests all
> connectors
> - Integration tests that exercise specific connectors (e.g., Cassandra
> sink, Kafka source)
>   move to the connectors repository
> - The main repository retains integration tests using `data-generator` for
> testing the
>   connector loading and runtime machinery
> 
> ## Migration for Users
> 
> Users who currently use `pulsar-all` Docker image:
> 1. Switch to the `pulsar` Docker image
> 2. Download needed connector NARs from the connectors release
> 3. Mount NARs into the container (e.g., via volume mount to
> `/pulsar/connectors/`)
> 
> Users who build from source:
> 1. Build the main Pulsar repository as before (faster, since connectors are
> gone)
> 2. Build the connectors repository separately if needed
> 
> ## Public-facing Changes
> 
> ### Docker Images
> 
> | Before | After |
> |--------|-------|
> | `pulsar` — core only | `pulsar` — core + tiered-storage offloaders |
> | `pulsar-all` — core + all connectors + offloaders | *(removed)* |
> 
> ### Artifacts
> 
> - All connector NARs move from the main Pulsar release to a single unified
> release from
>   the `pulsar-connectors` repository
> - All other Pulsar artifacts remain unchanged
> 
> ### Configuration
> 
> No changes to broker, client, or functions worker configuration.
> 
> # Backward & Forward Compatibility
> 
> ## Backward Compatibility
> 
> The connector API (`pulsar-io-core`) does not change. Existing connector
> NARs continue
> to work with the functions worker without modification.
> 
> The `pulsar-io-core` API has been very stable for years with no breaking
> changes, so connectors
> built against older API versions will continue to work with newer Pulsar
> releases and vice versa.
> 
> ## Forward Compatibility
> 
> New connector releases can target older Pulsar versions, as long as the
> `pulsar-io-core`
> API they depend on is compatible. Given the long track record of API
> stability, this is
> expected to work seamlessly across Pulsar 4.x releases.
> 
> # Security Considerations
> 
> No negative security implications. Connectors continue to be loaded through the same
> NAR classloader
> isolation mechanism. The split does not change the security model.
> 
> Separating connector dependencies from the main repository actually
> improves security posture
> by reducing the attack surface of the core Pulsar build and making
> connector dependency
> updates independently releasable.
> 
> --
> Matteo Merli
> <[email protected]>
> 
