Hi Spark community,
I'd like to propose creating a dedicated Apache Spark Connect client for
Scala 3, maintained as a separate project from the core Spark repository.
This initiative aims to provide first-class Scala 3 support while
maintaining full compatibility with the Spark Connect protocol.
Motivation
The Scala ecosystem is rapidly adopting Scala 3, with many organizations
and libraries making the transition. Spark's Scala support is currently
limited to 2.12 and 2.13 (Spark 4.x builds only against 2.13), and adding
Scala 3 support to the main repository presents several challenges:
1. *Cross-compilation complexity*: Supporting Scala 2.12, 2.13, and 3.x
simultaneously significantly increases build complexity and maintenance
burden
2. *Language feature utilization*: Scala 3's new features (contextual
abstractions, union types, improved metaprogramming) cannot be fully
leveraged in cross-compiled code
3. *Dependency management*: Different Scala versions often require
different dependency versions, complicating the build
4. *Development velocity*: Changes require extensive testing across all
Scala versions
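To make point 2 concrete, here is a small, self-contained sketch of the kind of Scala 3 code that cannot appear in sources cross-compiled with 2.12/2.13. The names (`CellValue`, `Encoder`, `render`, `write`) are purely illustrative and not part of any Spark API:

```scala
// Illustrative only: Scala 3 features unavailable to cross-compiled code.

// Union types: a heterogeneous value without an ADT wrapper.
type CellValue = String | Long | Double

def render(v: CellValue): String = v match
  case s: String => s
  case l: Long   => l.toString
  case d: Double => f"$d%.2f"

// Contextual abstractions: `given`/`using` replace Scala 2 implicits.
trait Encoder[T]:
  def encode(t: T): String

given Encoder[Int] with
  def encode(t: Int): String = t.toString

def write[T](t: T)(using e: Encoder[T]): String = e.encode(t)

@main def demo(): Unit =
  println(render(42L))   // 42
  println(write(7))      // 7
```

A Scala-3-only client could use such constructs freely throughout its public API, whereas cross-built code must fall back to the 2.13-compatible subset.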
Why Spark Connect?
Spark Connect's decoupled architecture makes it an ideal candidate for this
approach:
- *Protocol-based*: Communication via gRPC means the client and server
can use different Scala versions
- *Reduced surface area*: The Connect client covers a much narrower API
surface than full Spark
- *Clear compatibility target*: Protocol specification provides clear
compatibility requirements
- *Growing adoption*: Spark Connect is becoming the recommended way to
build Spark applications
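Because the client talks to the server purely over gRPC, connecting looks like the following sketch (this mirrors the existing JVM Spark Connect client's `SparkSession.builder().remote(...)` API; the endpoint is an assumption and a running Spark Connect server is required):

```scala
// Sketch: a Spark Connect client only speaks the gRPC protocol, so its
// Scala version is independent of the server's Scala version.
// Assumes a Spark Connect server is listening at sc://localhost:15002.
import org.apache.spark.sql.SparkSession

@main def connectDemo(): Unit =
  val spark = SparkSession
    .builder()
    .remote("sc://localhost:15002") // gRPC endpoint, not a JVM-local master
    .getOrCreate()

  spark.range(5).show() // plan is serialized, executed server-side

  spark.stop()
```

Nothing in this flow requires the client and server to share a Scala binary version, which is what makes a standalone Scala 3 client feasible.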
Proposed Approach
1. *Separate Repository*: Create a new repository (e.g.,
apache/spark-connect-scala3 or within existing Spark org structure)
2. *Independent Release Cycle*: Versioned against Spark Connect
protocol versions rather than Spark releases
- Example: Client 1.0.x supports Connect Protocol 4.0
- Example: Client 1.1.x supports Connect Protocol 4.1
3. *Governance*:
- Maintain under Apache Spark project governance
- Start with dedicated maintainers interested in Scala 3
- Regular sync with Spark Connect core team
4. *Scope*:
- Full DataFrame and Dataset API support
- SQL interface
- UDF support with Scala 3 features
- Streaming capabilities
- Client-side only (no server/cluster changes)
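As a sketch of the UDF point above: a Scala 3 client could accept plain Scala 3 lambdas, as below. Note this assumes the client implements UDF serialization for Scala 3 closures, which is one of the harder open problems in this proposal, not an existing capability; the endpoint and column names are illustrative:

```scala
// Hypothetical sketch of a Scala 3 UDF against a Spark Connect server.
// Assumes Scala 3 lambda serialization is solved client-side.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

@main def udfDemo(): Unit =
  val spark = SparkSession
    .builder()
    .remote("sc://localhost:15002") // assumed server endpoint
    .getOrCreate()

  // `toIntOption` (Scala 2.13+/3 stdlib) makes the parse failure explicit.
  val toInt = udf((s: String) => s.toIntOption.getOrElse(-1))

  spark.range(3)
    .select(col("id").cast("string").as("raw"))
    .select(toInt(col("raw")).as("parsed"))
    .show()

  spark.stop()
```

Getting this path right for Scala 3 closures is precisely where a dedicated client can move faster than a cross-built one.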
Benefits
*For Users:*
- Native Scala 3 development experience
- Access to modern Scala ecosystem and tooling
- Improved compile times and IDE support
- Gradual migration path from Scala 2.x applications
*For Spark Project:*
- Expanded reach into Scala 3 community
- Reduced complexity in core repository
- Testing ground for new client-side features
- Community-driven development reducing core team burden
Potential Concerns and Mitigation
*Fragmentation:* We'll ensure strict protocol compatibility and extensive
testing against Spark releases. The API will remain familiar to existing
Spark users.
*Maintenance:* By engaging the Scala 3 community early and establishing
clear contribution guidelines, we can build a sustainable maintenance model.
*Duplication:* While some code structure will be similar, the
implementation can leverage Scala 3 features for cleaner, more maintainable
code.
Next Steps
If there's interest, I propose:
1. Gathering feedback on this approach
2. Creating a detailed SPIP (Spark Project Improvement Proposal) if
consensus is positive
3. Setting up initial project structure with interested contributors
4. Developing a proof-of-concept implementation
Looking forward to your thoughts and feedback. I believe this approach
balances the needs of the Scala 3 community with the stability requirements
of the core Spark project.