Hey all,

There was talk earlier this year about resurrecting the effort to add direct 
Spark readers and writers to Druid. Rather than repeat the previous attempt and 
parachute in with updated connectors, I’d like to start by building a little 
more consensus around what the Druid dev community wants as potential 
maintainers.

To begin with, I want to solicit opinions on two topics:

1. Should these connectors be written in Scala or Java? The benefit of Scala is 
that the existing connectors are written in Scala, as are most open-source 
references for Spark DataSource V2 implementations. The benefits of Java are 
that Druid is written in Java, so engineers interested in contributing to Druid 
wouldn’t need to switch between languages, and that existing tooling, static 
checkers, etc. could be applied with minimal effort, keeping code style and 
developer ergonomics consistent across Druid instead of maintaining an 
alternate Scala toolchain. (A rough sketch of the DataSource V2 entry points in 
Java follows after this list.)

2. Which Spark version should this effort target? The most recently released 
version of Spark is 3.4.1. Should we aim to integrate with the latest minor 
version, on the assumption that this gives us the longest window of support, or 
should we build against an older minor line (3.3? 3.2?), since most Spark users 
tend to lag? For reference, there are currently three stable Spark release 
versions: 3.2.4, 3.3.2, and 3.4.1. From a user’s point of view, the API is 
mostly compatible across a major version (i.e., 3.x), while developer APIs such 
as the ones we would use to build these connectors can change between minor 
versions.
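
For concreteness, here’s a rough, hypothetical sketch of what the DataSource V2 
entry points would look like in Java against Spark 3.x. The class names, the 
hard-coded schema, and the "table" option below are placeholders of mine, not a 
proposed design; the point is just to show which interfaces (TableProvider, 
Table/SupportsRead, ScanBuilder) a connector has to implement regardless of 
language, with the real work hanging off the scan builder and the (omitted) 
write path.

// Hypothetical sketch only: class names and options are placeholders, not a
// proposed design. It shows the shape of the Spark 3.x DataSource V2 entry
// points (org.apache.spark.sql.connector.*) a Java implementation would provide.
import java.util.Collections;
import java.util.Map;
import java.util.Set;

import org.apache.spark.sql.connector.catalog.SupportsRead;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableCapability;
import org.apache.spark.sql.connector.catalog.TableProvider;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.connector.read.ScanBuilder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

// Entry point Spark instantiates for the data source.
public class DruidTableProvider implements TableProvider {
  @Override
  public StructType inferSchema(CaseInsensitiveStringMap options) {
    // A real implementation would derive the datasource's schema from Druid
    // (e.g. segment metadata); hard-coded here to keep the sketch compilable.
    return new StructType().add("__time", DataTypes.TimestampType);
  }

  @Override
  public Table getTable(StructType schema, Transform[] partitioning, Map<String, String> properties) {
    return new DruidTable(schema, properties.get("table"));
  }

  // A Table describes one Druid datasource and advertises its capabilities.
  static class DruidTable implements Table, SupportsRead {
    private final StructType schema;
    private final String name;

    DruidTable(StructType schema, String name) {
      this.schema = schema;
      this.name = name;
    }

    @Override
    public String name() {
      return name;
    }

    @Override
    public StructType schema() {
      return schema;
    }

    @Override
    public Set<TableCapability> capabilities() {
      return Collections.singleton(TableCapability.BATCH_READ);
    }

    @Override
    public ScanBuilder newScanBuilder(CaseInsensitiveStringMap options) {
      // The interesting work (segment discovery, filter/column pushdown,
      // partition readers) lives behind this; omitted from the sketch.
      throw new UnsupportedOperationException("read path not sketched here");
    }
  }
}

A user would reach this through spark.read.format(...) with whatever short name 
we register; a Scala version would be structurally identical, so the question is 
mostly about which toolchain and contributor base we want to optimize for.
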
There are quite a few nuances and trade-offs inherent to the decisions above, 
and my hope is that by hashing these choices out before presenting an 
implementation, we can build buy-in from the Druid maintainer community that 
will help this effort succeed where the first attempt failed.

Thanks,
Julian
