To sum up what we have so far:

Option 1 (support just the most recent minor Spark 3 version)

The easiest option for us devs, but it forces users to upgrade to the most recent 
minor Spark version to consume any new Iceberg features.

Option 2 (a separate project under Iceberg)

Can support as many Spark versions as needed, and the codebases stay separate 
since we can use separate branches.
However, it is impossible to consume any unreleased changes in core, which may 
slow down development.

Option 3 (separate modules for Spark 3.1/3.2)

Introduce more modules in the same project.
Can consume unreleased changes in core, but it will require at least 5 modules to 
support 2.4, 3.1 and 3.2, making the build and testing more complicated.


Are there any users for whom upgrading the minor Spark version (e.g. from 3.1 to 
3.2) to consume new features is a blocker?
We follow Option 1 internally at the moment, but I would like to hear what other 
people think/need.

- Anton


> On 14 Sep 2021, at 09:44, Russell Spitzer <russell.spit...@gmail.com> wrote:
> 
> I think we should go for option 1. I am already not a big fan of having 
> runtime errors for unsupported things based on versions, and I don't think 
> minor version upgrades are a large issue for users. I'm especially not 
> looking forward to supporting interfaces that only exist in Spark 3.2 in a 
> future where we support multiple Spark versions.
> 
>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> wrote:
>> 
>>> First of all, is option 2 a viable option? We discussed separating the 
>>> Python module out of the project a few weeks ago and decided not to do 
>>> that because it's beneficial for code cross-referencing and more intuitive 
>>> for new developers to see everything in the same repository. I would expect 
>>> the same argument to also hold here. 
>> 
>> That’s exactly the concern I have about Option 2 at this moment.
>> 
>>> Overall I would personally prefer us to not support all the minor versions, 
>>> but instead support maybe just 2-3 latest versions in a major version. 
>> 
>> This is when it gets a bit complicated. If we want to support both Spark 3.1 
>> and Spark 3.2 with a single module, it means we have to compile against 3.1. 
>> The problem is that we rely on DSv2, which is being actively developed: 3.2 
>> and 3.1 have substantial differences. On top of that, we have our extensions, 
>> which are extremely low-level and may break not only between minor versions 
>> but also between patch releases.
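>> 
>> To make the gap concrete, here is a rough, hypothetical sketch (the class 
>> name is made up) of a DSv2 scan using SupportsRuntimeFiltering, the 
>> interface Spark 3.2 added for dynamic filtering. A module compiled against 
>> Spark 3.1 cannot even reference this interface, so a single shared module 
>> cannot offer the feature:
>> 
>>   // Illustration only: SupportsRuntimeFiltering exists in Spark 3.2, not 3.1.
>>   import org.apache.spark.sql.connector.expressions.NamedReference;
>>   import org.apache.spark.sql.connector.read.SupportsRuntimeFiltering;
>>   import org.apache.spark.sql.sources.Filter;
>>   import org.apache.spark.sql.types.StructType;
>> 
>>   class ExampleRuntimeFilteringScan implements SupportsRuntimeFiltering {
>>     private final StructType schema;
>> 
>>     ExampleRuntimeFilteringScan(StructType schema) {
>>       this.schema = schema;
>>     }
>> 
>>     @Override
>>     public StructType readSchema() {
>>       return schema;
>>     }
>> 
>>     @Override
>>     public NamedReference[] filterAttributes() {
>>       // attributes Spark may push runtime filters on;
>>       // a real scan would return e.g. its partition column references
>>       return new NamedReference[0];
>>     }
>> 
>>     @Override
>>     public void filter(Filter[] filters) {
>>       // a real scan would prune its file/partition tasks using the pushed filters
>>     }
>>   }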
>> 
>>> If there are some features requiring a newer version, it makes sense to move 
>>> that newer version into master.
>> 
>> Internally, we don’t deliver new features to older Spark versions as it 
>> requires a lot of effort to port things. Personally, I don’t think it is too 
>> bad to require users to upgrade if they want new features. At the same time, 
>> there are valid concerns with this approach too, which we mentioned during 
>> the sync. For example, certain new features would also work fine with older 
>> Spark versions. I generally agree with that, and not supporting recent 
>> versions is not ideal. However, I want to find a balance between the 
>> complexity on our side and ease of use for the users. Ideally, supporting a 
>> few recent versions would be sufficient, but our Spark integration is too 
>> low-level to do that with a single module.
>>  
>> 
>>> On 13 Sep 2021, at 20:53, Jack Ye <yezhao...@gmail.com> wrote:
>>> 
>>> First of all, is option 2 a viable option? We discussed separating the 
>>> Python module out of the project a few weeks ago and decided not to do 
>>> that because it's beneficial for code cross-referencing and more intuitive 
>>> for new developers to see everything in the same repository. I would expect 
>>> the same argument to also hold here. 
>>> 
>>> Overall I would personally prefer us to not support all the minor versions, 
>>> but instead support maybe just 2-3 latest versions in a major version. This 
>>> avoids the problem that some users are unwilling to move to a newer version 
>>> and keep patching old Spark version branches. If there are some features 
>>> requiring a newer version, it makes sense to move that newer version into 
>>> master.
>>> 
>>> In addition, because currently Spark is considered the most 
>>> feature-complete reference implementation compared to all other engines, I 
>>> think we should not add artificial barriers that would slow down its 
>>> development speed.
>>> 
>>> So my thinking is closer to option 1.
>>> 
>>> Best,
>>> Jack Ye
>>> 
>>> 
>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
>>> Hey folks,
>>> 
>>> I want to discuss our Spark version support strategy.
>>> 
>>> So far, we have tried to support both 3.0 and 3.1. It is great to support 
>>> older versions, but because we compile against 3.0, we cannot use any Spark 
>>> features that are offered in newer versions.
>>> Spark 3.2 is just around the corner and it brings a lot of important 
>>> features such as dynamic filtering for v2 tables, required distribution and 
>>> ordering for writes, etc. These features are too important to ignore.
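>>> 
>>> To give a flavour of what these APIs look like, here is a rough, 
>>> hypothetical sketch (the class name and the partition column are made up) 
>>> of a DSv2 write using RequiresDistributionAndOrdering, the Spark 3.2 
>>> interface behind required distribution and ordering for writes:
>>> 
>>>   // Illustration only: this interface is new in Spark 3.2.
>>>   import org.apache.spark.sql.connector.distributions.Distribution;
>>>   import org.apache.spark.sql.connector.distributions.Distributions;
>>>   import org.apache.spark.sql.connector.expressions.Expression;
>>>   import org.apache.spark.sql.connector.expressions.Expressions;
>>>   import org.apache.spark.sql.connector.expressions.SortOrder;
>>>   import org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering;
>>> 
>>>   class ExampleClusteredWrite implements RequiresDistributionAndOrdering {
>>>     @Override
>>>     public Distribution requiredDistribution() {
>>>       // ask Spark to cluster incoming rows by the "dt" partition column
>>>       return Distributions.clustered(
>>>           new Expression[] { Expressions.identity("dt") });
>>>     }
>>> 
>>>     @Override
>>>     public SortOrder[] requiredOrdering() {
>>>       // no within-task ordering in this sketch; a real write could sort
>>>       // by the table's sort keys here
>>>       return new SortOrder[0];
>>>     }
>>>   }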
>>> 
>>> Apart from that, I have an end-to-end prototype for merge-on-read with 
>>> Spark that actually leverages some of the 3.2 features. I’ll be 
>>> implementing all new Spark DSv2 APIs for us internally and would love to 
>>> share that with the rest of the community.
>>> 
>>> I see two options to move forward:
>>> 
>>> Option 1
>>> 
>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing 
>>> minor versions with bug fixes.
>>> 
>>> Pros: almost no changes to the build configuration, no extra work on our 
>>> side as just a single Spark version is actively maintained.
>>> Cons: some new features that we will be adding to master could also work 
>>> with older Spark versions, but all 0.12 releases will only contain bug 
>>> fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume 
>>> any new Spark or format features.
>>> 
>>> Option 2
>>> 
>>> Move our Spark integration into a separate project and introduce branches 
>>> for 3.0, 3.1 and 3.2.
>>> 
>>> Pros: decouples the format version from Spark; we can support as many Spark 
>>> versions as needed.
>>> Cons: more work initially to set everything up, more work to release, and 
>>> the Spark integration will need a new release of the core format to consume 
>>> any unreleased changes in core.
>>> 
>>> Overall, I think option 2 seems better for the user, but my main worry is 
>>> that we will have to release the format more frequently (which is a good 
>>> thing but requires more work and time) and that overall Spark development 
>>> may be slower.
>>> 
>>> I’d love to hear what everybody thinks about this matter.
>>> 
>>> Thanks,
>>> Anton
>> 
> 
