I agree that Option 2 is considerably more difficult for development when core API changes need to be picked up by the external Spark module. I also think a monthly release would probably still be prohibitive to actually implementing new features that appear in the API. I would hope we have a much faster process, or maybe just have snapshot artifacts published nightly?
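For example, if nightly snapshots of the core modules were published to the ASF snapshots repository, the external Spark module could track unreleased core changes with something like the following (a rough sketch; the repository URL is the standard ASF snapshots repo, but the exact coordinates and snapshot version are hypothetical):

    // build.gradle of the external Spark module (sketch)
    repositories {
        mavenCentral()
        // where nightly snapshots of core would be published
        maven { url "https://repository.apache.org/content/repositories/snapshots/" }
    }
    dependencies {
        implementation "org.apache.iceberg:iceberg-api:0.13.0-SNAPSHOT"
        implementation "org.apache.iceberg:iceberg-core:0.13.0-SNAPSHOT"
    }

That would let the subproject pick up a core API change within a day instead of waiting for the next release.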
> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <wyp...@cloudera.com.INVALID> wrote:
>
> IIUC, Option 2 is to move the Spark support for Iceberg into a separate repo (a subproject of Iceberg). Would we have branches such as 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be supported in all versions, or in all Spark 3 versions, we would need to commit the changes to every applicable branch. Basically, we would be trading more work to commit to multiple branches for simplified build and CI time per branch, which might be an acceptable trade-off. However, the biggest downside is that changes may need to be made in core Iceberg as well as in the engine (in this case Spark) support, and we would need to wait for a release of core Iceberg to consume those changes in the subproject. In that case, maybe we should have a monthly release of core Iceberg (no matter how many changes go in, as long as it is non-zero) so that the subproject can consume changes fairly quickly?
>
> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <b...@tabular.io> wrote:
>
> Thanks for bringing this up, Anton. I’m glad that we have the set of potential solutions well defined.
>
> Looks like the next step is to decide whether we want to require people to update Spark versions to pick up newer versions of Iceberg. If we choose to make people upgrade, then option 1 is clearly the best choice.
>
> I don’t think that we should make updating Spark a requirement. Many of the things that we’re working on are orthogonal to Spark versions, like table maintenance actions, secondary indexes, the 1.0 API, views, ORC delete files, new storage implementations, etc. Upgrading Spark is time-consuming and risky in my experience, so I think we would be setting up an unnecessary trade-off between spending lots of time to upgrade Spark and picking up new Iceberg features.
>
> Another way of thinking about this is that if we went with option 1, then we could port bug fixes into 0.12.x. But there are many things that wouldn’t fit this model, like adding a FileIO implementation for ADLS. So some people in the community would have to maintain branches of newer Iceberg versions with older versions of Spark outside of the main Iceberg project. That defeats the purpose of simplifying things with option 1, because we would then have more people maintaining the same 0.13.x-with-Spark-3.1 branch. (This reminds me of the Spark community, where we wanted to release a 2.5 line with DSv2 backported, but the community decided not to, so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, etc.)
>
> If the community is going to do the work anyway — and I think some of us would — we should make it possible to share that work. That’s why I don’t think that we should go with option 1.
>
> If we don’t go with option 1, then the choice is how to maintain multiple Spark versions. I think that the way we’re doing it right now is not something we want to continue.
>
> Using multiple modules (option 3) is concerning to me because of the changes in Spark. We currently structure the library to share as much code as possible. But that means compiling against different Spark versions and relying on binary compatibility and reflection in some cases. To me, this seems unmaintainable in the long run because it requires refactoring common classes and spending a lot of time deduplicating code.
> It also creates a ton of modules: at least one common module, then a module per version, then an extensions module per version, and finally a runtime module per version. That’s 3 modules per Spark version, plus any new common modules. And each module needs to be tested, which is making our CI take a really long time. We also don’t support multiple Scala versions, which is another gap that will require even more modules and tests.
>
> I like option 2 because it would allow us to compile against a single version of Spark (which will be much more reliable). It would give us an opportunity to support different Scala versions. It avoids the need to refactor to share code and allows people to focus on a single version of Spark, while also creating a way for people to maintain and update the older versions with newer Iceberg releases. I don’t think that this would slow down development. I think it would actually speed it up, because we’d be spending less time trying to make multiple versions work in the same build. And anyone in favor of option 1 would basically get option 1: you don’t have to care about branches for older Spark versions.
>
> Jack makes a good point about wanting to keep code in a single repository, but I think that the need to manage more version combinations overrides this concern. It’s easier to make this decision in Python because we’re not trying to depend on two projects that change relatively quickly. We’re just trying to build a library.
>
> Ryan
>
> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <open...@gmail.com> wrote:
>
> Thanks for bringing this up, Anton.
>
> Everyone has good pros and cons to support their preferences. Before giving my preference, let me raise one question: what is the top-priority thing for the Apache Iceberg project at this point in time? That will help us answer the following question: should we support more engine versions more robustly, or be a bit more aggressive and concentrate on getting out the new features that users need most, in order to keep the project competitive?
>
> If people watch the Apache Iceberg project and check the issues and PRs frequently, I guess more than 90% of people will answer the priority question the same way: there is no doubt that it is making the whole v2 story production-ready. The current roadmap discussion also proves the point: https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>
> In order to ensure the highest priority at this point in time, I prefer option 1, to reduce the cost of engine maintenance and free up resources to make v2 production-ready.
>
> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <sai.sai.s...@gmail.com> wrote:
>
> From the dev's point of view, it is less of a burden to always support only the latest version of Spark (for example). But from the user's point of view, especially for those of us who maintain Spark internally, it is not easy to upgrade the Spark version in the first place (since we have many customizations internally), and we're still promoting the upgrade to 3.1.2. If the community ditches support for old versions of Spark 3, users unavoidably have to maintain it themselves.
> So I'm inclined to keep this support in the community, not with users themselves. As for Option 2 or 3, I'm fine with either. And to relieve the burden, we could support a limited number of Spark versions (for example, 2 versions).
>
> Just my two cents.
>
> -Saisai
>
> On Wed, Sep 15, 2021 at 1:35 PM, Jack Ye <yezhao...@gmail.com> wrote:
>
> Hi Wing Yew,
>
> I think 2.4 is a different story. We will continue to support Spark 2.4, but as you can see, it will continue to have very limited functionality compared to Spark 3. I believe we discussed option 3 when we were doing the Spark 3.0 to 3.1 upgrade. Recently we have seen the same issue for Flink 1.11, 1.12, and 1.13 as well. I feel we need a consistent strategy around this, so let's take this chance to make a good community guideline for all future engine versions, especially for Spark, Flink, and Hive, which are in the same repository.
>
> I can totally understand your point of view, Wing. In fact, speaking from the perspective of AWS EMR, we have to support over 40 versions of the software, because there are people who are still using Spark 1.4, believe it or not. After all, continually backporting changes becomes a liability not only on the user side but also on the service-provider side, so I believe it's not a bad practice to push for user upgrades, as it will make the lives of both parties easier in the end. A new feature is definitely one of the best incentives to promote an upgrade on the user side.
>
> I think the biggest issue with option 3 is its scalability, because we will have an unbounded list of packages to add and compile in the future, and we probably cannot drop support for a package once it is created. If we go with option 1, I think we can still publish a few patch versions for old Iceberg releases, and committers can control the number of patch versions to guard against abuse of the power of patching. I see this as a consistent strategy for Flink and Hive as well. With this strategy, we can truly have a compatibility matrix of engine versions against Iceberg versions.
>
> -Jack
>
> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:
>
> I understand and sympathize with the desire to use new DSv2 features in Spark 3.2. I agree that Option 1 is the easiest for developers, but I don't think it considers the interests of users. I do not think that most users will upgrade to Spark 3.2 as soon as it is released. It is a "minor version" upgrade in name from 3.1 (or from 3.0), but I think we all know that it is not a minor upgrade. There are a lot of changes from 3.0 to 3.1 and from 3.1 to 3.2. I think there are even a lot of users running Spark 2.4 and not yet on Spark 3. Do we also plan to stop supporting Spark 2.4?
>
> Please correct me if I'm mistaken, but the folks who have spoken out in favor of Option 1 all work for the same organization, don't they? And they don't have a problem with making their users, all internal, simply upgrade to Spark 3.2, do they? (Or they are already running an internal fork that is close to 3.2.)
>
> I work for an organization with customers running different versions of Spark. It is true that we can backport new features to older versions if we want to.
> I suppose the people contributing to Iceberg work for some organization or other that either uses Iceberg in-house or provides software (possibly in the form of a service) to customers, and either way, those organizations have the ability to backport features and fixes to internal versions. Are there any users out there who simply use Apache Iceberg and depend on the community version?
>
> There may be features that are broadly useful that do not depend on Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
>
> I am not in favor of Option 2. I do not oppose Option 1, but I would consider Option 3 too. Anton, you said 5 modules are required; what are the modules you're thinking of?
>
> - Wing Yew
>
> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <flyrain...@gmail.com> wrote:
>
> Option 1 sounds good to me. Here are my reasons:
>
> 1. Both 2 and 3 will slow down development. Considering the limited resources in the open-source community, the upsides of options 2 and 3 are probably not worth it.
> 2. Both 2 and 3 assume use cases that may not exist. It's hard to predict anything, but even if these use cases are legitimate, users can still get a new feature by backporting it to an older version in case upgrading to a newer version isn't an option.
>
> Best,
>
> Yufei
>
> `This is not a contribution`
>
> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
>
> To sum up what we have so far:
>
> Option 1 (support just the most recent minor Spark 3 version)
> The easiest option for us devs; forces the user to upgrade to the most recent minor Spark version to consume any new Iceberg features.
>
> Option 2 (a separate project under Iceberg)
> Can support as many Spark versions as needed, and the codebase is still separate since we can use separate branches. Impossible to consume any unreleased changes in core, which may slow down development.
>
> Option 3 (separate modules for Spark 3.1/3.2)
> Introduces more modules in the same project. Can consume unreleased changes, but it will require at least 5 modules to support 2.4, 3.1, and 3.2, making the build and testing complicated.
>
> Are there any users for whom upgrading the minor Spark version (e.g., 3.1 to 3.2) to consume new features is a blocker? We follow Option 1 internally at the moment, but I would like to hear what other people think/need.
>
> - Anton
>
>> On 14 Sep 2021, at 09:44, Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>> I think we should go for option 1. I already am not a big fan of having runtime errors for unsupported things based on versions, and I don't think minor version upgrades are a large issue for users. I'm especially not looking forward to supporting interfaces that only exist in Spark 3.2 in a multiple-Spark-version-support future.
>>
>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> wrote:
>>>
>>>> First of all, is option 2 a viable option? We discussed separating the python module out of the project a few weeks ago, and decided not to do that because it's beneficial for code cross-reference and more intuitive for new developers to see everything in the same repository. I would expect the same argument to also hold here.
>>>
>>> That’s exactly the concern I have about Option 2 at this moment.
>>>
>>>> Overall I would personally prefer us to not support all the minor versions, but instead support maybe just the 2-3 latest versions in a major version.
>>>
>>> This is where it gets a bit complicated. If we want to support both Spark 3.1 and Spark 3.2 with a single module, it means we have to compile against 3.1. The problem is that we rely on DSv2, which is being actively developed. 3.2 and 3.1 have substantial differences. On top of that, we have our extensions, which are extremely low-level and may break not only between minor versions but also between patch releases.
>>>
>>>> If there are some features requiring a newer version, it makes sense to move to that newer version in master.
>>>
>>> Internally, we don’t deliver new features to older Spark versions, as it requires a lot of effort to port things. Personally, I don’t think it is too bad to require users to upgrade if they want new features. At the same time, there are valid concerns with this approach too that we mentioned during the sync. For example, certain new features would also work fine with older Spark versions. I generally agree with that, and that not supporting recent versions is not ideal. However, I want to find a balance between the complexity on our side and ease of use for the users. Ideally, supporting a few recent versions would be sufficient, but our Spark integration is too low-level to do that with a single module.
>>>
>>>> On 13 Sep 2021, at 20:53, Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>> First of all, is option 2 a viable option? We discussed separating the python module out of the project a few weeks ago, and decided not to do that because it's beneficial for code cross-reference and more intuitive for new developers to see everything in the same repository. I would expect the same argument to also hold here.
>>>>
>>>> Overall I would personally prefer us to not support all the minor versions, but instead support maybe just the 2-3 latest versions in a major version. This avoids the problem of some users being unwilling to move to a newer version and keeping old Spark version branches patched. If there are some features requiring a newer version, it makes sense to move to that newer version in master.
>>>>
>>>> In addition, because Spark is currently considered the most feature-complete reference implementation compared to all other engines, I think we should not add artificial barriers that would slow down its development speed.
>>>>
>>>> So my thinking is closer to option 1.
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
>>>>
>>>> Hey folks,
>>>>
>>>> I want to discuss our Spark version support strategy.
>>>>
>>>> So far, we have tried to support both 3.0 and 3.1. It is great to support older versions, but because we compile against 3.0, we cannot use any Spark features that are offered in newer versions. Spark 3.2 is just around the corner, and it brings a lot of important features, such as dynamic filtering for v2 tables, required distribution and ordering for writes, etc. These features are too important to ignore.
>>>>
>>>> Apart from that, I have an end-to-end prototype for merge-on-read with Spark that actually leverages some of the 3.2 features.
>>>> I’ll be implementing all of the new Spark DSv2 APIs for us internally and would love to share that with the rest of the community.
>>>>
>>>> I see two options to move forward:
>>>>
>>>> Option 1
>>>>
>>>> Migrate to Spark 3.2 in master; maintain 0.12 for a while by releasing minor versions with bug fixes.
>>>>
>>>> Pros: almost no changes to the build configuration; no extra work on our side, as just a single Spark version is actively maintained.
>>>> Cons: some new features that we will be adding to master could also work with older Spark versions, but all 0.12 releases will only contain bug fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume any new Spark or format features.
>>>>
>>>> Option 2
>>>>
>>>> Move our Spark integration into a separate project and introduce branches for 3.0, 3.1, and 3.2.
>>>>
>>>> Pros: decouples the format version from Spark; we can support as many Spark versions as needed.
>>>> Cons: more work initially to set everything up; more work to release; will need a new release of the core format to consume any changes in the Spark integration.
>>>>
>>>> Overall, I think option 2 seems better for the user, but my main worry is that we will have to release the format more frequently (which is a good thing but requires more work and time) and that overall Spark development may be slower.
>>>>
>>>> I’d love to hear what everybody thinks about this matter.
>>>>
>>>> Thanks,
>>>> Anton
>>>
>>
>
> --
> Ryan Blue
> Tabular