How would you not make incompatible changes in 3.x? As discussed, the InternalRow API is not stable and needs to change.
On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue <rb...@netflix.com> wrote:

> > Making downstream implementations diverge heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience
>
> You're right that the API has been evolving in the 2.x line. But it is now reasonably stable with respect to the current feature set, and we should not need to break compatibility in the 3.x line. Because we have reached our goals for the 3.0 release, we can backport at least those features to 2.x and confidently have an API that works in a 2.x release and is compatible with 3.0, if not 3.1 and later releases as well.
>
> > I'd rather say preparation of Spark 2.5 should start after Spark 3.0 is officially released
>
> The reason I'm suggesting this is that I'm already going to do the work to backport the 3.0 release features to 2.4. I've been asked by several people when DSv2 will be released, so I know there is a lot of interest in making this available sooner than 3.0. If I'm already doing the work, then I'd be happy to share that with the community.
>
> I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5 while preparing the 3.0 preview and fixing bugs. For DSv2, the work is about complete, so we can easily release the same set of features and API in 2.5 and 3.0.
>
> If we decide for some reason to wait until after 3.0 is released, I don't know that there is much value in a 2.5. The purpose is to be a step toward 3.0, and releasing that step after 3.0 doesn't seem helpful to me. It also wouldn't get these features out any sooner than 3.0, as a 2.5 release probably would, given the work needed to validate the incompatible changes in 3.0.
>
> > the DSv2 change would be the major backward incompatibility that makes Spark 2.x users hesitate to upgrade
>
> As I pointed out, DSv2 has been changing in the 2.x line, so this is expected. I don't think it will need incompatible changes in the 3.x line.
>
> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim <kabh...@gmail.com> wrote:
>
>> Just 2 cents. I haven't tracked the changes to DSv2 (though I've had to deal with them, since they caused confusion on my PRs...), but my bet is that DSv2 has already changed in an incompatible way, at least for anyone who maintains a custom DataSource. Making downstream implementations diverge heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience - especially since we haven't completely closed off the chance to modify DSv2 further, and such a change could be backward incompatible.
>>
>> If we really want to bring the DSv2 change to the 2.x version line so that end users aren't forced to upgrade to Spark 3.x to get the new DSv2, I'd rather say preparation of Spark 2.5 should start after Spark 3.0 is officially released - honestly, even later than that, say, after getting some reports about DSv2 from Spark 3.0 so that we feel DSv2 is OK. I hope we don't make Spark 2.5 a kind of "tech preview" that Spark 2.4 users would be frustrated to upgrade to as their next minor version.
>>
>> Btw, do we have any specific target users for this? Personally, I'd expect the DSv2 change to be the major backward incompatibility that makes Spark 2.x users hesitate to upgrade, so anyone prepared to migrate to the new DSv2 is probably already prepared to migrate to Spark 3.0.
>>
>> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>
>>> Do you mean you want to have a breaking API change between 3.0 and 3.1?
>>> I believe we follow Semantic Versioning (https://spark.apache.org/versioning-policy.html).
>>>
>>> > We just won't add any breaking changes before 3.1.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>
>>>> > I don't think we need to gate a 3.0 release on making a more stable version of InternalRow
>>>>
>>>> Sounds like we agree, then. We will use it for 3.0, but there are known problems with it.
>>>>
>>>> > Thinking we'd have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to break certain APIs) and 2.x seems like a false premise.
>>>>
>>>> Why do you think we will need to break certain APIs before 3.0?
>>>>
>>>> I'm only suggesting that we release the same support in a 2.5 release that we do in 3.0. Since we are nearly finished with the 3.0 goals, it seems like we can certainly do that. We just won't add any breaking changes before 3.1.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>>> I don't think we need to gate a 3.0 release on making a more stable version of InternalRow, but thinking we'd have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to break certain APIs) and 2.x seems like a false premise.
>>>>>
>>>>> To point out some problems with InternalRow, which you think is already pragmatic and stable:
>>>>>
>>>>> The class is in catalyst, which states (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala):
>>>>>
>>>>> /**
>>>>>  * Catalyst is a library for manipulating relational query plans. All classes in catalyst are
>>>>>  * considered an internal API to Spark SQL and are subject to change between minor releases.
>>>>>  */
>>>>>
>>>>> There is not even an annotation on the interface.
>>>>>
>>>>> The entire dependency chain was created to be private and is tightly coupled with internal implementations. For example, https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
>>>>>
>>>>> /**
>>>>>  * A UTF-8 String for internal Spark use.
>>>>>  * <p>
>>>>>  * A String encoded in UTF-8 as an Array[Byte], which can be used for comparison,
>>>>>  * search, see http://en.wikipedia.org/wiki/UTF-8 for details.
>>>>>  * <p>
>>>>>  * Note: This is not designed for general use cases, should not be used outside SQL.
>>>>>  */
>>>>>
>>>>> And https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala (which, again, is in the catalyst package).
>>>>>
>>>>> If you want to argue this way, you might as well argue we should make the entire catalyst package public to be pragmatic and not allow any changes.
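[To make the coupling described above concrete, here is a minimal sketch in Scala of what an external DSv2 source typically writes to hand a row to Spark. GenericInternalRow, GenericArrayData, and UTF8String are the real classes linked above; the toRow helper itself is hypothetical, and the imports are the point.]

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
    import org.apache.spark.sql.catalyst.util.GenericArrayData
    import org.apache.spark.unsafe.types.UTF8String

    // Hypothetical conversion of one external record into a row Spark can
    // consume. Both packages imported here (catalyst and unsafe) are
    // documented as internal to Spark SQL.
    def toRow(name: String, scores: Seq[Int]): InternalRow = {
      new GenericInternalRow(Array[Any](
        UTF8String.fromString(name),   // strings must be UTF8String, not java.lang.String
        new GenericArrayData(scores)   // arrays must be ArrayData, not Seq or Array
      ))
    }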
>>>>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>>> > When you created the PR to make InternalRow public
>>>>>>
>>>>>> This isn't quite accurate. The change I made was to use InternalRow instead of UnsafeRow, which is a specific implementation of InternalRow. Exposing this API has always been a part of DSv2, and while both you and I did some work to avoid this, we are still in the phase of starting with that API.
>>>>>>
>>>>>> Note that any change to InternalRow would be very costly to implement because this interface is widely used. That is why I think we can certainly consider it stable enough to use here, and that's probably why UnsafeRow was part of the original proposal.
>>>>>>
>>>>>> In any case, the goal for 3.0 was not to replace the use of InternalRow; it was to get the majority of SQL working on top of the interface added after 2.4. That's done and stable, so I think a 2.5 release with it is also reasonable.
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <r...@databricks.com> wrote:
>>>>>>
>>>>>>> To push back: while I agree we should not drastically change "InternalRow", there are a lot of changes that need to happen to make it stable. For example, none of the publicly exposed interfaces should be in the catalyst package or the unsafe package. External implementations should be decoupled from the internal implementations, with cheap ways to convert back and forth.
>>>>>>>
>>>>>>> When you created the PR to make InternalRow public, the understanding was that we would work towards making it stable in the future, starting with an unstable API temporarily. You can't just take a bunch of internal APIs that are tightly coupled with other internal pieces, make them public and stable, and call it a day, just because they happen to satisfy some use cases temporarily, assuming the rest of Spark doesn't change.
>>>>>>>
>>>>>>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>
>>>>>>>> > DSv2 is far from stable, right?
>>>>>>>>
>>>>>>>> No, I think it is reasonably stable and very close to being ready for a release.
>>>>>>>>
>>>>>>>> > All the actual data types are unstable and you guys have completely ignored that.
>>>>>>>>
>>>>>>>> I think what you're referring to is the use of `InternalRow`. That's a stable API, and there has been no work to avoid using it. In any case, I don't think that anyone is suggesting that we delay 3.0 until a replacement for `InternalRow` is added, right?
>>>>>>>>
>>>>>>>> While I understand the motivation for a better solution here, I think the pragmatic solution is to continue using `InternalRow`.
>>>>>>>>
>>>>>>>> > If the goal is to make DSv2 work across 3.x and 2.x, that seems too invasive of a change to backport once you consider the parts needed to make dsv2 stable.
>>>>>>>>
>>>>>>>> I believe that those of us working on DSv2 are confident about the current stability. We set goals for what to get into the 3.0 release months ago and have very nearly reached the point where we are ready for that release.
>>>>>>>>
>>>>>>>> I don't think instability would be a problem in maintaining compatibility between the 2.5 version and the 3.0 version. If we find that we need to make API changes (other than additions), then we can make those in the 3.1 release. The goals we set for the 3.0 release have been reached with the current API, so if we are ready to release 3.0, we can release a 2.5 with the same API.
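[For context on why InternalRow sits on the DSv2 API boundary at all: in the v2 read path, each partition reader returns rows to Spark, so the row type is part of what every connector compiles against. A skeleton sketch, assuming the 3.0-era PartitionReader interface; the reader class and its iterator input are made up for illustration.]

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.read.PartitionReader

    // Skeleton DSv2 partition reader: get() returns InternalRow, so any
    // incompatible change to InternalRow is a source-level break for every
    // connector implementing this interface.
    class SketchPartitionReader(rows: Iterator[InternalRow])
        extends PartitionReader[InternalRow] {
      private var current: InternalRow = _

      override def next(): Boolean = {
        val hasNext = rows.hasNext
        if (hasNext) current = rows.next()
        hasNext
      }

      override def get(): InternalRow = current

      override def close(): Unit = ()   // nothing to release in this sketch
    }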
>>>>>>>> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <r...@databricks.com> wrote:
>>>>>>>>
>>>>>>>>> DSv2 is far from stable, right? All the actual data types are unstable, and you guys have completely ignored that. We'd need to work on that, and it will be a breaking change. If the goal is to make DSv2 work across 3.x and 2.x, that seems too invasive of a change to backport once you consider the parts needed to make dsv2 stable.
>>>>>>>>>
>>>>>>>>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> In the DSv2 sync this week, we talked about a possible Spark 2.5 release based on the latest Spark 2.4, but with DSv2 and Java 11 support added.
>>>>>>>>>>
>>>>>>>>>> A Spark 2.5 release with these two additions will help people migrate to Spark 3.0 when it is released, because they will be able to use a single implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly, upgrading to 3.0 won't also require updating to Java 11, because users could move to Java 11 with the 2.5 release and face fewer major changes at once.
>>>>>>>>>>
>>>>>>>>>> Another reason to consider a 2.5 release is that many people are interested in a release with the latest DSv2 API and support for DSv2 SQL. I'm already going to be backporting DSv2 support to the Spark 2.4 line, so it makes sense to share this work with the community.
>>>>>>>>>>
>>>>>>>>>> This release line would consist only of backports, like DSv2 and Java 11, that assist compatibility, to keep the scope of the release small. The purpose is to assist people moving to 3.0, not to distract from the 3.0 release.
>>>>>>>>>>
>>>>>>>>>> Would a Spark 2.5 release help anyone else? Are there any concerns about this plan?
>>>>>>>>>>
>>>>>>>>>> rb
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Software Engineer
>>>>>>>>>> Netflix
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>
>> --
>> Name : Jungtaek Lim
>> Blog : http://medium.com/@heartsavior
>> Twitter : http://twitter.com/heartsavior
>> LinkedIn : http://www.linkedin.com/in/heartsavior
>
> --
> Ryan Blue
> Software Engineer
> Netflix