I agree with Matei too.

Thanks,
Marco
On Sun, Sep 22, 2019 at 3:44 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> +1 for Matei's suggestion!
>
> Bests,
> Dongjoon.
>
> On Sat, Sep 21, 2019 at 5:44 PM Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>
>> If the goal is to get people to try the DSv2 API and build DSv2 data
>> sources, can we recommend the 3.0-preview release for this? That would get
>> people shifting to 3.0 faster, which is probably better overall compared to
>> maintaining two major versions. There’s not that much else changing in 3.0
>> if you already want to update your Java version.
>>
>> On Sep 21, 2019, at 2:45 PM, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>>
>> > If you insist we shouldn't change the unstable temporary API in 3.x . . .
>>
>> Not what I'm saying at all. I said we should carefully consider whether a
>> breaking change is the right decision in the 3.x line.
>>
>> All I'm suggesting is that we can make a 2.5 release with the feature and
>> an API that is the same as the one in 3.0.
>>
>> > I also don't get this backporting a giant feature to 2.x line
>>
>> I am planning to do this so we can use DSv2 before 3.0 is released. Then
>> we can have a source implementation that works in both 2.x and 3.0 to make
>> the transition easier. Since I'm already doing the work, I'm offering to
>> share it with the community.
>>
>> On Sat, Sep 21, 2019 at 2:36 PM Reynold Xin <r...@databricks.com> wrote:
>>
>>> Because for example we'd need to move the location of InternalRow,
>>> breaking the package name. If you insist we shouldn't change the unstable
>>> temporary API in 3.x to maintain compatibility with 3.0, which is totally
>>> different from my understanding of the situation when you exposed it, then
>>> I'd say we should gate 3.0 on having a stable row interface.
>>>
>>> I also don't get this backporting a giant feature to 2.x line ... as
>>> suggested by others in the thread, DSv2 would be one of the main reasons
>>> people upgrade to 3.0.
>>> What's so special about DSv2 that we are doing this?
>>> Why not abandon 3.0 entirely and backport all the features to 2.x?
>>>
>>> On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> Why would that require an incompatible change?
>>>>
>>>> We *could* make an incompatible change and remove support for
>>>> InternalRow, but I think we would want to carefully consider whether that
>>>> is the right decision. And in any case, we would be able to keep 2.5 and
>>>> 3.0 compatible, which is the main goal.
>>>>
>>>> On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>> How would you not make incompatible changes in 3.x? As discussed the
>>>> InternalRow API is not stable and needs to change.
>>>>
>>>> On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>> > Making downstream to diverge their implementation heavily between
>>>> > minor versions (say, 2.4 vs 2.5) wouldn't be a good experience
>>>>
>>>> You're right that the API has been evolving in the 2.x line. But, it is
>>>> now reasonably stable with respect to the current feature set and we should
>>>> not need to break compatibility in the 3.x line. Because we have reached
>>>> our goals for the 3.0 release, we can backport at least those features to
>>>> 2.x and confidently have an API that works in both a 2.x release and is
>>>> compatible with 3.0, if not 3.1 and later releases as well.
>>>>
>>>> > I'd rather say preparation of Spark 2.5 should be started after Spark
>>>> > 3.0 is officially released
>>>>
>>>> The reason I'm suggesting this is that I'm already going to do the work
>>>> to backport the 3.0 release features to 2.4. I've been asked by several
>>>> people when DSv2 will be released, so I know there is a lot of interest in
>>>> making this available sooner than 3.0. If I'm already doing the work, then
>>>> I'd be happy to share that with the community.
>>>>
>>>> I don't see why 2.5 and 3.0 are mutually exclusive.
>>>> We can work on 2.5 while preparing the 3.0 preview and fixing bugs.
>>>> For DSv2, the work is about complete so we can easily release the same
>>>> set of features and API in 2.5 and 3.0.
>>>>
>>>> If we decide for some reason to wait until after 3.0 is released, I
>>>> don't know that there is much value in a 2.5. The purpose is to be a step
>>>> toward 3.0, and releasing that step after 3.0 doesn't seem helpful to me.
>>>> It also wouldn't get these features out any sooner than 3.0, as a 2.5
>>>> release probably would, given the work needed to validate the incompatible
>>>> changes in 3.0.
>>>>
>>>> > DSv2 change would be the major backward incompatibility which Spark
>>>> > 2.x users may hesitate to upgrade
>>>>
>>>> As I pointed out, DSv2 has been changing in the 2.x line, so this is
>>>> expected. I don't think it will need incompatible changes in the 3.x line.
>>>>
>>>> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim <kabh...@gmail.com> wrote:
>>>>
>>>> Just 2 cents, I haven't tracked the change of DSv2 (though I needed to
>>>> deal with this as the change caused confusion on my PRs...), but my bet is
>>>> that DSv2 has already changed in an incompatible way, at least for those
>>>> who work on custom DataSources. Making downstream to diverge their
>>>> implementation heavily between minor versions (say, 2.4 vs 2.5) wouldn't
>>>> be a good experience - especially as we have not completely closed off the
>>>> chance to further modify DSv2, and the change could be backward
>>>> incompatible.
>>>>
>>>> If we really want to bring the DSv2 change to the 2.x version line so that
>>>> end users are not forced to upgrade to Spark 3.x to enjoy the new DSv2,
>>>> I'd rather say preparation of Spark 2.5 should be started after Spark 3.0
>>>> is officially released, honestly even later than that, say, getting some
>>>> reports from Spark 3.0 about DSv2 so that we feel DSv2 is OK.
>>>> I hope we don't make Spark 2.5 be a kind of "tech-preview" which Spark
>>>> 2.4 users may be frustrated to upgrade to as the next minor version.
>>>>
>>>> Btw, do we have any specific target users for this? Personally DSv2
>>>> change would be the major backward incompatibility which Spark 2.x users
>>>> may hesitate to upgrade, so they might be already prepared to migrate to
>>>> Spark 3.0 if they are prepared to migrate to new DSv2.
>>>>
>>>> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>> wrote:
>>>>
>>>> Do you mean you want to have a breaking API change between 3.0 and 3.1?
>>>> I believe we follow Semantic Versioning (
>>>> https://spark.apache.org/versioning-policy.html ).
>>>>
>>>> > We just won’t add any breaking changes before 3.1.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>> > I don’t think we need to gate a 3.0 release on making a more stable
>>>> > version of InternalRow
>>>>
>>>> Sounds like we agree, then. We will use it for 3.0, but there are known
>>>> problems with it.
>>>>
>>>> > Thinking we’d have dsv2 working in both 3.x (which will change and
>>>> > progress towards more stable, but will have to break certain APIs) and
>>>> > 2.x seems like a false premise.
>>>>
>>>> Why do you think we will need to break certain APIs before 3.0?
>>>>
>>>> I’m only suggesting that we release the same support in a 2.5 release
>>>> that we do in 3.0. Since we are nearly finished with the 3.0 goals, it
>>>> seems like we can certainly do that. We just won’t add any breaking changes
>>>> before 3.1.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>> I don't think we need to gate a 3.0 release on making a more stable
>>>> version of InternalRow, but thinking we'd have dsv2 working in both 3.x
>>>> (which will change and progress towards more stable, but will have to break
>>>> certain APIs) and 2.x seems like a false premise.
>>>>
>>>> To point out some problems with InternalRow that you think are already
>>>> pragmatic and stable:
>>>>
>>>> The class is in catalyst, which states:
>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>>
>>>> /**
>>>>  * Catalyst is a library for manipulating relational query plans. All classes in catalyst are
>>>>  * considered an internal API to Spark SQL and are subject to change between minor releases.
>>>>  */
>>>>
>>>> There is not even any annotation on the interface.
>>>>
>>>> The entire dependency chain was created to be private, and tightly
>>>> coupled with internal implementations. For example,
>>>>
>>>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>>>
>>>> /**
>>>>  * A UTF-8 String for internal Spark use.
>>>>  * <p>
>>>>  * A String encoded in UTF-8 as an Array[Byte], which can be used for comparison,
>>>>  * search, see http://en.wikipedia.org/wiki/UTF-8 for details.
>>>>  * <p>
>>>>  * Note: This is not designed for general use cases, should not be used outside SQL.
>>>>  */
>>>>
>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>>>>
>>>> (which again is in catalyst package)
>>>>
>>>> If you want to argue this way, you might as well argue we should make
>>>> the entire catalyst package public to be pragmatic and not allow any
>>>> changes.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>> > When you created the PR to make InternalRow public
>>>>
>>>> This isn’t quite accurate. The change I made was to use InternalRow
>>>> instead of UnsafeRow, which is a specific implementation of InternalRow.
>>>> Exposing this API has always been a part of DSv2 and while both you and I
>>>> did some work to avoid this, we are still in the phase of starting with
>>>> that API.
>>>>
>>>> Note that any change to InternalRow would be very costly to implement
>>>> because this interface is widely used. That is why I think we can certainly
>>>> consider it stable enough to use here, and that’s probably why
>>>> UnsafeRow was part of the original proposal.
>>>>
>>>> In any case, the goal for 3.0 was not to replace the use of InternalRow,
>>>> it was to get the majority of SQL working on top of the interface added
>>>> after 2.4. That’s done and stable, so I think a 2.5 release with it is also
>>>> reasonable.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>> To push back, while I agree we should not drastically change
>>>> "InternalRow", there are a lot of changes that need to happen to make it
>>>> stable. For example, none of the publicly exposed interfaces should be in
>>>> the Catalyst package or the unsafe package. External implementations should
>>>> be decoupled from the internal implementations, with cheap ways to convert
>>>> back and forth.
>>>>
>>>> When you created the PR to make InternalRow public, the understanding
>>>> was to work towards making it stable in the future, assuming we will start
>>>> with an unstable API temporarily. You can't just make a bunch of internal
>>>> APIs tightly coupled with other internal pieces public and stable and call
>>>> it a day, just because it happens to satisfy some use cases temporarily
>>>> assuming the rest of Spark doesn't change.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>> > DSv2 is far from stable right?
>>>>
>>>> No, I think it is reasonably stable and very close to being ready for a
>>>> release.
>>>>
>>>> > All the actual data types are unstable and you guys have completely
>>>> > ignored that.
>>>>
>>>> I think what you're referring to is the use of `InternalRow`. That's a
>>>> stable API and there has been no work to avoid using it. In any case, I
>>>> don't think that anyone is suggesting that we delay 3.0 until a replacement
>>>> for `InternalRow` is added, right?
>>>>
>>>> While I understand the motivation for a better solution here, I think
>>>> the pragmatic solution is to continue using `InternalRow`.
>>>>
>>>> > If the goal is to make DSv2 work across 3.x and 2.x, that seems too
>>>> > invasive of a change to backport once you consider the parts needed to
>>>> > make dsv2 stable.
>>>>
>>>> I believe that those of us working on DSv2 are confident about the
>>>> current stability. We set goals for what to get into the 3.0 release months
>>>> ago and have very nearly reached the point where we are ready for that
>>>> release.
>>>>
>>>> I don't think instability would be a problem in maintaining
>>>> compatibility between the 2.5 version and the 3.0 version. If we find that
>>>> we need to make API changes (other than additions) then we can make those
>>>> in the 3.1 release. Because the goals we set for the 3.0 release have been
>>>> reached with the current API and if we are ready to release 3.0, we can
>>>> release a 2.5 with the same API.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>> DSv2 is far from stable right? All the actual data types are unstable
>>>> and you guys have completely ignored that. We'd need to work on that and
>>>> that will be a breaking change.
>>>> If the goal is to make DSv2 work across 3.x and 2.x, that seems too
>>>> invasive of a change to backport once you consider the parts needed to
>>>> make dsv2 stable.
>>>>
>>>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> In the DSv2 sync this week, we talked about a possible Spark 2.5
>>>> release based on the latest Spark 2.4, but with DSv2 and Java 11 support
>>>> added.
>>>>
>>>> A Spark 2.5 release with these two additions will help people migrate
>>>> to Spark 3.0 when it is released because they will be able to use a single
>>>> implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly,
>>>> upgrading to 3.0 won't also require updating to Java 11 because users
>>>> could update to Java 11 with the 2.5 release and have fewer major changes.
>>>>
>>>> Another reason to consider a 2.5 release is that many people are
>>>> interested in a release with the latest DSv2 API and support for DSv2 SQL.
>>>> I'm already going to be backporting DSv2 support to the Spark 2.4 line, so
>>>> it makes sense to share this work with the community.
>>>>
>>>> This release line would just consist of backports like DSv2 and Java 11
>>>> that assist compatibility, to keep the scope of the release small. The
>>>> purpose is to assist people moving to 3.0 and not distract from the 3.0
>>>> release.
>>>>
>>>> Would a Spark 2.5 release help anyone else? Are there any concerns
>>>> about this plan?
>>>>
>>>> rb
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>> --
>>>> Name : Jungtaek Lim
>>>> Blog : http://medium.com/@heartsavior
>>>> Twitter : http://twitter.com/heartsavior
>>>> LinkedIn : http://www.linkedin.com/in/heartsavior