The problem is that calling Scala instances from the Java side is generally discouraged, to the best of my knowledge. A Java user is unlikely to know Scala's asJava, but a Scala user will likely know both asScala and asJava.
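For concreteness, here is a minimal sketch of that asymmetry, assuming Scala 2.13's scala.jdk.CollectionConverters (the API and values are hypothetical, not actual Spark signatures):

    import scala.jdk.CollectionConverters._

    object ConversionAsymmetry {
      // A hypothetical Scala API returning a Scala Map.
      def scalaApi: Map[String, Int] = Map("cores" -> 4)

      def main(args: Array[String]): Unit = {
        // For a Scala user, converting in either direction is routine:
        val javaMap  = scalaApi.asJava   // Scala -> Java
        val scalaMap = javaMap.asScala   // Java -> Scala
        println(scalaMap("cores"))

        // A Java caller consuming scalaApi would instead have to discover
        // Scala's Java-facing converters, e.g.:
        //   scala.jdk.javaapi.CollectionConverters.asJava(scalaApi)
      }
    }

Either direction is a one-line conversion for a Scala user; the question is which side is forced to discover the converter.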
On Tue, Apr 28, 2020 at 11:35 AM, ZHANG Wei <wezh...@outlook.com> wrote:

How about making a small change to option 4: keep the Scala API returning a Scala type instance, while providing an `asJava` method that returns a Java type instance.

Scala 2.13 provides CollectionConverters [1][2][3], which will be supported naturally in the upcoming Spark dependency upgrade. For the current Scala 2.12 version, we can wrap `ImplicitConversionsToJava` [4] the way Scala 2.13 does and add the implicit conversions.

Just my 2 cents.

--
Cheers,
-z

[1] https://docs.scala-lang.org/overviews/collections-2.13/conversions-between-java-and-scala-collections.html
[2] https://www.scala-lang.org/api/2.13.0/scala/jdk/javaapi/CollectionConverters$.html
[3] https://www.scala-lang.org/api/2.13.0/scala/jdk/CollectionConverters$.html
[4] https://www.scala-lang.org/api/2.12.11/scala/collection/convert/ImplicitConversionsToJava$.html

On Tue, 28 Apr 2020 08:52:57 +0900, Hyukjin Kwon <gurwls...@gmail.com> wrote:

I would like to make sure I am open to other options that can be considered situationally and based on the context. That's okay, and I don't aim to restrict this here. For example, DSv2: I understand it's written in Java because Java interfaces arguably bring better performance. That's why the vectorized readers are written in Java too.

Maybe the "general" wasn't explicit in my previous email. Adding APIs that return a Java instance is still rather rare in general, given my few years of monitoring. The problem I would rather deal with is when we need to add one or a couple of user-facing Java-specific APIs that return Java instances, which is relatively more frequent than needing a whole set of Java-specific APIs.

In this case, I think the guidance should be to use approach 4. There are pros and cons between 3 and 4, of course, but approach 4 looks closer to what Spark has targeted so far.

On Tue, Apr 28, 2020 at 8:34 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> One thing we could do here is use Java collections internally and make
> the Scala API a thin wrapper around Java -- like how Python works.
> Then adding a method to the Scala API would require adding it to the
> Java API and we would keep the two more in sync.

I think that could be an appropriate idea if we had to deal with this case a lot, but I don't think there are many user-facing APIs that return Java collections; it's rather rare. Also, there are relatively fewer Java users than Scala users. This case is slightly different from Python, where there are far more differences to deal with in PySpark.

Also, in the case of `Seq`, we can simply use `Array` instead on both the Scala and Java sides. I don't find such cases notably awkward. The problematic cases might be specific to a few Java collections or instances, and I would like to avoid overkill here.

Of course, wherever there is room to consider other options, let's do so. I don't want to say this is the only acceptable option.

On Tue, Apr 28, 2020 at 1:18 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:

I think the right choice here depends on how the object is used. For developer and internal APIs, I think standardizing on Java collections makes the most sense.
For user-facing APIs, it is awkward to return Java collections to Scala code -- I think that's the motivation for Tom's comment. For user APIs, I think most methods should return Scala collections, and I don't have a strong opinion about whether the conversion (or lack thereof) is done in a separate object (#1) or in parallel methods (#3).

Both #1 and #3 seem like about the same amount of work, and both have the same likelihood that a developer will leave out the Java version of a method. One thing we could do here is use Java collections internally and make the Scala API a thin wrapper around Java -- like how Python works. Then adding a method to the Scala API would require adding it to the Java API, and we would keep the two more in sync. It would also help avoid Scala collections leaking into the internals.

On Mon, Apr 27, 2020 at 8:49 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

Let's stick with the option that takes less maintenance effort, then, rather than leaving it undecided and prolonging this inconsistency.

I don't think we will get very meaningful data about this soon, given that we haven't heard many complaints about it in general so far.

The point of this thread is to make a call rather than defer it to the future.

On Mon, 27 Apr 2020 at 23:15, Wenchen Fan <cloud0...@gmail.com> wrote:

IIUC, we are moving away from having two classes for Java and Scala, like JavaRDD and RDD. It's much simpler to maintain and use a single class.

I don't have a strong preference between options 3 and 4. We may need to collect more data points from actual users.

On Mon, Apr 27, 2020 at 9:50 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

Scala users are arguably more prevalent than Java users, yes. Using Java instances on the Scala side is legitimate, and they are already used in multiple places. I don't believe Scala users find this un-Scala-friendly, as it's legitimate and already in use. I personally find it more troublesome to make Java users search for which APIs to call. Yes, I understand the pros and cons; we should also find the balance considering actual usage.

One more argument from me, though: I think one of the goals of the Spark APIs is a unified API set, to my knowledge, e.g., JavaRDD <> RDD vs. DataFrame. If neither way is particularly preferred over the other, I would just choose the one that keeps the API set unified.

On Mon, Apr 27, 2020 at 10:37 PM, Tom Graves <tgraves...@yahoo.com> wrote:

I agree that general guidance is good so we keep the APIs consistent. I don't necessarily agree that 4 is the best solution, though. I agree it's nice to have one API, but it is less friendly for the Scala side. Searching for the equivalent Java API shouldn't be hard, since the name should be very close, and if we make it a general rule, users should understand it. One good question is: which API do most of our users use, Java or Scala, and what is the ratio? I don't know the answer to that.
I've seen more users on Scala than on Java. If the majority use Scala, then I think the API should be more friendly to that side.

Tom

On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon <gurwls...@gmail.com> wrote:

Hi all,

I would like to discuss Java-specific APIs and which design we will choose. This has been discussed in multiple places so far, for example, at
https://github.com/apache/spark/pull/28085#discussion_r407334754

*The problem:*

In short, I would like us to have clear guidance on how we support Java-specific APIs when they are required to return a Java instance. The problem is simple:

    def requests: Map[String, ExecutorResourceRequest] = ...
    def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...

vs

    def requests: java.util.Map[String, ExecutorResourceRequest] = ...

*Current codebase:*

My understanding so far was that the latter is preferred, more consistent, and prevailing in the existing codebase; for example, see StateOperatorProgress and StreamingQueryProgress in Structured Streaming. However, I realised that we also have other approaches in the current codebase. There appear to be four approaches to dealing with Java specifics in general:

1. Java-specific classes such as JavaRDD and JavaSparkContext.
2. Java-specific methods with the same name that overload their parameters; see functions.scala.
3. Java-specific methods with a different name that return a different type, such as TaskContext.resourcesJMap vs TaskContext.resources.
4. One method that returns a Java instance for both the Scala and Java sides; see StateOperatorProgress and StreamingQueryProgress.

*Analysis on the current codebase:*

I agree with approach 2 because the corresponding cases give you consistent API usage across the other language APIs in general. Approach 1 is from the old world, when we didn't have unified APIs. This might be the worst approach.

Approaches 3 and 4 are controversial.

For 3, if you have to use Java APIs, then you have to check every time whether there is a Java-specific variant of the API. But yes, it gives you Java- and Scala-friendly instances.

For 4, having one API that returns a Java instance lets you use it on both the Scala and Java sides, although it makes you call asScala on the Scala side specifically. But you don't have to check whether there's a variant of the API, and it gives you consistent API usage across languages.

Also, note that calling Java from Scala is legitimate, but the opposite case is not, to the best of my knowledge. In addition, you need a method that returns a Java instance for PySpark or SparkR to be supported.
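For concreteness, a minimal sketch of approaches 3 and 4 side by side, assuming Scala 2.13's scala.jdk.CollectionConverters (Resource is a hypothetical stand-in for ExecutorResourceRequest, not an actual Spark type):

    import scala.jdk.CollectionConverters._

    final case class Resource(amount: Int) // hypothetical stand-in type

    object Approach3 {
      private val underlying = Map("gpu" -> Resource(2))
      // Approach 3: parallel methods with different names and return types.
      def requests: Map[String, Resource] = underlying
      def requestsJMap: java.util.Map[String, Resource] = underlying.asJava
    }

    object Approach4 {
      // Approach 4: one method returning the Java type for both language sides.
      def requests: java.util.Map[String, Resource] =
        java.util.Collections.singletonMap("gpu", Resource(2))
    }

    object Callers {
      def main(args: Array[String]): Unit = {
        val scalaSide = Approach4.requests.asScala // the one asScala call on the Scala side
        println(scalaSide("gpu"))                  // Resource(2)
        // A Java caller uses Approach4.requests directly, with no conversion.
      }
    }

Under approach 4, Java callers (and PySpark or SparkR, via the JVM) share the single method; the cost is that one asScala call on the Scala side.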
*Proposal:*

I would like us to have general guidance on this that the Spark devs agree upon: take approach 4. If that is not possible, take approach 3. Avoid approach 1 at almost all costs.

Note that this isn't a hard requirement but *a general guidance*; therefore, the decision might be up to the specific context. For example, when there are strong arguments for a separate Java-specific API, that's fine. Of course, we won't change the existing methods, given Michael's rubric added before. I am talking about new methods in unreleased branches.

Any concern or opinion on this?

--
Ryan Blue
Software Engineer
Netflix