The guidance sounds fine, if the general message is 'keep it simple'. The right approach may be pretty situational, though. For example, RDD has a lot of methods that need a Java variant. Putting all of the overloads on one class might be harder to navigate than collecting those methods on a separate return type, JavaRDD. (Also recall that all of the overloads show up in docs and auto-complete, for example, which means both Java and Scala users have to pick out which one is appropriate.)
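To make the contrast concrete, here's a rough sketch of the two shapes. The class names are made up, I'm using a differently-named Java variant rather than a true parameter overload, and these aren't the actual RDD/JavaRDD signatures:

    import java.util.{List => JList}
    import scala.collection.JavaConverters._  // Scala 2.12-era converters

    // Everything on one class: the Scala-friendly and Java-friendly methods
    // both show up for every user in docs and auto-complete.
    class SimpleRDD[T](private val data: Seq[T]) {
      def collect(): Seq[T] = data                 // Scala-friendly
      def collectAsList(): JList[T] = data.asJava  // Java-friendly variant on the same class
    }

    // Separate Java-facing wrapper, in the spirit of JavaRDD wrapping RDD:
    // each audience only ever sees the return types meant for it.
    class JavaSimpleRDD[T](val rdd: SimpleRDD[T]) {
      def collect(): JList[T] = rdd.collect().asJava
    }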
On Mon, Apr 27, 2020 at 4:04 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> Hi all,
>
> I would like to discuss Java-specific APIs and which design we will choose.
> This has been discussed in multiple places so far, for example at
> https://github.com/apache/spark/pull/28085#discussion_r407334754
>
>
> The problem:
>
> In short, I would like us to have clear guidance on how we support Java-specific
> APIs when they need to return a Java instance. The problem is simple:
>
> def requests: Map[String, ExecutorResourceRequest] = ...
> def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...
>
> vs
>
> def requests: java.util.Map[String, ExecutorResourceRequest] = ...
>
>
> Current codebase:
>
> My understanding so far was that the latter is preferred, and more consistent and
> prevailing in the existing codebase; for example, see StateOperatorProgress and
> StreamingQueryProgress in Structured Streaming.
> However, I realised that we also have other approaches in the current codebase.
> There appear to be four approaches to dealing with Java specifics in general:
>
> 1. Java-specific classes such as JavaRDD and JavaSparkContext.
> 2. Java-specific methods with the same name that overload their parameters; see
>    functions.scala.
> 3. Java-specific methods with a different name that need to return a different
>    type, such as TaskContext.resourcesJMap vs TaskContext.resources.
> 4. One method that returns a Java instance for both the Scala and Java sides;
>    see StateOperatorProgress and StreamingQueryProgress.
>
>
> Analysis of the current codebase:
>
> I agree with approach 2 because the corresponding cases give you consistent API
> usage across the other language APIs in general. Approach 1 is from the old world,
> when we didn't have unified APIs. This might be the worst approach.
>
> Approaches 3 and 4 are controversial.
>
> For 3, if you have to use the Java APIs, you have to check every time whether
> there is a Java-specific variant of the API. But yes, it gives you Java/Scala
> friendly instances.
>
> For 4, having one API that returns a Java instance lets you use it on both the
> Scala and Java sides, although it makes you call asScala on the Scala side
> specifically. But you don't have to search for a variant of the API, and it
> gives you consistent API usage across languages.
>
> Also, note that calling Java from Scala is legitimate, but the opposite is not,
> to the best of my knowledge. In addition, you need a method that returns a Java
> instance anyway in order to support PySpark or SparkR.
>
>
> Proposal:
>
> I would like general guidance on this that the Spark dev community agrees upon:
> take approach 4. If that is not possible, take 3. Avoid 1 at almost all costs.
>
> Note that this isn't a hard requirement but general guidance; therefore, the
> decision may be up to the specific context. For example, when there are strong
> arguments for a separate Java-specific API, that's fine.
> Of course, we won't change the existing methods, given Michael's rubric added
> before. I am talking about new methods in unreleased branches.
>
> Any concern or opinion on this?
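For reference, roughly what 3. and 4. look like for a returned map, sketched with a hypothetical class and a plain Long value instead of the actual resource-request types:

    import java.util.{Map => JMap}
    import scala.collection.JavaConverters._

    // 3. Two methods, one per language; Java callers have to know to look for
    //    the *JMap variant.
    class RequestsOption3(private val reqs: Map[String, Long]) {
      def requests: Map[String, Long] = reqs
      def requestsJMap: JMap[String, Long] = reqs.asJava
    }

    // 4. One method returning a Java map for everyone; Scala callers add
    //    .asScala where they want a Scala collection.
    class RequestsOption4(private val reqs: Map[String, Long]) {
      def requests: JMap[String, Long] = reqs.asJava
    }

    // Scala usage under 4.:
    //   val m = new RequestsOption4(Map("gpu" -> 2L)).requests.asScala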