The guidance sounds fine, if the general message is 'keep it simple'.
The right approach might be pretty situational. For example, RDD has a
lot of methods that need a Java variant. Putting all the overloads in
one class might be harder to navigate than making a separate return
type, JavaRDD, that carries those methods. (Also recall that all of the
overloads show up in docs and auto-complete, which means both Java and
Scala users have to pick out which one is appropriate.)
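
As a rough sketch (hypothetical classes, not the actual RDD/JavaRDD source),
the two shapes being contrasted look something like this:

  // (a) One class carrying both variants; docs and auto-complete show both
  //     signatures to every user, Scala or Java.
  class MyRDD[T] {
    def map[U](f: T => U): MyRDD[U] = new MyRDD[U]
    def map[U](f: java.util.function.Function[T, U]): MyRDD[U] = new MyRDD[U]
  }

  // (b) A dedicated Java-facing wrapper, so each class's surface stays small,
  //     at the cost of maintaining a second type.
  class JavaMyRDD[T](rdd: MyRDD[T]) {
    def map[U](f: java.util.function.Function[T, U]): JavaMyRDD[U] =
      new JavaMyRDD(rdd.map(f))  // delegates to the Java-friendly overload
  }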

On Mon, Apr 27, 2020 at 4:04 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> Hi all,
>
> I would like to discuss Java-specific APIs and which design we will choose.
> This has been discussed in multiple places so far, for example, at
> https://github.com/apache/spark/pull/28085#discussion_r407334754
>
>
> The problem:
>
> In short, I would like us to have clear guidance on how we support Java-specific APIs
> when they have to return a Java instance. The problem is simple:
>
> def requests: Map[String, ExecutorResourceRequest] = ...
> def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...
>
> vs
>
> def requests: java.util.Map[String, ExecutorResourceRequest] = ...
>
>
> Current codebase:
>
> My understanding so far was that the latter is preferred, and is more consistent with
> and prevalent in the existing codebase; for example, see StateOperatorProgress and
> StreamingQueryProgress in Structured Streaming.
> However, I realised that we also have other approaches in the current
> codebase. There appear to be four approaches to dealing with Java specifics in
> general (each is sketched after the list below):
>
> 1. Java-specific classes, such as JavaRDD and JavaSparkContext.
> 2. Java-specific methods with the same name that overload their parameters; see
> functions.scala.
> 3. Java-specific methods with a different name that need to return a different
> type, such as TaskContext.resourcesJMap vs. TaskContext.resources.
> 4. One method that returns a Java instance for both the Scala and Java sides; see
> StateOperatorProgress and StreamingQueryProgress.
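>
> To make the four shapes concrete, here is a rough, hypothetical sketch (the
> "peers" API and the surrounding types are made up for illustration, not actual
> Spark code). Note also that the JVM cannot overload on the return type alone,
> which is why approach 2 does not help when only the returned instance differs:
>
>   // 1. Separate Java-specific wrapper class (the JavaRDD style).
>   class Peers {
>     def peers: Map[String, Int] = Map.empty
>   }
>   class JavaPeers(p: Peers) {
>     import scala.collection.JavaConverters._
>     def peers: java.util.Map[String, Int] = p.peers.asJava
>   }
>
>   // 2. Same name, overloaded parameters (only possible when the inputs differ).
>   object Transforms {
>     def transform(f: String => String): Unit = ()
>     def transform(f: java.util.function.UnaryOperator[String]): Unit = ()
>   }
>
>   // 3. A differently named method that returns the Java type.
>   trait PeersApi3 {
>     def peers: Map[String, Int]
>     def peersJMap: java.util.Map[String, Int]
>   }
>
>   // 4. One method that returns a Java instance for both Scala and Java callers.
>   trait PeersApi4 {
>     def peers: java.util.Map[String, Int]
>   }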
>
>
> Analysis on the current codebase:
>
> I agree with approach 2 because it gives you consistent API usage across
> the other language APIs in general. Approach 1 is from the old world, before we
> had unified APIs.
> It might be the worst approach.
>
> Approaches 3 and 4 are the controversial ones.
>
> For approach 3, if you have to use the Java APIs, you have to check every time
> whether there is a Java-specific variant of the API you want. But yes, it gives
> you Java- and Scala-friendly instances on each side.
>
> For approach 4, having one API that returns a Java instance lets you use it from
> both the Scala and Java sides, although it makes you call asScala specifically on
> the Scala side. But you don't have to search for a separate variant of the API,
> and it gives you consistent API usage across languages.
>
> Also, note that calling Java from Scala is legitimate, but the opposite is not,
> to the best of my knowledge.
> In addition, a method that returns a Java instance is needed anyway to support
> PySpark and SparkR.
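>
> As a caller-side sketch of approach 4 (hypothetical names; "ResourceHolder" and
> its "requests" method only stand in for the example signature above and are not
> a confirmed Spark API), the Scala side pays exactly one asScala call:
>
>   import scala.collection.JavaConverters._
>
>   // A holder following approach 4: a single method returning a Java map.
>   class ResourceHolder {
>     def requests: java.util.Map[String, String] =
>       new java.util.HashMap[String, String]()
>   }
>
>   object Approach4Usage {
>     val holder = new ResourceHolder
>     // Java callers consume holder.requests directly as a java.util.Map;
>     // Scala callers convert once when they want a Scala collection:
>     val scalaRequests: Map[String, String] = holder.requests.asScala.toMap
>   }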
>
>
> Proposal:
>
> I would like to have general guidance on this that the Spark devs agree
> upon: use approach 4. If that is not possible, use approach 3. Avoid approach 1
> at almost all costs.
>
> Note that this isn't a hard requirement but general guidance; the decision may
> still depend on
> the specific context. For example, when there are strong arguments for
> a separate Java-specific API, that's fine.
> Of course, we won't change the existing methods, given Michael's rubric added
> before. I am only talking about new
> methods in unreleased branches.
>
> Any concern or opinion on this?
