IIUC we are moving away from having two separate classes for Java and Scala,
like JavaRDD and RDD. A single class is much simpler to maintain and use.

I don't have a strong preference between options 3 and 4. We may need to
collect more data points from actual users.

On Mon, Apr 27, 2020 at 9:50 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Scala users arguably outnumber Java users, yes. Using Java instances on the
> Scala side is legitimate, and they are already used in multiple places. I
> don't believe Scala users find this un-Scala-friendly, since it's legitimate
> and already common. I personally find it more troublesome to make Java
> users search for which APIs to call. Yes, I understand the pros and cons -
> we should also find the balance considering the actual usage.
>
> One more argument from me, though: I think one of the goals of the Spark
> APIs is a unified API set, to my knowledge,
>  e.g., JavaRDD <> RDD vs DataFrame.
> If neither way is particularly preferred over the other, I would just
> choose the one that keeps the API set unified.
>
>
>
> On Mon, Apr 27, 2020 at 10:37 PM, Tom Graves <tgraves...@yahoo.com> wrote:
>
>> I agree that general guidance is good so we keep the APIs consistent. I
>> don't necessarily agree that 4 is the best solution, though. I agree it's
>> nice to have one API, but it is less friendly for the Scala side.
>> Searching for the equivalent Java API shouldn't be hard, as the name should
>> be very close, and if we make it a general rule users should understand it.
>> One good question is which API most of our users use, Java or Scala, and
>> what the ratio is. I don't know the answer to that; I've seen more people
>> using Scala than Java. If the majority use Scala, then I think the API
>> should be friendlier to Scala.
>>
>> Tom
>>
>> On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon <
>> gurwls...@gmail.com> wrote:
>>
>>
>> Hi all,
>>
>> I would like to discuss Java-specific APIs and which design we should
>> choose.
>> This has been discussed in multiple places so far, for example, at
>> https://github.com/apache/spark/pull/28085#discussion_r407334754
>>
>>
>> *The problem:*
>>
>> In short, I would like us to have clear guidance on how we support
>> Java-specific APIs when an API needs to return a Java instance. The
>> problem is simple:
>>
>> def requests: Map[String, ExecutorResourceRequest] = ...
>> def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...
>>
>> vs
>>
>> def requests: java.util.Map[String, ExecutorResourceRequest] = ...
>>
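>> (For illustration only, a rough sketch of the call sites under the first
>> design; `rp` is a hypothetical value exposing these methods, not actual
>> Spark API:)
>>
>> // Scala callers use the Scala-friendly method:
>> val scalaRequests: Map[String, ExecutorResourceRequest] = rp.requests
>> // Java callers call the Java-specific variant, rp.requestsJMap(), and get
>> // a java.util.Map back directly.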
>>
>> *Current codebase:*
>>
>> My understanding so far was that the latter is preferred and is more
>> consistent and prevalent in the existing codebase; for example, see
>> StateOperatorProgress and StreamingQueryProgress in Structured Streaming.
>> However, I realised that we also have other approaches in the current
>> codebase. There appear to be four approaches to dealing with Java
>> specifics in general:
>>
>>    1. Java-specific classes such as JavaRDD and JavaSparkContext.
>>    2. Java-specific methods with the same name that overload their
>>    parameters; see functions.scala (and the sketch after this list).
>>    3. Java-specific methods with a different name that return a
>>    different type, such as TaskContext.resourcesJMap vs
>>    TaskContext.resources.
>>    4. One method that returns a Java instance for both the Scala and Java
>>    sides; see StateOperatorProgress and StreamingQueryProgress.
>>
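>> A rough sketch of approach 2, with hypothetical method and type names
>> (e.g. `withRequests`, `Builder`) rather than actual Spark API:
>>
>> import scala.collection.JavaConverters._
>>
>> // Scala-friendly variant:
>> def withRequests(requests: Map[String, ExecutorResourceRequest]): Builder = ...
>> // Java-friendly overload with the same name, delegating to the Scala one:
>> def withRequests(requests: java.util.Map[String, ExecutorResourceRequest]): Builder =
>>   withRequests(requests.asScala.toMap)
>>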
>>
>> *Analysis on the current codebase:*
>>
>> I agree with approach 2 because the corresponding cases give you
>> consistent API usage across the other language APIs in general. Approach 1
>> is from the old world, before we had unified APIs; it might be the worst
>> approach.
>>
>> Approaches 3 and 4 are the controversial ones.
>>
>> For 3, if you have to use the Java APIs, you have to check every time
>> whether there is a Java-specific variant of the API you want to call. But
>> yes, it gives you Java- and Scala-friendly instances.
>>
>> For 4, having one API that returns a Java instance lets you use it from
>> both the Scala and Java sides, although it requires calling asScala on the
>> Scala side specifically. But you don't
>> have to search for a variant of the API, and it gives you consistent API
>> usage across languages.
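>>
>> (For illustration only, a rough sketch of that Scala-side conversion under
>> approach 4, again with a hypothetical `rp` value:)
>>
>> import scala.collection.JavaConverters._
>>
>> // `requests` returns java.util.Map, so Scala callers convert explicitly:
>> val requests: Map[String, ExecutorResourceRequest] = rp.requests.asScala.toMap
>> // Java callers use the returned java.util.Map directly.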
>>
>> Also, note that calling Java from Scala is legitimate, but the opposite is
>> not, to the best of my knowledge.
>> In addition, a method that returns a Java instance is needed anyway to
>> support PySpark and SparkR.
>>
>>
>> *Proposal:*
>>
>> I would like to have general guidance on this that the Spark dev community
>> agrees upon: use approach 4. If that's not possible, use approach 3. Avoid
>> approach 1 at almost all cost.
>>
>> Note that this isn't a hard requirement but *general guidance*; therefore,
>> the decision may depend on
>> the specific context. For example, when there are strong arguments
>> for a separate Java-specific API, that's fine.
>> Of course, we won't change the existing methods, given Michael's rubric
>> added before. I am talking about new
>> methods in unreleased branches.
>>
>> Any concern or opinion on this?
>>
>
