Re: [DISCUSS] Java specific APIs design concern and choice

2020-05-11 Thread Hyukjin Kwon
Had a short sync with Tom. I am going to postpone this for now since this
case is very unlikely - I have seen it twice in the last five years.
We'll go for a vote if we see this come up more often, and make a decision
based on the feedback in the vote thread.


Re: [DISCUSS] Java specific APIs design concern and choice

2020-05-11 Thread Hyukjin Kwon
The guide is our official guide; see "Code Style Guide" at
http://spark.apache.org/contributing.html.
As I said, this is general guidance, not a hard, strict policy. I don't
aim to change existing APIs either.
I would rather not start a vote while there is a clear objection to
address, Tom. I would like to address it first.

> So as I've already stated and it looks like 2 others have issues with
> number 4 as written as well, I'm against you posting this as is.  I do not
> think we should recommend 4 for public user facing Scala API

Your main arguments seem to be Scala/Java friendliness (and that the Java
user base is smaller than the Scala one).
The first argument is not quite correct, because using Java types is covered
in the official Scala guide. As I mentioned, it is not awkward if we use
`Array` for both Scala and Java, for example.
Such cases are very few, and it seems best to stick to what Spark has done:
support a single API for both Scala and Java.
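
For illustration, a minimal sketch of the `Array` case (the `CatalogInfo`
class and `columnNames` method are hypothetical, not actual Spark APIs):

    // One shared method: Scala's Array[String] compiles to String[] on the
    // JVM, so the same signature reads naturally from both Scala and Java.
    class CatalogInfo {
      def columnNames(): Array[String] = Array("id", "name", "value")
    }

A Java caller can then write `String[] names = info.columnNames();` with no
converter involved.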



Re: [DISCUSS] Java specific APIs design concern and choice

2020-05-11 Thread Tom Graves
As I've already stated, and it looks like two others have issues with number 4
as written as well, I'm against you posting this as is. I do not think we
should recommend 4 for the public user-facing Scala API.
Also note the page you linked is a Databricks page; while I know we reference
it as a style guide, I do not believe we should be putting API policy on that
page. It should live on an Apache Spark page.
I think if you want to implement an API policy like this, it should go through
an official vote thread, not just a discuss thread where we have not had a lot
of feedback.
Tom



Re: [DISCUSS] Java specific APIs design concern and choice

2020-05-11 Thread Hyukjin Kwon
I will wait a couple more days, and if I hear no objection, I will document
this at
https://github.com/databricks/scala-style-guide#java-interoperability.


Re: [DISCUSS] Java specific APIs design concern and choice

2020-05-07 Thread Hyukjin Kwon
Hi all, I would like to proceed with this. Are there more thoughts? If not, I
will go ahead with the proposal here.


Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-30 Thread Hyukjin Kwon
Nothing is urgent. I just don't want to leave it undecided and keep adding
Java APIs inconsistently, as is currently happening.

We should have a set of coherent APIs. It's very difficult to change APIs
once they are out in releases. I believe I have seen people here agree with
having general guidance for the same reason at least - please let me know if
I'm reading it wrong.

I don't think we should assume Java programmers know how Scala works with
Java types. Fewer assumptions might be better.

I feel we have the things to consider on the table at this moment, and there
is not much point in waiting indefinitely.

But sure, maybe I am wrong. We can wait for more feedback for a couple of
days.



Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-30 Thread ZHANG Wei
I feel a little pushed... :-) I still don't see why it's urgent to make the
decision now. AFAIK, it's common practice for Java programmers to handle
Scala type conversions themselves when they invoke Scala libraries. I'm not
sure which is the Java programmers' root complaint: the Scala type instances
or the Scala JAR file.
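
As a rough sketch of that practice (the `ConversionSketch` object and
`listPartitions` method are hypothetical, not actual Spark APIs):

    import scala.collection.JavaConverters._  // scala.jdk.CollectionConverters on 2.13

    object ConversionSketch {
      // A Scala API that returns a Scala collection as-is.
      def listPartitions(): Seq[String] = Seq("p0", "p1")

      // The caller converts the Scala result to a Java collection itself;
      // Java code reaches the same converters via scala.collection.JavaConverters
      // on 2.12 or scala.jdk.javaapi.CollectionConverters on 2.13.
      val partitionsForJava: java.util.List[String] = listPartitions().asJava
    }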

My 2 cents.

-- 
Cheers,
-z


Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-29 Thread Hyukjin Kwon
There was a typo in the previous email. I am re-sending:

Hm, I thought you meant you prefer 3 over 4 but don't mind particularly.
I don't mean to wait for more feedback; that looks likely to end in a
deadlock, which would be the worst case.
I was suggesting we pick one way first and stick to it. If we find out
something later, we can discuss changing it then.

Having a separate Java-specific API (option 3):
  - incurs maintenance cost
  - makes users search for the Java counterpart of each API every time
  - looks like the opposite of the unified API set Spark has targeted so far.

I don't completely buy the argument about Scala/Java friendliness, because
using Java instances is already documented in the official Scala
documentation.
Users would still need to check whether we have Java-specific methods for
*some* APIs.
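
To make the trade-off concrete, here is a minimal sketch under my reading of
the two options (the `listTables` methods are hypothetical, not actual Spark
APIs):

    // Option 3 (sketch): a parallel Java-specific method next to each Scala
    // one; both must be documented, tested, and kept in sync forever.
    trait CatalogOption3 {
      def listTables(): Seq[String]
      def listTablesAsJava(): java.util.List[String]
    }

    // Option 4 (sketch): a single method with a Java-friendly return type,
    // callable as-is from both Scala and Java.
    trait CatalogOption4 {
      def listTables(): java.util.List[String]
    }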


Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-29 Thread Hyukjin Kwon
Hm, I thought you meant you prefer 3 over 4 but don't mind particularly.
I don't mean to wait for more feedback; that looks likely to end in a
deadlock, which would be the worst case.
I was suggesting we pick one way first and stick to it. If we find out
something later, we can discuss changing it then.

Having a separate Java-specific API (option 4):
  - incurs maintenance cost
  - makes users search for the Java counterpart of each API every time
  - looks like the opposite of the unified API set Spark has targeted so far.

I don't completely buy the argument about Scala/Java friendliness, because
using Java instances is already documented in the official Scala
documentation.
Users would still need to check whether we have Java-specific methods for
*some* APIs.




Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-29 Thread Tom Graves
Sorry, I'm not sure what your last email means. Does it mean you are putting
it up for a vote, or just waiting to get more feedback? I disagree with saying
option 4 is the rule, but I agree having a general rule makes sense. I think
we need a lot more input to make the rule, as it affects the APIs.
Tom
Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-29 Thread Hyukjin Kwon
I am not seeing explicit objections here; rather, people seem to agree with
the proposal in general.
I would like to step forward rather than leave it deadlocked - the worst
choice here would be to postpone and abandon this discussion and keep the
inconsistency.

I don't currently aim to document this, as the cases are rather rare, and we
haven't really documented the JavaRDD <> RDD vs. DataFrame case either.
Let's keep monitoring and see whether this discussion thread clarifies things
enough in the cases I mentioned.

Let me know if you think differently.



Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-28 Thread Hyukjin Kwon
Spark has targeted a unified API set rather than separate Java classes, to
reduce the maintenance cost, e.g., JavaRDD <> RDD vs. DataFrame. The JavaXXX
classes are more about legacy.

I think it's best to stick to approach 4 in general cases.
Other options might have to be considered in a specific context.
For example, if we *must* add a bunch of Java-specific methods to a
particular class for an unavoidable reason, I would consider having a
Java-specific class.
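
For context, a simplified sketch of the legacy JavaXXX wrapper pattern being
referred to (the `TableScan` classes are hypothetical, not Spark's actual
JavaRDD/RDD code):

    import scala.collection.JavaConverters._

    class TableScan {
      def partitions(): Seq[String] = Seq("p0", "p1")
    }

    // A Java-facing twin that mirrors the Scala class method by method;
    // every addition to TableScan must be replicated here, which is the
    // maintenance cost mentioned above.
    class JavaTableScan(underlying: TableScan) {
      def partitions(): java.util.List[String] = underlying.partitions().asJava
    }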



On Tue, 28 Apr 2020 at 16:38, ZHANG Wei wrote:

> Frankly, I also love pure Java types in the Java API and Scala types in
> the Scala API. :-)
>
> If we don't treat Java as a "FRIEND" of Scala, but just like Python, maybe
> we can adopt the status quo of option 1, the specific Java classes. (But I
> don't like the `Java` prefix, which is redundant when I'm writing a Java
> app - such as JavaRDD; why not distinguish it by package namespace...) A
> specific Java API can also leverage native Java language features as new
> versions arrive.
>
> And given the friendly relationship between Scala and Java, a Java user
> can call the Scala API with the help of `.asScala` or `.asJava` if the
> Java API is not ready, then switch to the Java API when it's well cooked.
>
> The con is more maintenance effort.
>
> My 2 cents.
>
> --
> Cheers,
> -z
>
> On Tue, 28 Apr 2020 12:07:36 +0900
> Hyukjin Kwon  wrote:
>
> > The problem is that calling Scala instances from the Java side is
> > generally discouraged, to the best of my knowledge.
> > A Java user likely won't know asJava in Scala, but a Scala user will
> > likely know both asScala and asJava.
> >
> >
> > 2020년 4월 28일 (화) 오전 11:35, ZHANG Wei 님이 작성:
> >
> > > How about making a small change on option 4:
> > >   Keep the Scala API returning a Scala type instance while providing an
> > >   `asJava` method to return a Java type instance.
> > >
> > > Scala 2.13 provides CollectionConverters [1][2][3], which the upcoming
> > > Spark dependency upgrades can support naturally. For the current
> > > Scala 2.12 version, we can wrap `ImplicitConversionsToJava` [4] as
> > > Scala 2.13 does and add implicit conversions.
> > >
> > > Just my 2 cents.
> > >
> > > --
> > > Cheers,
> > > -z
> > >
> > > [1]
> > > https://docs.scala-lang.org/overviews/collections-2.13/conversions-between-java-and-scala-collections.html
> > > [2]
> > > https://www.scala-lang.org/api/2.13.0/scala/jdk/javaapi/CollectionConverters$.html
> > > [3]
> > > https://www.scala-lang.org/api/2.13.0/scala/jdk/CollectionConverters$.html
> > > [4]
> > > https://www.scala-lang.org/api/2.12.11/scala/collection/convert/ImplicitConversionsToJava$.html
> > >
> > >
> > > On Tue, 28 Apr 2020 08:52:57 +0900
> > > Hyukjin Kwon  wrote:
> > >
> > > > I would like to make sure I am open to other options that can be
> > > > considered situationally and based on the context.
> > > > That's okay, and I don't aim to restrict this here. For example,
> > > > DSv2: I understand it's written in Java because Java
> > > > interfaces arguably bring better performance. That's why vectorized
> > > > readers are written in Java too.
> > > >
> > > > Maybe the "general" wasn't explicit in my previous email. Adding
> > > > APIs to return a Java instance is still
> > > > rather rare in general, given my few years of monitoring.
> > > > The problem I would rather deal with is when we need to
> > > > add one or a couple of user-facing
> > > > Java-specific APIs to return Java instances, which is relatively more
> > > > frequent compared to when we need a bunch
> > > > of Java-specific APIs.
> > > >
> > > > In this case, I think the guidance should be to use approach 4. There
> > > > are pros and cons between 3 and 4, of course.
> > > > But it looks to me that approach 4 is closer to what Spark has
> > > > targeted so far.
> > > >
> > > >
> > > >
> > > > 2020년 4월 28일 (화) 오전 8:34, Hyukjin Kwon 님이 작성:
> > > >
> > > > > > One thing we could do here is use Java collections internally and
> 

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-28 Thread Reynold Xin
The con is much more than just more effort to maintain a parallel API. It
puts the burden for all libraries and library developers to maintain a
parallel API as well. That’s one of the primary reasons we moved away from
this RDD vs JavaRDD approach in the old RDD API.


On Tue, Apr 28, 2020 at 12:38 AM ZHANG Wei  wrote:

> Frankly, I also love the pure Java type in the Java API and the Scala type
> in the Scala API. :-)
>
> If we don't treat Java as a "FRIEND" of Scala, just as Python, maybe we
> can adopt the status of option 1, the specific Java classes. (But I don't
> like the `Java` prefix, which is redundant when I'm coding a Java app,
> such as JavaRDD; why not distinguish it by package namespace...) The
> specific Java API can also leverage some native Java language features
> with new versions.
>
> And given the friendly relationship between Scala and Java, Java users
> can call the Scala API with `.asScala` or `.asJava`'s help if the Java
> API is not ready, then switch to the Java API when it's well cooked.
>
> The con is more effort to maintain.
>
> My 2 cents.
>
> --
> Cheers,
> -z
>
> On Tue, 28 Apr 2020 12:07:36 +0900
> Hyukjin Kwon  wrote:
>
> > The problem is that calling Scala instances from the Java side is
> > generally discouraged, to the best of my knowledge.
> > A Java user likely won't know asJava in Scala, but a Scala user will
> > likely know both asScala and asJava.
> >
> >
> > 2020년 4월 28일 (화) 오전 11:35, ZHANG Wei 님이 작성:
> >
> > > How about making a small change on option 4:
> > >   Keep the Scala API returning a Scala type instance while providing an
> > >   `asJava` method to return a Java type instance.
> > >
> > > Scala 2.13 provides CollectionConverters [1][2][3], which the upcoming
> > > Spark dependency upgrades can support naturally. For the current
> > > Scala 2.12 version, we can wrap `ImplicitConversionsToJava` [4] as
> > > Scala 2.13 does and add implicit conversions.
> > >
> > > Just my 2 cents.
> > >
> > > --
> > > Cheers,
> > > -z
> > >
> > > [1]
> > > https://docs.scala-lang.org/overviews/collections-2.13/conversions-between-java-and-scala-collections.html
> > > [2]
> > > https://www.scala-lang.org/api/2.13.0/scala/jdk/javaapi/CollectionConverters$.html
> > > [3]
> > > https://www.scala-lang.org/api/2.13.0/scala/jdk/CollectionConverters$.html
> > > [4]
> > > https://www.scala-lang.org/api/2.12.11/scala/collection/convert/ImplicitConversionsToJava$.html
> > >
> > >
> > > On Tue, 28 Apr 2020 08:52:57 +0900
> > > Hyukjin Kwon  wrote:
> > >
> > > > I would like to make sure I am open to other options that can be
> > > > considered situationally and based on the context.
> > > > That's okay, and I don't aim to restrict this here. For example,
> > > > DSv2: I understand it's written in Java because Java
> > > > interfaces arguably bring better performance. That's why vectorized
> > > > readers are written in Java too.
> > > >
> > > > Maybe the "general" wasn't explicit in my previous email. Adding
> > > > APIs to return a Java instance is still
> > > > rather rare in general, given my few years of monitoring.
> > > > The problem I would rather deal with is when we need to
> > > > add one or a couple of user-facing
> > > > Java-specific APIs to return Java instances, which is relatively more
> > > > frequent compared to when we need a bunch
> > > > of Java-specific APIs.
> > > >
> > > > In this case, I think the guidance should be to use approach 4. There
> > > > are pros and cons between 3 and 4, of course.
> > > > But it looks to me that approach 4 is closer to what Spark has
> > > > targeted so far.
> > > >
> > > >
> > > >
> > > > 2020년 4월 28일 (화) 오전 8:34, Hyukjin Kwon 님이 작성:
> > > >
> > > > > > One thing we could do here is use Java collections internally and
> > > make
> > > > > the Scala API a thin wrapper around Java -- like how Python works.
> > > > > > Then adding a method to the Scala API would require adding it to
> the
> > > > > Java API and we would keep the two more in 

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-28 Thread ZHANG Wei
Frankly, I also love the pure Java type in the Java API and the Scala type
in the Scala API. :-)

If we don't treat Java as a "FRIEND" of Scala, just as Python, maybe we
can adopt the status of option 1, the specific Java classes. (But I don't
like the `Java` prefix, which is redundant when I'm coding a Java app,
such as JavaRDD; why not distinguish it by package namespace...) The
specific Java API can also leverage some native Java language features
with new versions.

And given the friendly relationship between Scala and Java, Java users
can call the Scala API with `.asScala` or `.asJava`'s help if the Java
API is not ready, then switch to the Java API when it's well cooked.

The con is more effort to maintain.

My 2 cents.

-- 
Cheers,
-z

On Tue, 28 Apr 2020 12:07:36 +0900
Hyukjin Kwon  wrote:

> The problem is that calling Scala instances from the Java side is generally
> discouraged, to the best of my knowledge.
> A Java user likely won't know asJava in Scala, but a Scala user will likely
> know both asScala and asJava.
> 
> 
> 2020년 4월 28일 (화) 오전 11:35, ZHANG Wei 님이 작성:
> 
> > How about making a small change on option 4:
> >   Keep the Scala API returning a Scala type instance while providing an
> >   `asJava` method to return a Java type instance.
> >
> > Scala 2.13 provides CollectionConverters [1][2][3], which the upcoming
> > Spark dependency upgrades can support naturally. For the current
> > Scala 2.12 version, we can wrap `ImplicitConversionsToJava` [4] as
> > Scala 2.13 does and add implicit conversions.
> >
> > Just my 2 cents.
> >
> > --
> > Cheers,
> > -z
> >
> > [1]
> > https://docs.scala-lang.org/overviews/collections-2.13/conversions-between-java-and-scala-collections.html
> > [2]
> > https://www.scala-lang.org/api/2.13.0/scala/jdk/javaapi/CollectionConverters$.html
> > [3]
> > https://www.scala-lang.org/api/2.13.0/scala/jdk/CollectionConverters$.html
> > [4]
> > https://www.scala-lang.org/api/2.12.11/scala/collection/convert/ImplicitConversionsToJava$.html
> >
> >
> > On Tue, 28 Apr 2020 08:52:57 +0900
> > Hyukjin Kwon  wrote:
> >
> > > I would like to make sure I am open to other options that can be
> > > considered situationally and based on the context.
> > > That's okay, and I don't aim to restrict this here. For example,
> > > DSv2: I understand it's written in Java because Java
> > > interfaces arguably bring better performance. That's why vectorized
> > > readers are written in Java too.
> > >
> > > Maybe the "general" wasn't explicit in my previous email. Adding
> > > APIs to return a Java instance is still
> > > rather rare in general, given my few years of monitoring.
> > > The problem I would rather deal with is when we need to
> > > add one or a couple of user-facing
> > > Java-specific APIs to return Java instances, which is relatively more
> > > frequent compared to when we need a bunch
> > > of Java-specific APIs.
> > >
> > > In this case, I think the guidance should be to use approach 4. There
> > > are pros and cons between 3 and 4, of course.
> > > But it looks to me that approach 4 is closer to what Spark has
> > > targeted so far.
> > >
> > >
> > >
> > > 2020년 4월 28일 (화) 오전 8:34, Hyukjin Kwon 님이 작성:
> > >
> > > > > One thing we could do here is use Java collections internally and
> > make
> > > > the Scala API a thin wrapper around Java -- like how Python works.
> > > > > Then adding a method to the Scala API would require adding it to the
> > > > Java API and we would keep the two more in sync.
> > > >
> > > > I think it can be an appropriate idea for when we have to deal with
> > > > this case a lot, but I don't think there are so many
> > > > user-facing APIs that return Java collections; it's rather rare. Also,
> > > > Java users are relatively fewer than Scala users.
> > > > This case is slightly different from Python in that there are so many
> > > > differences to deal with in the PySpark case.
> > > >
> > > > Also, in the case of `Seq`, we can actually just use `Array` instead for

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-27 Thread Hyukjin Kwon
The problem is that calling Scala instances from the Java side is generally
discouraged, to the best of my knowledge.
A Java user likely won't know asJava in Scala, but a Scala user will likely
know both asScala and asJava.
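
For readers less familiar with the converters in question, here is a minimal
sketch of both directions, assuming Scala 2.12's scala.collection.JavaConverters
(Scala 2.13 moves this to scala.jdk.CollectionConverters):

    import scala.collection.JavaConverters._

    object ConversionSketch extends App {
      val scalaMap: Map[String, Int] = Map("cores" -> 4)
      // Scala -> Java: what a Scala author uses when exposing a Java API
      val javaMap: java.util.Map[String, Int] = scalaMap.asJava
      // Java -> Scala: what a Scala caller does with a Java-returning API
      val roundTrip: Map[String, Int] = javaMap.asScala.toMap
      println(roundTrip == scalaMap) // true
    }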


2020년 4월 28일 (화) 오전 11:35, ZHANG Wei 님이 작성:

> How about making a small change on option 4:
>   Keep the Scala API returning a Scala type instance while providing an
>   `asJava` method to return a Java type instance.
>
> Scala 2.13 provides CollectionConverters [1][2][3], which the upcoming
> Spark dependency upgrades can support naturally. For the current
> Scala 2.12 version, we can wrap `ImplicitConversionsToJava` [4] as
> Scala 2.13 does and add implicit conversions.
>
> Just my 2 cents.
>
> --
> Cheers,
> -z
>
> [1]
> https://docs.scala-lang.org/overviews/collections-2.13/conversions-between-java-and-scala-collections.html
> [2]
> https://www.scala-lang.org/api/2.13.0/scala/jdk/javaapi/CollectionConverters$.html
> [3]
> https://www.scala-lang.org/api/2.13.0/scala/jdk/CollectionConverters$.html
> [4]
> https://www.scala-lang.org/api/2.12.11/scala/collection/convert/ImplicitConversionsToJava$.html
>
>
> On Tue, 28 Apr 2020 08:52:57 +0900
> Hyukjin Kwon  wrote:
>
> > I would like to make sure I am open to other options that can be
> > considered situationally and based on the context.
> > That's okay, and I don't aim to restrict this here. For example,
> > DSv2: I understand it's written in Java because Java
> > interfaces arguably bring better performance. That's why vectorized
> > readers are written in Java too.
> >
> > Maybe the "general" wasn't explicit in my previous email. Adding
> > APIs to return a Java instance is still
> > rather rare in general, given my few years of monitoring.
> > The problem I would rather deal with is when we need to
> > add one or a couple of user-facing
> > Java-specific APIs to return Java instances, which is relatively more
> > frequent compared to when we need a bunch
> > of Java-specific APIs.
> >
> > In this case, I think the guidance should be to use approach 4. There
> > are pros and cons between 3 and 4, of course.
> > But it looks to me that approach 4 is closer to what Spark has
> > targeted so far.
> >
> >
> >
> > 2020년 4월 28일 (화) 오전 8:34, Hyukjin Kwon 님이 작성:
> >
> > > > One thing we could do here is use Java collections internally and
> make
> > > the Scala API a thin wrapper around Java -- like how Python works.
> > > > Then adding a method to the Scala API would require adding it to the
> > > Java API and we would keep the two more in sync.
> > >
> > > I think it can be an appropriate idea for when we have to deal with
> > > this case a lot, but I don't think there are so many
> > > user-facing APIs that return Java collections; it's rather rare. Also,
> > > Java users are relatively fewer than Scala users.
> > > This case is slightly different from Python in that there are so many
> > > differences to deal with in the PySpark case.
> > >
> > > Also, in the case of `Seq`, we can actually just use `Array` instead
> > > for both the Scala and Java sides. I don't find such cases notably
> > > awkward.
> > > These problematic cases might be specific to a few Java collections or
> > > instances, and I would like to avoid overkill here.
> > >
> > > Of course, if there is a place to consider other options, let's do so.
> > > I don't mean to say this is the only required option.
> > >
> > >
> > >
> > >
> > >
> > > 2020년 4월 28일 (화) 오전 1:18, Ryan Blue 님이 작성:
> > >
> > >> I think the right choice here depends on how the object is used. For
> > >> developer and internal APIs, I think standardizing on Java collections
> > >> makes the most sense.
> > >>
> > >> For user-facing APIs, it is awkward to return Java collections to
> Scala
> > >> code -- I think that's the motivation for Tom's comment. For user
> APIs, I
> > >> think most methods should return Scala collections, and I don't have a
> > >> strong opinion about whether the conversion (or lack thereof) is done
> in a
> > >> separate object (#1) or in parallel methods (#3).
> > >>
> > >> Both #1 and #3 seem like about the same amount of work and have the
> same
> > >> likelihood that a developer will leave out a Java method version. One
> thing
> > >> we could do here is use Java collections internally and make the
> Scala API
> > >> a thin wrapper around Java -- like how Python works. Then adding a
> method
> > >> to the Scala API would require adding it to the Java API and we would
> keep
> > >> the two more in sync. It would also help avoid Scala collections
> leaking
> > >> into internals.
> > >>
> > >> On Mon, Apr 27, 2020 at 8:49 AM Hyukjin Kwon 
> wrote:
> > >>
> > >>> Let's stick to the option with less maintenance effort then, rather
> > >>> than leaving it undecided and letting this inconsistency linger.
> > >>>
> > >>> I don't think we can get very meaningful data about this soon, given
> > >>> that we haven't heard many complaints about this in general so far.
> > >>>
> > >>> 

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-27 Thread ZHANG Wei
How about making a small change on option 4:
  Keep the Scala API returning a Scala type instance while providing an
  `asJava` method to return a Java type instance.

Scala 2.13 provides CollectionConverters [1][2][3], which the upcoming
Spark dependency upgrades can support naturally. For the current
Scala 2.12 version, we can wrap `ImplicitConversionsToJava` [4] as
Scala 2.13 does and add implicit conversions.
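
A minimal sketch of this variant, using hypothetical names rather than a real
Spark class, and assuming Scala 2.12's JavaConverters: the Scala API keeps its
Scala return type, and an explicit asJava-style method converts on demand for
Java callers:

    import scala.collection.JavaConverters._

    class ExecutorResourceRequest // hypothetical placeholder type

    class ResourceRequests(private val reqs: Map[String, ExecutorResourceRequest]) {
      // Scala-facing API keeps returning the Scala type
      def requests: Map[String, ExecutorResourceRequest] = reqs
      // Explicit Java-facing view, in the spirit of
      // scala.jdk.javaapi.CollectionConverters in 2.13
      def requestsAsJava: java.util.Map[String, ExecutorResourceRequest] =
        reqs.asJava
    }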

Just my 2 cents.

-- 
Cheers,
-z

[1] 
https://docs.scala-lang.org/overviews/collections-2.13/conversions-between-java-and-scala-collections.html
[2] 
https://www.scala-lang.org/api/2.13.0/scala/jdk/javaapi/CollectionConverters$.html
[3] https://www.scala-lang.org/api/2.13.0/scala/jdk/CollectionConverters$.html
[4] 
https://www.scala-lang.org/api/2.12.11/scala/collection/convert/ImplicitConversionsToJava$.html


On Tue, 28 Apr 2020 08:52:57 +0900
Hyukjin Kwon  wrote:

> I would like to make sure I am open to other options that can be
> considered situationally and based on the context.
> That's okay, and I don't aim to restrict this here. For example, DSv2: I
> understand it's written in Java because Java
> interfaces arguably bring better performance. That's why vectorized
> readers are written in Java too.
>
> Maybe the "general" wasn't explicit in my previous email. Adding APIs to
> return a Java instance is still
> rather rare in general, given my few years of monitoring.
> The problem I would rather deal with is when we need to
> add one or a couple of user-facing
> Java-specific APIs to return Java instances, which is relatively more
> frequent compared to when we need a bunch
> of Java-specific APIs.
>
> In this case, I think the guidance should be to use approach 4. There are
> pros and cons between 3 and 4, of course.
> But it looks to me that approach 4 is closer to what Spark has targeted so far.
> 
> 
> 
> 2020년 4월 28일 (화) 오전 8:34, Hyukjin Kwon 님이 작성:
> 
> > > One thing we could do here is use Java collections internally and make
> > the Scala API a thin wrapper around Java -- like how Python works.
> > > Then adding a method to the Scala API would require adding it to the
> > Java API and we would keep the two more in sync.
> >
> > I think it can be an appropriate idea for when we have to deal with this
> > case a lot, but I don't think there are so many
> > user-facing APIs that return Java collections; it's rather rare. Also,
> > Java users are relatively fewer than Scala users.
> > This case is slightly different from Python in that there are so many
> > differences to deal with in the PySpark case.
> >
> > Also, in the case of `Seq`, we can actually just use `Array` instead for
> > both the Scala and Java sides. I don't find such cases notably awkward.
> > These problematic cases might be specific to a few Java collections or
> > instances, and I would like to avoid overkill here.
> >
> > Of course, if there is a place to consider other options, let's do so. I
> > don't mean to say this is the only required option.
> >
> >
> >
> >
> >
> > 2020년 4월 28일 (화) 오전 1:18, Ryan Blue 님이 작성:
> >
> >> I think the right choice here depends on how the object is used. For
> >> developer and internal APIs, I think standardizing on Java collections
> >> makes the most sense.
> >>
> >> For user-facing APIs, it is awkward to return Java collections to Scala
> >> code -- I think that's the motivation for Tom's comment. For user APIs, I
> >> think most methods should return Scala collections, and I don't have a
> >> strong opinion about whether the conversion (or lack thereof) is done in a
> >> separate object (#1) or in parallel methods (#3).
> >>
> >> Both #1 and #3 seem like about the same amount of work and have the same
> >> likelihood that a developer will leave out a Java method version. One thing
> >> we could do here is use Java collections internally and make the Scala API
> >> a thin wrapper around Java -- like how Python works. Then adding a method
> >> to the Scala API would require adding it to the Java API and we would keep
> >> the two more in sync. It would also help avoid Scala collections leaking
> >> into internals.
> >>
> >> On Mon, Apr 27, 2020 at 8:49 AM Hyukjin Kwon  wrote:
> >>
> >>> Let's stick to the option with less maintenance effort then, rather than
> >>> leaving it undecided and letting this inconsistency linger.
> >>>
> >>> I don't think we can get very meaningful data about this soon, given
> >>> that we haven't heard many complaints about this in general so far.
> >>>
> >>> The point of this thread is to make a call rather than defer to the
> >>> future.
> >>>
> >>> On Mon, 27 Apr 2020, 23:15 Wenchen Fan,  wrote:
> >>>
>  IIUC We are moving away from having 2 classes for Java and Scala, like
>  JavaRDD and RDD. It's much simpler to maintain and use with a single 
>  class.
> 
>  I don't have a strong preference over option 3 or 4. We may need to
>  collect more data points from actual users.
> 
>  On Mon, Apr 27, 2020 at 

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-27 Thread Hyukjin Kwon
I would like to make sure I am open to other options that can be
considered situationally and based on the context.
That's okay, and I don't aim to restrict this here. For example, DSv2: I
understand it's written in Java because Java
interfaces arguably bring better performance. That's why vectorized
readers are written in Java too.

Maybe the "general" wasn't explicit in my previous email. Adding APIs to
return a Java instance is still
rather rare in general, given my few years of monitoring.
The problem I would rather deal with is when we need to
add one or a couple of user-facing
Java-specific APIs to return Java instances, which is relatively more
frequent compared to when we need a bunch
of Java-specific APIs.

In this case, I think the guidance should be to use approach 4. There are
pros and cons between 3 and 4, of course.
But it looks to me that approach 4 is closer to what Spark has targeted so far.



2020년 4월 28일 (화) 오전 8:34, Hyukjin Kwon 님이 작성:

> > One thing we could do here is use Java collections internally and make
> > the Scala API a thin wrapper around Java -- like how Python works.
> > Then adding a method to the Scala API would require adding it to the
> > Java API and we would keep the two more in sync.
>
> I think it can be an appropriate idea for when we have to deal with this
> case a lot, but I don't think there are so many
> user-facing APIs that return Java collections; it's rather rare. Also,
> Java users are relatively fewer than Scala users.
> This case is slightly different from Python in that there are so many
> differences to deal with in the PySpark case.
>
> Also, in the case of `Seq`, we can actually just use `Array` instead for
> both the Scala and Java sides. I don't find such cases notably awkward.
> These problematic cases might be specific to a few Java collections or
> instances, and I would like to avoid overkill here.
>
> Of course, if there is a place to consider other options, let's do so. I
> don't mean to say this is the only required option.
>
>
>
>
>
> 2020년 4월 28일 (화) 오전 1:18, Ryan Blue 님이 작성:
>
>> I think the right choice here depends on how the object is used. For
>> developer and internal APIs, I think standardizing on Java collections
>> makes the most sense.
>>
>> For user-facing APIs, it is awkward to return Java collections to Scala
>> code -- I think that's the motivation for Tom's comment. For user APIs, I
>> think most methods should return Scala collections, and I don't have a
>> strong opinion about whether the conversion (or lack thereof) is done in a
>> separate object (#1) or in parallel methods (#3).
>>
>> Both #1 and #3 seem like about the same amount of work and have the same
>> likelihood that a developer will leave out a Java method version. One thing
>> we could do here is use Java collections internally and make the Scala API
>> a thin wrapper around Java -- like how Python works. Then adding a method
>> to the Scala API would require adding it to the Java API and we would keep
>> the two more in sync. It would also help avoid Scala collections leaking
>> into internals.
>>
>> On Mon, Apr 27, 2020 at 8:49 AM Hyukjin Kwon  wrote:
>>
>>> Let's stick to the option with less maintenance effort then, rather than
>>> leaving it undecided and letting this inconsistency linger.
>>>
>>> I don't think we can get very meaningful data about this soon, given
>>> that we haven't heard many complaints about this in general so far.
>>>
>>> The point of this thread is to make a call rather than defer to the
>>> future.
>>>
>>> On Mon, 27 Apr 2020, 23:15 Wenchen Fan,  wrote:
>>>
 IIUC We are moving away from having 2 classes for Java and Scala, like
 JavaRDD and RDD. It's much simpler to maintain and use with a single class.

 I don't have a strong preference over option 3 or 4. We may need to
 collect more data points from actual users.

 On Mon, Apr 27, 2020 at 9:50 PM Hyukjin Kwon 
 wrote:

> Scala users are arguably more prevalent than Java users, yes.
> Using Java instances on the Scala side is legitimate, and they are
> already being used in multiple places. I don't believe Scala
> users find this not Scala-friendly, as it's legitimate and already
> being used. I personally find it more troublesome to make Java
> users search for which APIs to call. Yes, I understand the pros and
> cons - we should also find the balance considering the actual usage.
>
> One more argument from me, though: I think one of the goals of the
> Spark APIs is a unified API set, to my knowledge,
>  e.g., JavaRDD <> RDD vs DataFrame.
> If either way is not particularly preferred over the other, I would
> just choose the one that keeps the API set unified.
>
>
>
> 2020년 4월 27일 (월) 오후 10:37, Tom Graves 님이 작성:
>
>> I agree a general guidance is good so we keep the APIs consistent.
>> I don't necessarily agree that 4 is the best solution though.  I 

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-27 Thread Hyukjin Kwon
> One thing we could do here is use Java collections internally and make
> the Scala API a thin wrapper around Java -- like how Python works.
> Then adding a method to the Scala API would require adding it to the Java
> API and we would keep the two more in sync.

I think it can be an appropriate idea for when we have to deal with this
case a lot, but I don't think there are so many
user-facing APIs that return Java collections; it's rather rare. Also,
Java users are relatively fewer than Scala users.
This case is slightly different from Python in that there are so many
differences to deal with in the PySpark case.

Also, in the case of `Seq`, we can actually just use `Array` instead for
both the Scala and Java sides; a small sketch follows below. I don't find
such cases notably awkward.
These problematic cases might be specific to a few Java collections or
instances, and I would like to avoid overkill here.

Of course, if there is a place to consider other options, let's do so. I
don't mean to say this is the only required option.
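
Here is the small sketch of the `Array` point, with hypothetical names: an
Array[String] surfaces as a plain String[] in Java, so a single method serves
both languages with no conversion:

    class TaskResources(private val names: Array[String]) {
      // Scala: res.resourceNames; Java: String[] n = res.resourceNames();
      def resourceNames: Array[String] = names
    }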





2020년 4월 28일 (화) 오전 1:18, Ryan Blue 님이 작성:

> I think the right choice here depends on how the object is used. For
> developer and internal APIs, I think standardizing on Java collections
> makes the most sense.
>
> For user-facing APIs, it is awkward to return Java collections to Scala
> code -- I think that's the motivation for Tom's comment. For user APIs, I
> think most methods should return Scala collections, and I don't have a
> strong opinion about whether the conversion (or lack thereof) is done in a
> separate object (#1) or in parallel methods (#3).
>
> Both #1 and #3 seem like about the same amount of work and have the same
> likelihood that a developer will leave out a Java method version. One thing
> we could do here is use Java collections internally and make the Scala API
> a thin wrapper around Java -- like how Python works. Then adding a method
> to the Scala API would require adding it to the Java API and we would keep
> the two more in sync. It would also help avoid Scala collections leaking
> into internals.
>
> On Mon, Apr 27, 2020 at 8:49 AM Hyukjin Kwon  wrote:
>
>> Let's stick to the option with less maintenance effort then, rather than
>> leaving it undecided and letting this inconsistency linger.
>>
>> I don't think we can get very meaningful data about this soon, given
>> that we haven't heard many complaints about this in general so far.
>>
>> The point of this thread is to make a call rather than defer to the
>> future.
>>
>> On Mon, 27 Apr 2020, 23:15 Wenchen Fan,  wrote:
>>
>>> IIUC We are moving away from having 2 classes for Java and Scala, like
>>> JavaRDD and RDD. It's much simpler to maintain and use with a single class.
>>>
>>> I don't have a strong preference over option 3 or 4. We may need to
>>> collect more data points from actual users.
>>>
>>> On Mon, Apr 27, 2020 at 9:50 PM Hyukjin Kwon 
>>> wrote:
>>>
 Scala users are arguably more prevalent than Java users, yes.
 Using Java instances on the Scala side is legitimate, and they are
 already being used in multiple places. I don't believe Scala
 users find this not Scala-friendly, as it's legitimate and already being
 used. I personally find it more troublesome to make Java
 users search for which APIs to call. Yes, I understand the pros and cons
 - we should also find the balance considering the actual usage.

 One more argument from me, though: I think one of the goals of the Spark
 APIs is a unified API set, to my knowledge,
  e.g., JavaRDD <> RDD vs DataFrame.
 If either way is not particularly preferred over the other, I would
 just choose the one that keeps the API set unified.



 2020년 4월 27일 (월) 오후 10:37, Tom Graves 님이 작성:

> I agree a general guidance is good so we keep the APIs consistent. I
> don't necessarily agree that 4 is the best solution though. I agree it's
> nice to have one API, but it is less friendly for the Scala side.
> Searching for the equivalent Java api shouldn't be hard as it should be
> very close in the name and if we make it a general rule users should
> understand it.   I guess one good question is what API do most of our 
> users
> use between Java and Scala and what is the ratio?  I don't know the answer
> to that. I've seen more using Scala over Java.  If the majority use Scala
> then I think the API should be more friendly to that.
>
> Tom
>
> On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon <
> gurwls...@gmail.com> wrote:
>
>
> Hi all,
>
> I would like to discuss Java specific APIs and which design we will
> choose.
> This has been discussed in multiple places so far, for example, at
> https://github.com/apache/spark/pull/28085#discussion_r407334754
>
>
> *The problem:*
>
> In short, I would like us to have clear guidance on how we support
> Java specific APIs when
> it requires 

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-27 Thread Ryan Blue
I think the right choice here depends on how the object is used. For
developer and internal APIs, I think standardizing on Java collections
makes the most sense.

For user-facing APIs, it is awkward to return Java collections to Scala
code -- I think that's the motivation for Tom's comment. For user APIs, I
think most methods should return Scala collections, and I don't have a
strong opinion about whether the conversion (or lack thereof) is done in a
separate object (#1) or in parallel methods (#3).

Both #1 and #3 seem like about the same amount of work and have the same
likelihood that a developer will leave out a Java method version. One thing
we could do here is use Java collections internally and make the Scala API
a thin wrapper around Java -- like how Python works. Then adding a method
to the Scala API would require adding it to the Java API and we would keep
the two more in sync. It would also help avoid Scala collections leaking
into internals.
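
A minimal sketch of that thin-wrapper idea, with hypothetical names and
assuming Scala 2.12's JavaConverters: the canonical state is a Java map, and
the Scala accessor is derived from it, so the two methods cannot drift apart:

    import scala.collection.JavaConverters._

    class ProgressLike(private val jMetrics: java.util.Map[String, String]) {
      // Java-facing accessor returns the canonical Java collection
      def metricsJava: java.util.Map[String, String] = jMetrics
      // Scala-facing method is a thin wrapper converting the same data
      def metrics: Map[String, String] = jMetrics.asScala.toMap
    }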

On Mon, Apr 27, 2020 at 8:49 AM Hyukjin Kwon  wrote:

> Let's stick to the option with less maintenance effort then, rather than
> leaving it undecided and letting this inconsistency linger.
>
> I don't think we can get very meaningful data about this soon, given
> that we haven't heard many complaints about this in general so far.
>
> The point of this thread is to make a call rather than defer to the future.
>
> On Mon, 27 Apr 2020, 23:15 Wenchen Fan,  wrote:
>
>> IIUC We are moving away from having 2 classes for Java and Scala, like
>> JavaRDD and RDD. It's much simpler to maintain and use with a single class.
>>
>> I don't have a strong preference over option 3 or 4. We may need to
>> collect more data points from actual users.
>>
>> On Mon, Apr 27, 2020 at 9:50 PM Hyukjin Kwon  wrote:
>>
>>> Scala users are arguably more prevalent than Java users, yes.
>>> Using Java instances on the Scala side is legitimate, and they are
>>> already being used in multiple places. I don't believe Scala
>>> users find this not Scala-friendly, as it's legitimate and already being
>>> used. I personally find it more troublesome to make Java
>>> users search for which APIs to call. Yes, I understand the pros and cons
>>> - we should also find the balance considering the actual usage.
>>>
>>> One more argument from me, though: I think one of the goals of the Spark
>>> APIs is a unified API set, to my knowledge,
>>>  e.g., JavaRDD <> RDD vs DataFrame.
>>> If either way is not particularly preferred over the other, I would just
>>> choose the one that keeps the API set unified.
>>>
>>>
>>>
>>> 2020년 4월 27일 (월) 오후 10:37, Tom Graves 님이 작성:
>>>
 I agree a general guidance is good so we keep the APIs consistent. I
 don't necessarily agree that 4 is the best solution though. I agree it's
 nice to have one API, but it is less friendly for the Scala side.
 Searching for the equivalent Java api shouldn't be hard as it should be
 very close in the name and if we make it a general rule users should
 understand it.   I guess one good question is what API do most of our users
 use between Java and Scala and what is the ratio?  I don't know the answer
 to that. I've seen more using Scala over Java.  If the majority use Scala
 then I think the API should be more friendly to that.

 Tom

 On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon <
 gurwls...@gmail.com> wrote:


 Hi all,

 I would like to discuss Java specific APIs and which design we will
 choose.
 This has been discussed in multiple places so far, for example, at
 https://github.com/apache/spark/pull/28085#discussion_r407334754


 *The problem:*

 In short, I would like us to have clear guidance on how we support Java
 specific APIs when
 they need to return a Java instance. The problem is simple:

 def requests: Map[String, ExecutorResourceRequest] = ...
 def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...

 vs

 def requests: java.util.Map[String, ExecutorResourceRequest] = ...


 *Current codebase:*

 My understanding so far was that the latter is preferred and more
 consistent and prevailing in the
 existing codebase, for example, see StateOperatorProgress and
 StreamingQueryProgress in Structured Streaming.
 However, I realised that we also have other approaches in the current
 codebase. There appear to be
 four approaches to deal with Java specifics in general:

1. Java specific classes such as JavaRDD and JavaSparkContext.
2. Java specific methods with the same name that overload its
parameters, see functions.scala.
3. Java specific methods with a different name that needs to return
a different type such as TaskContext.resourcesJMap vs
TaskContext.resources.
4. One method that returns a Java instance for both Scala and Java
sides. see 

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-27 Thread Hyukjin Kwon
Let's stick to the option with less maintenance effort then, rather than
leaving it undecided and letting this inconsistency linger.

I don't think we can get very meaningful data about this soon, given
that we haven't heard many complaints about this in general so far.

The point of this thread is to make a call rather than defer to the future.

On Mon, 27 Apr 2020, 23:15 Wenchen Fan,  wrote:

> IIUC We are moving away from having 2 classes for Java and Scala, like
> JavaRDD and RDD. It's much simpler to maintain and use with a single class.
>
> I don't have a strong preference over option 3 or 4. We may need to
> collect more data points from actual users.
>
> On Mon, Apr 27, 2020 at 9:50 PM Hyukjin Kwon  wrote:
>
>> Scala users are arguably more prevalent than Java users, yes.
>> Using Java instances on the Scala side is legitimate, and they are
>> already being used in multiple places. I don't believe Scala
>> users find this not Scala-friendly, as it's legitimate and already being
>> used. I personally find it more troublesome to make Java
>> users search for which APIs to call. Yes, I understand the pros and cons -
>> we should also find the balance considering the actual usage.
>>
>> One more argument from me, though: I think one of the goals of the Spark
>> APIs is a unified API set, to my knowledge,
>>  e.g., JavaRDD <> RDD vs DataFrame.
>> If either way is not particularly preferred over the other, I would just
>> choose the one that keeps the API set unified.
>>
>>
>>
>> 2020년 4월 27일 (월) 오후 10:37, Tom Graves 님이 작성:
>>
>>> I agree a general guidance is good so we keep the APIs consistent. I
>>> don't necessarily agree that 4 is the best solution though. I agree it's
>>> nice to have one API, but it is less friendly for the Scala side.
>>> Searching for the equivalent Java api shouldn't be hard as it should be
>>> very close in the name and if we make it a general rule users should
>>> understand it.   I guess one good question is what API do most of our users
>>> use between Java and Scala and what is the ratio?  I don't know the answer
>>> to that. I've seen more using Scala over Java.  If the majority use Scala
>>> then I think the API should be more friendly to that.
>>>
>>> Tom
>>>
>>> On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon <
>>> gurwls...@gmail.com> wrote:
>>>
>>>
>>> Hi all,
>>>
>>> I would like to discuss Java specific APIs and which design we will
>>> choose.
>>> This has been discussed in multiple places so far, for example, at
>>> https://github.com/apache/spark/pull/28085#discussion_r407334754
>>>
>>>
>>> *The problem:*
>>>
>>> In short, I would like us to have clear guidance on how we support Java
>>> specific APIs when
>>> they need to return a Java instance. The problem is simple:
>>>
>>> def requests: Map[String, ExecutorResourceRequest] = ...
>>> def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...
>>>
>>> vs
>>>
>>> def requests: java.util.Map[String, ExecutorResourceRequest] = ...
>>>
>>>
>>> *Current codebase:*
>>>
>>> My understanding so far was that the latter is preferred and more
>>> consistent and prevailing in the
>>> existing codebase, for example, see StateOperatorProgress and
>>> StreamingQueryProgress in Structured Streaming.
>>> However, I realised that we also have other approaches in the current
>>> codebase. There appear to be
>>> four approaches to deal with Java specifics in general:
>>>
>>>1. Java specific classes such as JavaRDD and JavaSparkContext.
>>>2. Java specific methods with the same name that overload its
>>>parameters, see functions.scala.
>>>3. Java specific methods with a different name that needs to return
>>>a different type such as TaskContext.resourcesJMap vs
>>>TaskContext.resources.
>>>4. One method that returns a Java instance for both Scala and Java
>>>sides. see StateOperatorProgress and StreamingQueryProgress.
>>>
>>>
>>> *Analysis on the current codebase:*
>>>
>>> I agree with 2. approach because the corresponding cases give you a
>>> consistent API usage across
>>> other language APIs in general. Approach 1. is from the old world when
>>> we didn't have unified APIs.
>>> This might be the worst approach.
>>>
>>> 3. and 4. are controversial.
>>>
>>> For 3., if you have to use Java APIs, then, you should search if there
>>> is a variant of that API
>>> every time specifically for Java APIs. But yes, it gives you Java/Scala
>>> friendly instances.
>>>
>>> For 4., having one API that returns a Java instance lets you use it
>>> on both the Scala and Java API sides,
>>> although it makes you call asScala on the Scala side specifically.
>>> But you don’t
>>> have to search if there’s a variant of this API and it gives you a
>>> consistent API usage across languages.
>>>
>>> Also, note that calling Java in Scala is legitimate but the opposite
>>> case is not, to the best of my knowledge.
>>> In addition, you should have a method that returns a Java instance for
>>> PySpark or 

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-27 Thread Wenchen Fan
IIUC we are moving away from having two classes for Java and Scala, like
JavaRDD and RDD. A single class is much simpler to maintain and use.

I don't have a strong preference over option 3 or 4. We may need to collect
more data points from actual users.

On Mon, Apr 27, 2020 at 9:50 PM Hyukjin Kwon  wrote:

> Scala users are arguably more prevalent than Java users, yes.
> Using Java instances on the Scala side is legitimate, and they are already
> being used in multiple places. I don't believe Scala
> users find this not Scala-friendly, as it's legitimate and already being
> used. I personally find it more troublesome to make Java
> users search for which APIs to call. Yes, I understand the pros and cons -
> we should also find the balance considering the actual usage.
>
> One more argument from me, though: I think one of the goals of the Spark
> APIs is a unified API set, to my knowledge,
>  e.g., JavaRDD <> RDD vs DataFrame.
> If either way is not particularly preferred over the other, I would just
> choose the one that keeps the API set unified.
>
>
>
> 2020년 4월 27일 (월) 오후 10:37, Tom Graves 님이 작성:
>
>> I agree a general guidance is good so we keep the APIs consistent. I
>> don't necessarily agree that 4 is the best solution though. I agree it's
>> nice to have one API, but it is less friendly for the Scala side.
>> Searching for the equivalent Java api shouldn't be hard as it should be
>> very close in the name and if we make it a general rule users should
>> understand it.   I guess one good question is what API do most of our users
>> use between Java and Scala and what is the ratio?  I don't know the answer
>> to that. I've seen more using Scala over Java.  If the majority use Scala
>> then I think the API should be more friendly to that.
>>
>> Tom
>>
>> On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon <
>> gurwls...@gmail.com> wrote:
>>
>>
>> Hi all,
>>
>> I would like to discuss Java specific APIs and which design we will
>> choose.
>> This has been discussed in multiple places so far, for example, at
>> https://github.com/apache/spark/pull/28085#discussion_r407334754
>>
>>
>> *The problem:*
>>
>> In short, I would like us to have clear guidance on how we support Java
>> specific APIs when
>> they need to return a Java instance. The problem is simple:
>>
>> def requests: Map[String, ExecutorResourceRequest] = ...
>> def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...
>>
>> vs
>>
>> def requests: java.util.Map[String, ExecutorResourceRequest] = ...
>>
>>
>> *Current codebase:*
>>
>> My understanding so far was that the latter is preferred and more
>> consistent and prevailing in the
>> existing codebase, for example, see StateOperatorProgress and
>> StreamingQueryProgress in Structured Streaming.
>> However, I realised that we also have other approaches in the current
>> codebase. There appear to be
>> four approaches to deal with Java specifics in general:
>>
>>1. Java specific classes such as JavaRDD and JavaSparkContext.
>>2. Java specific methods with the same name that overload its
>>parameters, see functions.scala.
>>3. Java specific methods with a different name that needs to return a
>>different type such as TaskContext.resourcesJMap vs
>>TaskContext.resources.
>>4. One method that returns a Java instance for both Scala and Java
>>sides. see StateOperatorProgress and StreamingQueryProgress.
>>
>>
>> *Analysis on the current codebase:*
>>
>> I agree with 2. approach because the corresponding cases give you a
>> consistent API usage across
>> other language APIs in general. Approach 1. is from the old world when we
>> didn't have unified APIs.
>> This might be the worst approach.
>>
>> 3. and 4. are controversial.
>>
>> For 3., if you have to use Java APIs, then, you should search if there is
>> a variant of that API
>> every time specifically for Java APIs. But yes, it gives you Java/Scala
>> friendly instances.
>>
>> For 4., having one API that returns a Java instance lets you use
>> it on both the Scala and Java API sides,
>> although it makes you call asScala on the Scala side specifically. But
>> you don’t
>> have to search if there’s a variant of this API and it gives you a
>> consistent API usage across languages.
>>
>> Also, note that calling Java in Scala is legitimate but the opposite case
>> is not, to the best of my knowledge.
>> In addition, you should have a method that returns a Java instance for
>> PySpark or SparkR to support.
>>
>>
>> *Proposal:*
>>
>> I would like to have a general guidance on this that the Spark dev agrees
>> upon: Do 4. approach. If not possible, do 3. Avoid 1 almost at all cost.
>>
>> Note that this isn't a hard requirement but *a general guidance*;
>> therefore, the decision might be up to
>> the specific context. For example, when there are some strong arguments
>> to have a separate Java specific API, that’s fine.
>> Of course, we won’t change the existing methods given Michael’s rubric

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-27 Thread Hyukjin Kwon
Scala users are arguably more prevalent than Java users, yes.
Using Java instances on the Scala side is legitimate, and they are already
being used in multiple places. I don't believe Scala
users find this not Scala-friendly, as it's legitimate and already being
used. I personally find it more troublesome to make Java
users search for which APIs to call. Yes, I understand the pros and cons -
we should also find the balance considering the actual usage.

One more argument from me, though: I think one of the goals of the Spark
APIs is a unified API set, to my knowledge,
 e.g., JavaRDD <> RDD vs DataFrame.
If either way is not particularly preferred over the other, I would just
choose the one that keeps the API set unified.



2020년 4월 27일 (월) 오후 10:37, Tom Graves 님이 작성:

> I agree a general guidance is good so we keep the APIs consistent. I
> don't necessarily agree that 4 is the best solution though. I agree it's
> nice to have one API, but it is less friendly for the Scala side.
> Searching for the equivalent Java api shouldn't be hard as it should be
> very close in the name and if we make it a general rule users should
> understand it.   I guess one good question is what API do most of our users
> use between Java and Scala and what is the ratio?  I don't know the answer
> to that. I've seen more using Scala over Java.  If the majority use Scala
> then I think the API should be more friendly to that.
>
> Tom
>
> On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon <
> gurwls...@gmail.com> wrote:
>
>
> Hi all,
>
> I would like to discuss Java specific APIs and which design we will choose.
> This has been discussed in multiple places so far, for example, at
> https://github.com/apache/spark/pull/28085#discussion_r407334754
>
>
> *The problem:*
>
> In short, I would like us to have clear guidance on how we support Java
> specific APIs when
> they need to return a Java instance. The problem is simple:
>
> def requests: Map[String, ExecutorResourceRequest] = ...
> def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...
>
> vs
>
> def requests: java.util.Map[String, ExecutorResourceRequest] = ...
>
>
> *Current codebase:*
>
> My understanding so far was that the latter is preferred and more
> consistent and prevailing in the
> existing codebase, for example, see StateOperatorProgress and
> StreamingQueryProgress in Structured Streaming.
> However, I realised that we also have other approaches in the current
> codebase. There appear to be
> four approaches to deal with Java specifics in general:
>
>1. Java specific classes such as JavaRDD and JavaSparkContext.
>2. Java specific methods with the same name that overload its
>parameters, see functions.scala.
>3. Java specific methods with a different name that needs to return a
>different type such as TaskContext.resourcesJMap vs
>TaskContext.resources.
>4. One method that returns a Java instance for both Scala and Java
>sides. see StateOperatorProgress and StreamingQueryProgress.
>
>
> *Analysis on the current codebase:*
>
> I agree with 2. approach because the corresponding cases give you a
> consistent API usage across
> other language APIs in general. Approach 1. is from the old world when we
> didn't have unified APIs.
> This might be the worst approach.
>
> 3. and 4. are controversial.
>
> For 3., if you have to use Java APIs, then, you should search if there is
> a variant of that API
> every time specifically for Java APIs. But yes, it gives you Java/Scala
> friendly instances.
>
> For 4., having one API that returns a Java instance lets you use
> it on both the Scala and Java API sides,
> although it makes you call asScala on the Scala side specifically. But
> you don’t
> have to search if there’s a variant of this API and it gives you a
> consistent API usage across languages.
>
> Also, note that calling Java in Scala is legitimate but the opposite case
> is not, to the best of my knowledge.
> In addition, you should have a method that returns a Java instance for
> PySpark or SparkR to support.
>
>
> *Proposal:*
>
> I would like to have a general guidance on this that the Spark dev agrees
> upon: Do 4. approach. If not possible, do 3. Avoid 1 almost at all cost.
>
> Note that this isn't a hard requirement but *a general guidance*;
> therefore, the decision might be up to
> the specific context. For example, when there are some strong arguments to
> have a separate Java specific API, that’s fine.
> Of course, we won’t change the existing methods given Michael’s rubric
> added before. I am talking about new
> methods in unreleased branches.
>
> Any concern or opinion on this?
>


Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-27 Thread Tom Graves
I agree a general guidance is good so we keep the APIs consistent. I don't
necessarily agree that 4 is the best solution though. I agree it's nice to have
one API, but it is less friendly for the Scala side. Searching for the
equivalent Java API shouldn't be hard, as it should be very close in name,
and if we make it a general rule users should understand it. I guess one good
question is: which API do most of our users use between Java and Scala, and what
is the ratio? I don't know the answer to that. I've seen more using Scala than
Java. If the majority use Scala then I think the API should be more friendly
to that.
Tom
On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon 
 wrote:  
 
 
Hi all,

I would like to discuss Java specific APIs and which design we will choose.
This has been discussed in multiple places so far, for example, at
https://github.com/apache/spark/pull/28085#discussion_r407334754


The problem:

In short, I would like us to have clear guidance on how we support Java 
specific APIs when
they need to return a Java instance. The problem is simple:
def requests: Map[String, ExecutorResourceRequest] = ...
def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...

vs
def requests: java.util.Map[String, ExecutorResourceRequest] = ...
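
To make the trade-off concrete, a self-contained sketch of the latter design at
the call site (ResourceProfileLike and its contents are hypothetical):

    import scala.collection.JavaConverters._

    class ExecutorResourceRequest // hypothetical placeholder

    class ResourceProfileLike {
      private val reqs = new java.util.HashMap[String, ExecutorResourceRequest]()
      // one shared method: a single name with a Java return type
      def requests: java.util.Map[String, ExecutorResourceRequest] = reqs
    }

    object CallSite extends App {
      val profile = new ResourceProfileLike
      // Java callers use profile.requests() directly;
      // Scala callers pay one explicit conversion:
      val scalaView: Map[String, ExecutorResourceRequest] =
        profile.requests.asScala.toMap
      println(scalaView.size) // 0
    }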


Current codebase:

My understanding so far was that the latter is preferred and more consistent 
and prevailing in the
existing codebase, for example, see StateOperatorProgress and 
StreamingQueryProgress in Structured Streaming.
However, I realised that we also have other approaches in the current codebase. 
There appear to be
four approaches to deal with Java specifics in general:
   
   1. Java specific classes such as JavaRDD and JavaSparkContext.
   2. Java specific methods with the same name that overload its parameters,
   see functions.scala.
   3. Java specific methods with a different name that needs to return a
   different type such as TaskContext.resourcesJMap vs TaskContext.resources.
   4. One method that returns a Java instance for both Scala and Java sides.
   see StateOperatorProgress and StreamingQueryProgress.
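
For contrast with the call-site sketch above, a sketch of approach 3's
differently named twins (hypothetical names, modeled on the
resources/resourcesJMap pairing above):

    import scala.collection.JavaConverters._

    class ResourceInformation // hypothetical placeholder

    class TaskContextLike(private val res: Map[String, ResourceInformation]) {
      // Scala-facing method
      def resources: Map[String, ResourceInformation] = res
      // Java-facing twin with a different name and return type
      def resourcesJMap: java.util.Map[String, ResourceInformation] =
        res.asJava
    }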



Analysis on the current codebase:

I agree with 2. approach because the corresponding cases give you a consistent 
API usage across
other language APIs in general. Approach 1. is from the old world when we 
didn't have unified APIs.
This might be the worst approach.

3. and 4. are controversial.

For 3., if you have to use Java APIs, then, you should search if there is a 
variant of that API
every time specifically for Java APIs. But yes, it gives you Java/Scala 
friendly instances.

For 4., having one API that returns a Java instance lets you use it on
both the Scala and Java API sides,
although it makes you call asScala on the Scala side specifically. But you
don’t
have to search if there’s a variant of this API and it gives you a consistent 
API usage across languages.

Also, note that calling Java in Scala is legitimate but the opposite case is 
not, to the best of my knowledge.
In addition, you should have a method that returns a Java instance for PySpark 
or SparkR to support.


Proposal:

I would like to have a general guidance on this that the Spark dev agrees upon: 
Do 4. approach. If not possible, do 3. Avoid 1 almost at all cost.

Note that this isn't a hard requirement but general guidance; the decision may 
depend on the specific context. For example, when there are strong arguments 
for a separate Java-specific API, that's fine.
Of course, we won't change the existing methods, given Michael's rubric added 
before. I am talking about new methods in unreleased branches.

Any concern or opinion on this?
  

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-27 Thread Sean Owen
The guidance sounds fine, if the general message is 'keep it simple'.
The right approach may be fairly situational. For example, RDD has a
lot of methods that need a Java variant. Putting all the overloads in
one class might be harder to navigate than making a separate return
type with those methods, JavaRDD. (Also recall that all of the
overloads show up in docs and auto-complete, for example, which means
both Java and Scala users have to pick out which one is appropriate;
see the sketch below.)
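
A minimal sketch of that trade-off; MyDataset and JavaMyDataset are 
hypothetical stand-ins, not Spark classes:

// (a) Overloads in one class: every variant shows up in docs and auto-complete.
class MyDataset[T](private val data: Seq[T]) {
  def map[U](f: T => U): MyDataset[U] = new MyDataset(data.map(f))
  def map[U](f: java.util.function.Function[T, U]): MyDataset[U] =
    new MyDataset(data.map(f.apply))
}

// (b) A separate Java-facing wrapper in the style of JavaRDD: each class
// exposes only the signatures its audience needs.
class JavaMyDataset[T](private val ds: MyDataset[T]) {
  def map[U](f: java.util.function.Function[T, U]): JavaMyDataset[U] =
    new JavaMyDataset(ds.map(f))
}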

