Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-11 Thread Wenchen Fan
SPARK-30098 was merged about 6 months ago. It's not a clean revert and we
may need to spend quite a bit of time to resolve conflicts and fix tests.

I don't see why it's still a problem if a feature is disabled and hidden
from end-users (it's undocumented, the config is internal). The related
code will be replaced in the master branch sooner or later, when we unify
the syntaxes.



On Tue, May 12, 2020 at 6:16 AM Ryan Blue  wrote:

> I'm all for getting the unified syntax into master. The only issue appears
> to be whether or not to pass the presence of the EXTERNAL keyword through
> to a catalog in v2. Maybe it's time to start a discuss thread for that
> issue so we're not stuck for another 6 weeks on it.
>
> On Mon, May 11, 2020 at 3:13 PM Jungtaek Lim 
> wrote:
>
> Btw, another thing I wonder: is it good to retain the flag on master as
> an intermediate step? Wouldn't it be better for us to start the "unified
> create table syntax" from scratch?
>>
>>
>> On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> I'm sorry, but I have to agree with Ryan and Russell. I chose option 1
>>> because it's less bad than option 2, but that doesn't mean I fully agree
>>> with option 1.
>>>
>>> Let's make the following things clear if we really go with option 1;
>>> otherwise please consider reverting it.
>>>
>>> * Have you fully identified "all" the paths where the second create
>>> table syntax is taken?
>>> * Could you explain "why" to end users without any confusion? Do you
>>> think end users will understand it easily?
>>> * Do you have actual end users to guide in turning this on? Or do you
>>> have a plan to turn this on for your team/customers and deal with
>>> the ambiguity?
>>> * Could you please document how things will change if the flag is
>>> turned on?
>>>
>>> My guess is that option 1 means leaving the flag "undocumented" and
>>> forgetting about the code path it enables, but I think that would turn
>>> the feature into a "broken window" that we are unable to touch.
>>>
>>> On Tue, May 12, 2020 at 6:45 AM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
 I think reverting 30098 is the right decision here if we want to
 unblock 3.0. We shouldn't ship with features which we know do not function
 in the way we intend, regardless of how little exposure most users have to
 them. Even if it's off by default, we should probably work to avoid
 switches that cause things to behave unpredictably or require a flow chart
 to actually determine what will happen.

 On Mon, May 11, 2020 at 3:07 PM Ryan Blue 
 wrote:

> I'm all for fixing behavior in master by turning this off as an
> intermediate step, but I don't think that Spark 3.0 can safely include
> SPARK-30098.
>
> The problem is that SPARK-30098 introduces strange behavior, as
> Jungtaek pointed out. And that behavior is not fully understood. While
> working on a unified CREATE TABLE syntax, I hit additional test
> failures where the wrong create path was being used.
>
> Unless we plan to NOT support the behavior
> when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we
> should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to 
> deal
> with this problem for years to come.
>
> On Mon, May 11, 2020 at 1:06 AM JackyLee  wrote:
>
>> +1. Agree with Xiao Li and Jungtaek Lim.
>>
>> This seems to be controversial, and cannot be done in a short time.
>> It is
>> necessary to choose option 1 to unblock Spark 3.0 and support it in
>> 3.1.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


ASF board report draft for May

2020-05-11 Thread Matei Zaharia
Hi all,

Our quarterly project board report needs to be submitted on May 13th, and I 
wanted to include anything notable going on that we want to appear in the board 
archive. Here is my draft below — let me know if you have suggested changes.

===

Apache Spark is a fast and general engine for large-scale data processing. It 
offers high-level APIs in Java, Scala, Python and R as well as a rich set of 
libraries including stream processing, machine learning, and graph analytics.

Project status:

- Progress is continuing on the upcoming Apache Spark 3.0 release, with the 
first votes on release candidates. This will be a major release with various 
API and SQL language updates, so we’ve tried to solicit broad input on it 
through two preview releases and a lot of JIRA and mailing list discussion.

- The community is also voting on a release candidate for Apache Spark 2.4.6, 
bringing bug fixes to the 2.4 branch.

Trademarks:

- Nothing new to report in the past 3 months.

Latest releases:

- Spark 2.4.5 was released on Feb 8th, 2020.
- Spark 3.0.0-preview2 was released on Dec 23rd, 2019.
- Spark 3.0.0-preview was released on Nov 6th, 2019.
- Spark 2.3.4 was released on Sept 9th, 2019.

Committers and PMC:

- The latest PMC member was added on Sept 4th, 2019 (Dongjoon Hyun).
- The latest committer was added on Sept 9th, 2019 (Weichen Xu). We also added
Ryan Blue, L.C. Hsieh, Gengliang Wang, Yuming Wang and Ruifeng Zheng as
committers in the past three months.
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-11 Thread Ryan Blue
I'm all for getting the unified syntax into master. The only issue appears
to be whether or not to pass the presence of the EXTERNAL keyword through
to a catalog in v2. Maybe it's time to start a discuss thread for that
issue so we're not stuck for another 6 weeks on it.

On Mon, May 11, 2020 at 3:13 PM Jungtaek Lim 
wrote:

> Btw, another thing I wonder: is it good to retain the flag on master as
> an intermediate step? Wouldn't it be better for us to start the "unified
> create table syntax" from scratch?
>
>
> On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim 
> wrote:
>
>> I'm sorry, but I have to agree with Ryan and Russell. I chose option 1
>> because it's less bad than option 2, but that doesn't mean I fully agree
>> with option 1.
>>
>> Let's make the following things clear if we really go with option 1;
>> otherwise please consider reverting it.
>>
>> * Have you fully identified "all" the paths where the second create
>> table syntax is taken?
>> * Could you explain "why" to end users without any confusion? Do you
>> think end users will understand it easily?
>> * Do you have actual end users to guide in turning this on? Or do you
>> have a plan to turn this on for your team/customers and deal with
>> the ambiguity?
>> * Could you please document how things will change if the flag is
>> turned on?
>>
>> My guess is that option 1 means leaving the flag "undocumented" and
>> forgetting about the code path it enables, but I think that would turn
>> the feature into a "broken window" that we are unable to touch.
>>
>> On Tue, May 12, 2020 at 6:45 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> I think reverting 30098 is the right decision here if we want to unblock
>>> 3.0. We shouldn't ship with features which we know do not function in the
>>> way we intend, regardless of how little exposure most users have to them.
>>> Even if it's off by default, we should probably work to avoid switches that
>>> cause things to behave unpredictably or require a flow chart to actually
>>> determine what will happen.
>>>
>>> On Mon, May 11, 2020 at 3:07 PM Ryan Blue 
>>> wrote:
>>>
 I'm all for fixing behavior in master by turning this off as an
 intermediate step, but I don't think that Spark 3.0 can safely include
 SPARK-30098.

 The problem is that SPARK-30098 introduces strange behavior, as
 Jungtaek pointed out. And that behavior is not fully understood. While
 working on a unified CREATE TABLE syntax, I hit additional test
 failures where the wrong create path was being used.

 Unless we plan to NOT support the behavior
 when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we
 should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal
 with this problem for years to come.

 On Mon, May 11, 2020 at 1:06 AM JackyLee  wrote:

> +1. Agree with Xiao Li and Jungtaek Lim.
>
> This seems to be controversial, and cannot be done in a short time.
> It is
> necessary to choose option 1 to unblock Spark 3.0 and support it in
> 3.1.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

 --
 Ryan Blue
 Software Engineer
 Netflix

>>>

-- 
Ryan Blue
Software Engineer
Netflix


unsubscribe

2020-05-11 Thread Chenguang He
unsubscribe

-- 

Chenguang He


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-11 Thread Jungtaek Lim
Btw, another thing I wonder: is it good to retain the flag on master as
an intermediate step? Wouldn't it be better for us to start the "unified
create table syntax" from scratch?


On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim 
wrote:

> I'm sorry, but I have to agree with Ryan and Russell. I chose option 1
> because it's less bad than option 2, but that doesn't mean I fully agree
> with option 1.
>
> Let's make the following things clear if we really go with option 1;
> otherwise please consider reverting it.
>
> * Have you fully identified "all" the paths where the second create
> table syntax is taken?
> * Could you explain "why" to end users without any confusion? Do you
> think end users will understand it easily?
> * Do you have actual end users to guide in turning this on? Or do you
> have a plan to turn this on for your team/customers and deal with
> the ambiguity?
> * Could you please document how things will change if the flag is
> turned on?
>
> My guess is that option 1 means leaving the flag "undocumented" and
> forgetting about the code path it enables, but I think that would turn
> the feature into a "broken window" that we are unable to touch.
>
> On Tue, May 12, 2020 at 6:45 AM Russell Spitzer 
> wrote:
>
>> I think reverting 30098 is the right decision here if we want to unblock
>> 3.0. We shouldn't ship with features which we know do not function in the
>> way we intend, regardless of how little exposure most users have to them.
>> Even if it's off by default, we should probably work to avoid switches that
>> cause things to behave unpredictably or require a flow chart to actually
>> determine what will happen.
>>
>> On Mon, May 11, 2020 at 3:07 PM Ryan Blue 
>> wrote:
>>
>>> I'm all for fixing behavior in master by turning this off as an
>>> intermediate step, but I don't think that Spark 3.0 can safely include
>>> SPARK-30098.
>>>
>>> The problem is that SPARK-30098 introduces strange behavior, as Jungtaek
>>> pointed out. And that behavior is not fully understood. While working on a
>>> unified CREATE TABLE syntax, I hit additional test failures
>>> where the wrong create path was being used.
>>>
>>> Unless we plan to NOT support the behavior
>>> when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we
>>> should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal
>>> with this problem for years to come.
>>>
>>> On Mon, May 11, 2020 at 1:06 AM JackyLee  wrote:
>>>
 +1. Agree with Xiao Li and Jungtaek Lim.

 This seems to be controversial, and cannot be done in a short time. It
 is
 necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.



 --
 Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-11 Thread Jungtaek Lim
I'm sorry, but I have to agree with Ryan and Russell. I chose option 1
because it's less bad than option 2, but that doesn't mean I fully agree
with option 1.

Let's make the following things clear if we really go with option 1;
otherwise please consider reverting it.

* Have you fully identified "all" the paths where the second create table
syntax is taken?
* Could you explain "why" to end users without any confusion? Do you think
end users will understand it easily?
* Do you have actual end users to guide in turning this on? Or do you have
a plan to turn this on for your team/customers and deal with the ambiguity?
* Could you please document how things will change if the flag is
turned on?

My guess is that option 1 means leaving the flag "undocumented" and
forgetting about the code path it enables, but I think that would turn the
feature into a "broken window" that we are unable to touch.

On Tue, May 12, 2020 at 6:45 AM Russell Spitzer 
wrote:

> I think reverting 30098 is the right decision here if we want to unblock
> 3.0. We shouldn't ship with features which we know do not function in the
> way we intend, regardless of how little exposure most users have to them.
> Even if it's off by default, we should probably work to avoid switches that
> cause things to behave unpredictably or require a flow chart to actually
> determine what will happen.
>
> On Mon, May 11, 2020 at 3:07 PM Ryan Blue 
> wrote:
>
>> I'm all for fixing behavior in master by turning this off as an
>> intermediate step, but I don't think that Spark 3.0 can safely include
>> SPARK-30098.
>>
>> The problem is that SPARK-30098 introduces strange behavior, as Jungtaek
>> pointed out. And that behavior is not fully understood. While working on a
>> unified CREATE TABLE syntax, I hit additional test failures
>> where the wrong create path was being used.
>>
>> Unless we plan to NOT support the behavior
>> when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we
>> should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal
>> with this problem for years to come.
>>
>> On Mon, May 11, 2020 at 1:06 AM JackyLee  wrote:
>>
>>> +1. Agree with Xiao Li and Jungtaek Lim.
>>>
>>> This seems to be controversial, and cannot be done in a short time. It
>>> is
>>> necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-11 Thread Russell Spitzer
I think reverting 30098 is the right decision here if we want to unblock
3.0. We shouldn't ship with features which we know do not function in the
way we intend, regardless of how little exposure most users have to them.
Even if it's off by default, we should probably work to avoid switches that
cause things to behave unpredictably or require a flow chart to actually
determine what will happen.

On Mon, May 11, 2020 at 3:07 PM Ryan Blue  wrote:

> I'm all for fixing behavior in master by turning this off as an
> intermediate step, but I don't think that Spark 3.0 can safely include
> SPARK-30098.
>
> The problem is that SPARK-30098 introduces strange behavior, as Jungtaek
> pointed out. And that behavior is not fully understood. While working on a
> unified CREATE TABLE syntax, I hit additional test failures
> where
> the wrong create path was being used.
>
> Unless we plan to NOT support the behavior
> when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we
> should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal
> with this problem for years to come.
>
> On Mon, May 11, 2020 at 1:06 AM JackyLee  wrote:
>
>> +1. Agree with Xiao Li and Jungtaek Lim.
>>
>> This seems to be controversial, and cannot be done in a short time. It is
>> necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re:Re: Re:Re: Screen Shot 2020-05-11 at 5.28.03 AM

2020-05-11 Thread zhangliyun



  Hi all:
  Thanks for your reply. The job has been hung for 20+ hours, and the history
server has deleted the log. I will keep monitoring and try to use a thread
dump to find something.


Best Regards


Kelly Zhang











At 2020-05-11 15:41:29, "ZHANG Wei"  wrote:
>Sometimes, the Thread dump result table of Spark UI can provide some clues to 
>find out thread locks issue, such as:
>
>  Thread ID | Thread Name  | Thread State | Thread Locks
>  13| NonBlockingInputStreamThread | WAITING  | Blocked by Thread 
> Some(48) Lock(jline.internal.NonBlockingInputStream@103008951})
>  48| Thread-16| RUNNABLE | 
> Monitor(jline.internal.NonBlockingInputStream@103008951})
>
>And each thread row shows its call stack when clicked; in this
>case, for thread 48, these are the frames of the function holding the lock:
>
>  org.fusesource.jansi.internal.Kernel32.ReadConsoleInputW(Native Method)
>  
> org.fusesource.jansi.internal.Kernel32.readConsoleInputHelper(Kernel32.java:811)
>  org.fusesource.jansi.internal.Kernel32.readConsoleKeyInput(Kernel32.java:842)
>  
> org.fusesource.jansi.internal.WindowsSupport.readConsoleInput(WindowsSupport.java:97)
>  jline.WindowsTerminal.readConsoleInput(WindowsTerminal.java:222)
>  
>
>Cheers,
>-z
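
[Editor's note: the thread-dump table above surfaces what the JVM's own
monitor-deadlock detection reports. As a rough, standalone sketch, the same
information can be pulled programmatically; the probe threads and names below
are invented purely to manufacture a deadlock, and this is not Spark's actual
UI code.]

```scala
import java.lang.management.ManagementFactory
import java.util.concurrent.CountDownLatch

object DeadlockProbe extends App {
  val lockA = new Object
  val lockB = new Object
  val ready = new CountDownLatch(2)

  // Start a daemon thread that grabs `first`, waits for its peer, then
  // blocks forever trying to grab `second`.
  def spawn(name: String, first: Object, second: Object): Unit = {
    val t = new Thread(() => {
      first.synchronized {
        ready.countDown()
        ready.await()            // both threads now hold their first lock
        second.synchronized {}   // ...and block forever on the other's
      }
    }, name)
    t.setDaemon(true)
    t.start()
  }

  spawn("probe-1", lockA, lockB)
  spawn("probe-2", lockB, lockA)
  Thread.sleep(500)              // give both threads time to reach BLOCKED

  val mx  = ManagementFactory.getThreadMXBean
  val ids = Option(mx.findDeadlockedThreads()) // null when no deadlock exists
  for (info <- ids.toSeq.flatMap(mx.getThreadInfo(_))) {
    println(s"${info.getThreadName} ${info.getThreadState} blocked on " +
      s"${info.getLockName} held by ${info.getLockOwnerName}")
  }
}
```

When a monitor deadlock exists, this prints one line per involved thread,
much like the Thread Locks column in the UI table quoted above.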
>
>
>From: zhangliyun 
>Sent: Monday, May 11, 2020 9:44
>To: Russell Spitzer; Spark Dev List
>Subject: Re:Re: Screen Shot 2020-05-11 at 5.28.03 AM
>
>
>Hi
>
> I appreciate your reply.
> I guess you want me to look at the executor page. I went to that page; if
> there is a deadlock, will the thread state say "Dead Lock"? Which clue
> should I use to find the reason why there appear to be running tasks but
> actually there are none?
>
>
>
>
>At 2020-05-11 08:55:25, "Russell Spitzer"  wrote:
>
>Have you checked the executor thread dumps? It may give you some insight if 
>there is a deadlock or something else.
>
>They should be available under the executor tab on the ui
>
>On Sun, May 10, 2020, 4:43 PM zhangliyun 
>mailto:kelly...@126.com>> wrote:
>Hi all:
>   I have a Spark 2.3.1 job that has been stuck for 23 hours. When I go to the
> Spark history server, it shows that 5039 of 5043 tasks have finished, so 4
> should still be running, but when I go to the tasks page there are no running
> tasks. I have downloaded the logs and grepped stdout for "Dropping event from
> queue", with no results, so the hang does not seem to be caused by
> "spark.scheduler.listenerbus.eventqueue.capacity" being too small. I would
> appreciate any suggestions for finding out why the job is stuck.
> [screenshot: there are no running tasks in the running stage]
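
[Editor's note: for context, the setting mentioned above can be raised when
listener events are being dropped. A minimal sketch, with an illustrative
value (the default has been on the order of 10000 in recent releases); the
app name is invented:]

```scala
import org.apache.spark.SparkConf

// Raise the listener-bus event queue capacity. When the queue overflows,
// the driver logs "Dropping event from queue ..." and UI/task state in the
// history server can go stale.
val conf = new SparkConf()
  .setAppName("listener-bus-capacity-sketch")
  .set("spark.scheduler.listenerbus.eventqueue.capacity", "20000")
```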
>
>-
>To unsubscribe e-mail: 
>dev-unsubscr...@spark.apache.org
>
>
>
>


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-11 Thread Ryan Blue
I'm all for fixing behavior in master by turning this off as an
intermediate step, but I don't think that Spark 3.0 can safely include
SPARK-30098.

The problem is that SPARK-30098 introduces strange behavior, as Jungtaek
pointed out. And that behavior is not fully understood. While working on a
unified CREATE TABLE syntax, I hit additional test failures
where
the wrong create path was being used.

Unless we plan to NOT support the behavior
when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we
should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal
with this problem for years to come.
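
[Editor's note: to make the ambiguity concrete, here is a minimal sketch. The
table names and columns are invented, and the flag's direction is inferred
from its name and this thread, so treat it as illustrative rather than
authoritative.]

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("create-table-ambiguity-sketch")
  // The internal, undocumented flag discussed above; judging by its name,
  // when enabled a CREATE TABLE without USING takes the Hive create path.
  .config("spark.sql.legacy.createHiveTableByDefault.enabled", "true")
  .enableHiveSupport()
  .getOrCreate()

// Unambiguous: the USING clause pins the native data-source rule.
spark.sql("CREATE TABLE t1 (id BIGINT, data STRING) USING parquet")

// Ambiguous: without USING, which of the two parser rules fires depends
// on the flag above, which is what makes the behavior hard to explain.
spark.sql("CREATE TABLE t2 (id BIGINT, data STRING)")
```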

On Mon, May 11, 2020 at 1:06 AM JackyLee  wrote:

> +1. Agree with Xiao Li and Jungtaek Lim.
>
> This seems to be controversial, and cannot be done in a short time. It is
> necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Java specific APIs design concern and choice

2020-05-11 Thread Hyukjin Kwon
Had a short sync with Tom. I am going to postpone this for now since this
case is very unlikely - I have seen it twice in the last 5 years.
We'll go for a vote if we see this more often, and make a decision
based on the feedback in the vote thread.


On Mon, May 11, 2020 at 11:08 PM, Hyukjin Kwon wrote:

> The guide is our official guide, see "Code Style Guide" in
> http://spark.apache.org/contributing.html.
> As I said, this is general guidance rather than a hard, strict policy. I
> don't aim to change existing APIs either.
> I would rather not start a vote while there is a clear objection to
> address, Tom. I would like to address it first.
>
> > So as I've already stated and it looks like 2 others have issues with
> number 4 as written as well, I'm against you posting this as is.  I do not
> think we should recommend 4 for public user facing Scala API
>
> Your main argument seems to be Scala/Java friendliness (and that the Java
> user base is smaller than the Scala one).
> The first argument is not quite correct, because using Java types is
> covered in the official Scala guide. As I mentioned, it is not awkward if
> we use `Array` for both Scala and Java, as an example.
> Such cases are very few, and it seems best to stick to what Spark has done
> so far to support a single API for both Scala and Java.
>
>
> On Mon, May 11, 2020 at 10:45 PM, Tom Graves wrote:
>
>> So as I've already stated and it looks like 2 others have issues with
>> number 4 as written as well, I'm against you posting this as is.  I do not
>> think we should recommend 4 for public user facing Scala API.
>>
>> Also note the page you linked is a Databricks page, while I know we
>> reference it as a style guide, I do not believe we should be putting API
>> policy on that page, it should live on an Apache Spark page.
>>
>> I think if you want to implement an API policy like this it should go
>> through an official vote thread, not just a discuss thread where we have
>> not had a lot of feedback on it.
>>
>> Tom
>>
>>
>>
>> On Monday, May 11, 2020, 06:44:31 AM CDT, Hyukjin Kwon <
>> gurwls...@gmail.com> wrote:
>>
>>
>> I will wait a couple of more days and if there's no objection I hear, I
>> will document this at
>> https://github.com/databricks/scala-style-guide#java-interoperability.
>>
>> On Thu, May 7, 2020 at 9:18 PM, Hyukjin Kwon wrote:
>>
>> Hi all, I would like to proceed this. Are there more thoughts on this? If
>> not, I would like to go ahead with the proposal here.
>>
>> On Thu, Apr 30, 2020 at 10:54 PM, Hyukjin Kwon wrote:
>>
>> Nothing is urgent. I just don't want to leave it undecided and just keep
>> adding Java APIs inconsistently as it's currently happening.
>>
>> We should have a set of coherent APIs. It's very difficult to change APIs
>> once they are out in releases. I guess I have seen people here agree with
>> having a general guidance for the same reason at least - please let me know
>> if I'm taking it wrong.
>>
>> I don't think we should assume Java programmers know how Scala works with
>> Java types. Less assumption might be better.
>>
>> I feel like we have things on the table to consider at this moment and
>> not much point of waiting indefinitely.
>>
>> But sure maybe I am wrong. We can wait for more feedback for a couple of
>> days.
>>
>>
>> On Thu, 30 Apr 2020, 18:59 ZHANG Wei,  wrote:
>>
>> I feel a little pushed... :-) I still don't get the point of why it's
>> urgent to make the decision now. AFAIK, it's common practice for Java
>> programmers to handle Scala type conversions themselves when they
>> invoke Scala libraries. I'm not sure which one is the Java programmers'
>> root complaint, Scala type instance or Scala Jar file.
>>
>> My 2 cents.
>>
>> --
>> Cheers,
>> -z
>>
>> On Thu, 30 Apr 2020 09:17:37 +0900
>> Hyukjin Kwon  wrote:
>>
>> > There was a typo in the previous email. I am re-sending:
>> >
>> > Hm, I thought you meant you prefer 3. over 4 but don't mind
>> particularly.
>> > I don't mean to wait for more feedback. It looks likely just a deadlock
>> > which will be the worst case.
>> > I was suggesting to pick one way first, and stick to it. If we find out
>> > something later, we can discuss
>> > more about changing it later.
>> >
>> > Having separate Java specific API (3. way)
>> >   - causes maintenance cost
>> >   - makes users search for the Java API every time
>> >   - this looks to go against the unified API set Spark has targeted
>> > so far.
>> >
>> > I don't completely buy the argument about Scala/Java friendly because
>> using
>> > Java instance is already documented in the official Scala documentation.
>> > Users still need to search if we have Java specific methods for *some*
>> APIs.
>> >
>> > On Thu, Apr 30, 2020 at 8:58 AM, Hyukjin Kwon wrote:
>> >
>> > > Hm, I thought you meant you prefer 3. over 4 but don't mind
>> particularly.
>> > > I don't mean to wait for more feedback. It looks likely just a
>> deadlock
>> > > which will be the worst case.
>> > > I was suggesting to pick one way first, and stick to it. If we find
>>

Re: [DISCUSS] Java specific APIs design concern and choice

2020-05-11 Thread Hyukjin Kwon
The guide is our official guide, see "Code Style Guide" in
http://spark.apache.org/contributing.html.
As I said, this is general guidance rather than a hard, strict policy. I
don't aim to change existing APIs either.
I would rather not start a vote while there is a clear objection to
address, Tom. I would like to address it first.

> So as I've already stated and it looks like 2 others have issues with
number 4 as written as well, I'm against you posting this as is.  I do not
think we should recommend 4 for public user facing Scala API

Your main argument seems to be Scala/Java friendliness (and that the Java
user base is smaller than the Scala one).
The first argument is not quite correct, because using Java types is covered
in the official Scala guide. As I mentioned, it is not awkward if we use
`Array` for both Scala and Java, as an example.
Such cases are very few, and it seems best to stick to what Spark has done
so far to support a single API for both Scala and Java.
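
[Editor's note: an invented illustration (not an actual Spark API) of the two
styles being debated, a parallel Java-specific method versus a single method
using a type that is natural in both languages:]

```scala
import java.util.{Arrays, List => JList}

// Parallel Java-specific method: two APIs to maintain, and Java callers
// must discover the java-prefixed variant.
class MetricsA {
  def labels: Seq[String] = Seq("a", "b")
  def javaLabels: JList[String] = Arrays.asList(labels: _*)
}

// Single method with a Java-friendly type: `Array` is idiomatic from both
// Scala and Java, so one API serves both languages.
class MetricsB {
  def labels: Array[String] = Array("a", "b")
}
```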


On Mon, May 11, 2020 at 10:45 PM, Tom Graves wrote:

> So as I've already stated and it looks like 2 others have issues with
> number 4 as written as well, I'm against you posting this as is.  I do not
> think we should recommend 4 for public user facing Scala API.
>
> Also note the page you linked is a Databricks page, while I know we
> reference it as a style guide, I do not believe we should be putting API
> policy on that page, it should live on an Apache Spark page.
>
> I think if you want to implement an API policy like this it should go
> through an official vote thread, not just a discuss thread where we have
> not had a lot of feedback on it.
>
> Tom
>
>
>
> On Monday, May 11, 2020, 06:44:31 AM CDT, Hyukjin Kwon <
> gurwls...@gmail.com> wrote:
>
>
> I will wait a couple of more days and if there's no objection I hear, I
> will document this at
> https://github.com/databricks/scala-style-guide#java-interoperability.
>
> On Thu, May 7, 2020 at 9:18 PM, Hyukjin Kwon wrote:
>
> Hi all, I would like to proceed this. Are there more thoughts on this? If
> not, I would like to go ahead with the proposal here.
>
> On Thu, Apr 30, 2020 at 10:54 PM, Hyukjin Kwon wrote:
>
> Nothing is urgent. I just don't want to leave it undecided and just keep
> adding Java APIs inconsistently as it's currently happening.
>
> We should have a set of coherent APIs. It's very difficult to change APIs
> once they are out in releases. I guess I have seen people here agree with
> having a general guidance for the same reason at least - please let me know
> if I'm taking it wrong.
>
> I don't think we should assume Java programmers know how Scala works with
> Java types. Less assumption might be better.
>
> I feel like we have things on the table to consider at this moment and not
> much point of waiting indefinitely.
>
> But sure maybe I am wrong. We can wait for more feedback for a couple of
> days.
>
>
> On Thu, 30 Apr 2020, 18:59 ZHANG Wei,  wrote:
>
> I feel a little pushed... :-) I still don't get the point of why it's
> urgent to make the decision now. AFAIK, it's common practice for Java
> programmers to handle Scala type conversions themselves when they
> invoke Scala libraries. I'm not sure which one is the Java programmers'
> root complaint, Scala type instance or Scala Jar file.
>
> My 2 cents.
>
> --
> Cheers,
> -z
>
> On Thu, 30 Apr 2020 09:17:37 +0900
> Hyukjin Kwon  wrote:
>
> > There was a typo in the previous email. I am re-sending:
> >
> > Hm, I thought you meant you prefer 3. over 4 but don't mind particularly.
> > I don't mean to wait for more feedback. It looks likely just a deadlock
> > which will be the worst case.
> > I was suggesting to pick one way first, and stick to it. If we find out
> > something later, we can discuss
> > more about changing it later.
> >
> > Having separate Java specific API (3. way)
> >   - causes maintenance cost
> >   - makes users search for the Java API every time
> >   - this looks to go against the unified API set Spark has targeted
> > so far.
> >
> > I don't completely buy the argument about Scala/Java friendly because
> using
> > Java instance is already documented in the official Scala documentation.
> > Users still need to search if we have Java specific methods for *some*
> APIs.
> >
> > On Thu, Apr 30, 2020 at 8:58 AM, Hyukjin Kwon wrote:
> >
> > > Hm, I thought you meant you prefer 3. over 4 but don't mind
> particularly.
> > > I don't mean to wait for more feedback. It looks likely just a deadlock
> > > which will be the worst case.
> > > I was suggesting to pick one way first, and stick to it. If we find out
> > > something later, we can discuss
> > > more about changing it later.
> > >
> > > Having separate Java specific API (4. way)
> > >   - causes maintenance cost
> > >   - makes users search for the Java API every time
> > >   - this looks to go against the unified API set Spark has targeted
> > > so far.
> > >
> > > I don't completely buy the argument about Scala/Java friendly because
> > > using Java instance is already docum

Re: [DISCUSS] Java specific APIs design concern and choice

2020-05-11 Thread Tom Graves
 So as I've already stated and it looks like 2 others have issues with number 4 
as written as well, I'm against you posting this as is.  I do not think we 
should recommend 4 for public user facing Scala API.
Also note the page you linked is a Databricks page, while I know we reference 
it as a style guide, I do not believe we should be putting API policy on that 
page, it should live on an Apache Spark page.
I think if you want to implement an API policy like this it should go through 
an official vote thread, not just a discuss thread where we have not had a lot 
of feedback on it.
Tom


On Monday, May 11, 2020, 06:44:31 AM CDT, Hyukjin Kwon 
 wrote:  
 
 I will wait a couple of more days and if there's no objection I hear, I will 
document this at 
https://github.com/databricks/scala-style-guide#java-interoperability.
On Thu, May 7, 2020 at 9:18 PM, Hyukjin Kwon wrote:

Hi all, I would like to proceed this. Are there more thoughts on this? If not, 
I would like to go ahead with the proposal here.

On Thu, Apr 30, 2020 at 10:54 PM, Hyukjin Kwon wrote:
Nothing is urgent. I just don't want to leave it undecided and keep adding 
Java APIs inconsistently, as is currently happening.
We should have a coherent set of APIs. It's very difficult to change APIs once 
they are out in releases. I believe I have seen people here agree with having 
general guidance, for the same reason at least - please let me know if I'm 
reading that wrong.
I don't think we should assume Java programmers know how Scala works with Java 
types. Fewer assumptions might be better.
I feel like we have enough on the table to consider at this moment, and not 
much point in waiting indefinitely.
But sure maybe I am wrong. We can wait for more feedback for a couple of days.
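[Editor's note: a hedged sketch of the trade-off being debated, with entirely hypothetical class and method names (these are not actual Spark classes). One side keeps a separate Java-specific overload next to each Scala method; the other exposes a single method taking the Java-friendly type:]

```scala
import scala.collection.JavaConverters._

// "Separate Java-specific API": a Java overload lives next to the Scala
// method, so every new API is added (and maintained) twice.
class SeparateApi {
  def options(opts: Map[String, String]): this.type = this               // Scala-friendly
  def options(opts: java.util.Map[String, String]): this.type =          // Java-friendly twin
    options(opts.asScala.toMap)
}

// "Single unified API": one method that takes the Java type; Scala callers
// pass a java.util.Map (or convert), as the Scala documentation describes.
class UnifiedApi {
  def options(opts: java.util.Map[String, String]): this.type = this
}

object Demo {
  def main(args: Array[String]): Unit = {
    val javaMap = new java.util.HashMap[String, String]()
    javaMap.put("path", "/tmp/data")
    new SeparateApi().options(javaMap)          // overload resolution picks the Java variant
    new SeparateApi().options(Map("k" -> "v"))  // overload resolution picks the Scala variant
    new UnifiedApi().options(javaMap)           // the same call works from Scala and Java
  }
}
```

[The maintenance-cost point above is visible in `SeparateApi`: each API gains a second signature that must be kept in sync, while `UnifiedApi` keeps one signature at the cost of Scala callers handling the Java type.]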

On Thu, 30 Apr 2020, 18:59 ZHANG Wei,  wrote:

I feel a little pushed... :-) I still don't get why it's
urgent to make the decision now. AFAIK, it's common practice for
Java programmers to handle Scala type conversions themselves when
invoking Scala libraries. I'm not sure which one is the Java programmers'
root complaint: Scala type instances or the Scala jar file.

My 2 cents.

-- 
Cheers,
-z

On Thu, 30 Apr 2020 09:17:37 +0900
Hyukjin Kwon  wrote:

> There was a typo in the previous email. I am re-sending:
> 
> Hm, I thought you meant you prefer 3 over 4 but don't feel strongly about it.
> I don't mean to wait for more feedback; that looks like just a deadlock,
> which would be the worst case.
> I was suggesting we pick one way first and stick to it. If we find out
> something later, we can discuss changing it then.
> 
> Having a separate Java-specific API (option 3)
>   - causes maintenance cost
>   - makes users search for the Java-specific API every time
>   - looks like the opposite of the unified API set Spark has targeted
> so far.
> 
> I don't completely buy the Scala/Java friendliness argument, because using
> Java instances is already documented in the official Scala documentation.
> Users still need to check whether we have Java-specific methods for *some* APIs.
> 
> 2020년 4월 30일 (목) 오전 8:58, Hyukjin Kwon 님이 작성:
> 
> > Hm, I thought you meant you prefer 3 over 4 but don't feel strongly about it.
> > I don't mean to wait for more feedback; that looks like just a deadlock,
> > which would be the worst case.
> > I was suggesting we pick one way first and stick to it. If we find out
> > something later, we can discuss changing it then.
> >
> > Having a separate Java-specific API (option 4)
> >   - causes maintenance cost
> >   - makes users search for the Java-specific API every time
> >   - looks like the opposite of the unified API set Spark has targeted
> > so far.
> >
> > I don't completely buy the Scala/Java friendliness argument, because
> > using Java instances is already documented in the official Scala
> > documentation.
> > Users still need to check whether we have Java-specific methods for *some*
> > APIs.
> >
> >
> >
> > On Thu, 30 Apr 2020, 00:06 Tom Graves,  wrote:
> >
> >> Sorry, I'm not sure what your last email means. Does it mean you are
> >> putting it up for a vote, or just waiting to get more feedback? I disagree
> >> with saying option 4 is the rule, but I agree that having a general rule
> >> makes sense. I think we need a lot more input to make the rule, as it
> >> affects the APIs.
> >>
> >> Tom
> >>
> >> On Wednesday, April 29, 2020, 09:53:22 AM CDT, Hyukjin Kwon <
> >> gurwls...@gmail.com> wrote:
> >>
> >>
> >> I am not seeing explicit objections here; rather, people seem to agree
> >> with the proposal in general.
> >> I would like to step forward rather than leaving it as a deadlock - the
> >> worst choice here is to postpone and abandon this discussion with this
> >> inconsistency.
> >>
> >> I don't currently plan to document this, as the cases are rather
> >> rare, and we haven't really documented the JavaRDD <> RDD vs DataFrame
> >> case either.
> >> Let's keep monitoring and see if this discussion thread clarifies things

Re: [DISCUSS] Java specific APIs design concern and choice

2020-05-11 Thread Hyukjin Kwon
I will wait a couple more days, and if I hear no objections, I
will document this at
https://github.com/databricks/scala-style-guide#java-interoperability.

2020년 5월 7일 (목) 오후 9:18, Hyukjin Kwon 님이 작성:

> Hi all, I would like to proceed with this. Are there more thoughts on this? If
> not, I would like to go ahead with the proposal here.
>
> 2020년 4월 30일 (목) 오후 10:54, Hyukjin Kwon 님이 작성:
>
>> Nothing is urgent. I just don't want to leave it undecided and keep
>> adding Java APIs inconsistently, as is currently happening.
>>
>> We should have a set of coherent APIs. It's very difficult to change APIs
>> once they are out in releases. I guess I have seen people here agree with
>> having a general guidance for the same reason at least - please let me know
>> if I'm taking it wrong.
>>
>> I don't think we should assume Java programmers know how Scala works with
>> Java types. Fewer assumptions might be better.
>>
>> I feel like we have enough on the table to consider at this moment, and
>> not much point in waiting indefinitely.
>>
>> But sure maybe I am wrong. We can wait for more feedback for a couple of
>> days.
>>
>>
>> On Thu, 30 Apr 2020, 18:59 ZHANG Wei,  wrote:
>>
>>> I feel a little pushed... :-) I still don't get why it's
>>> urgent to make the decision now. AFAIK, it's common practice for
>>> Java programmers to handle Scala type conversions themselves when
>>> invoking Scala libraries. I'm not sure which one is the Java programmers'
>>> root complaint: Scala type instances or the Scala jar file.
>>>
>>> My 2 cents.
>>>
>>> --
>>> Cheers,
>>> -z
>>>
>>> On Thu, 30 Apr 2020 09:17:37 +0900
>>> Hyukjin Kwon  wrote:
>>>
>>> > There was a typo in the previous email. I am re-sending:
>>> >
>>> > Hm, I thought you meant you prefer 3. over 4 but don't mind
>>> particularly.
>>> > I don't mean to wait for more feedback. It looks likely just a deadlock
>>> > which will be the worst case.
>>> > I was suggesting to pick one way first, and stick to it. If we find out
>>> > something later, we can discuss
>>> > more about changing it later.
>>> >
>>> > Having a separate Java-specific API (option 3)
>>> >   - causes maintenance cost
>>> >   - makes users search for the Java-specific API every time
>>> >   - looks like the opposite of the unified API set Spark has
>>> targeted
>>> > so far.
>>> >
>>> > I don't completely buy the argument about Scala/Java friendly because
>>> using
>>> > Java instance is already documented in the official Scala
>>> documentation.
>>> > Users still need to search if we have Java specific methods for *some*
>>> APIs.
>>> >
>>> > 2020년 4월 30일 (목) 오전 8:58, Hyukjin Kwon 님이 작성:
>>> >
>>> > > Hm, I thought you meant you prefer 3. over 4 but don't mind
>>> particularly.
>>> > > I don't mean to wait for more feedback. It looks likely just a
>>> deadlock
>>> > > which will be the worst case.
>>> > > I was suggesting to pick one way first, and stick to it. If we find
>>> out
>>> > > something later, we can discuss
>>> > > more about changing it later.
>>> > >
>>> > > Having a separate Java-specific API (option 4)
>>> > >   - causes maintenance cost
>>> > >   - makes users search for the Java-specific API every time
>>> > >   - looks like the opposite of the unified API set Spark has
>>> targeted
>>> > > so far.
>>> > >
>>> > > I don't completely buy the argument about Scala/Java friendly because
>>> > > using Java instance is already documented in the official Scala
>>> > > documentation.
>>> > > Users still need to search if we have Java specific methods for
>>> *some*
>>> > > APIs.
>>> > >
>>> > >
>>> > >
>>> > > On Thu, 30 Apr 2020, 00:06 Tom Graves,  wrote:
>>> > >
>>> > >> Sorry I'm not sure what your last email means. Does it mean you are
>>> > >> putting it up for a vote or just waiting to get more feedback?  I
>>> disagree
>>> > >> with saying option 4 is the rule but agree having a general rule
>>> makes
>>> > >> sense.  I think we need a lot more input to make the rule as it
>>> affects the
>>> > >> api's.
>>> > >>
>>> > >> Tom
>>> > >>
>>> > >> On Wednesday, April 29, 2020, 09:53:22 AM CDT, Hyukjin Kwon <
>>> > >> gurwls...@gmail.com> wrote:
>>> > >>
>>> > >>
>>> > >> I think I am not seeing explicit objection here but rather see
>>> people
>>> > >> tend to agree with the proposal in general.
>>> > >> I would like to step forward rather than leaving it as a deadlock -
>>> the
>>> > >> worst choice here is to postpone and abandon this discussion with
>>> this
>>> > >> inconsistency.
>>> > >>
>>> > >> I don't currently target to document this as the cases are rather
>>> > >> rare, and we haven't really documented JavaRDD <> RDD vs DataFrame
>>> case as
>>> > >> well.
>>> > >> Let's keep monitoring and see if this discussion thread clarifies
>>> things
>>> > >> enough in such cases I mentioned.
>>> > >>
>>> > >> Let me know if you guys think differently.
>>> > >>
>>> > >>
>>> > >> 2020년 4월 28일 (화) 오후 5:03, Hyukjin Kwon 님이 작성:
>>> > >>
>>> > >> Spark has targeted to have a 

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-11 Thread JackyLee
+1. Agree with Xiao Li and Jungtaek Lim.

This seems to be controversial and cannot be done in a short time. It is
necessary to choose option 1 to unblock Spark 3.0, and support it in 3.1.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Accessing temp tables

2020-05-11 Thread ML Books
Hi all,

Can someone guide me in accessing temp tables created in Spark from
Hive/beeline?

Regards,
Vikas
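
[Editor's note: a hedged sketch of one possible answer, assuming a spark-shell session where the `spark` SparkSession is predefined, and that beeline connects to a Spark Thrift Server running inside the same Spark application. Ordinary temp views are session-scoped, so they are not visible from a different session:]

```scala
// Sketch, not authoritative: a plain temp view is visible only within the
// session that created it. A *global* temporary view is instead tied to the
// Spark application and is exposed under the reserved `global_temp` database,
// so another session served by the same application (e.g. a beeline session
// connected to that application's Thrift Server) can query it.
spark.sql("CREATE GLOBAL TEMPORARY VIEW my_view AS SELECT 1 AS id")

// From beeline connected to the same Spark Thrift Server application:
//   SELECT * FROM global_temp.my_view;
```

[If beeline connects to a standalone HiveServer2 rather than a Spark Thrift Server, neither kind of temp view is reachable; only persisted tables are shared through the metastore.]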


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-11 Thread Xiao Li
>
> 1. Turn on spark.sql.legacy.createHiveTableByDefault.enabled by default,
> which effectively reverts SPARK-30098. The CREATE TABLE syntax is still
> confusing, but it's the same as 2.4.
> 2. Do not support the v2 CreateTable command if STORED AS/BY or EXTERNAL is
> specified. This gives us more time to think about how to do it in 3.1.
>

I prefer to first turn on *spark.sql.legacy.createHiveTableByDefault.enabled*
by default, and then start RC2.

We can still continue trying option 2 if we can finish it within 10
days. BTW, we still have multiple ongoing discussions about the data source v2
APIs. To be honest, most Spark users will not hit these cases in Spark 3.0.
Thus, temporarily blocking a few cases in DSv2 looks reasonable to me. We
can support them in Spark 3.1.

Xiao
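
[Editor's note: for readers following along, a minimal sketch of the ambiguity under discussion, assuming a spark-shell session where `spark` is predefined; table names are illustrative:]

```scala
// Native syntax: always handled by Spark's own CREATE TABLE path.
spark.sql("CREATE TABLE t_native (id INT) USING parquet")

// Bare CREATE TABLE (no USING / STORED AS clause): before SPARK-30098 this
// created a Hive table; the change routed it to the native data source path
// instead, which is the ambiguity between the two "create table" rules.
spark.sql("CREATE TABLE t_bare (id INT)")

// The internal flag discussed in this thread restores the 2.4 behavior of
// treating bare CREATE TABLE as a Hive table:
spark.sql("SET spark.sql.legacy.createHiveTableByDefault.enabled=true")
```

[Option 1 in this thread amounts to flipping that flag on by default for 3.0.]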





On Sun, May 10, 2020 at 9:32 PM Jungtaek Lim 
wrote:

> Let's focus on how to unblock Spark 3.0.0 for now, as other blockers are
> getting resolved.
>
> I'm in favor of option 1, to avoid bringing multiple backward-incompatible
> changes. Unifying CREATE TABLE would bring backward incompatibility (I'd
> rather say the new syntax should be cleaned up, ignoring backward
> compatibility), and we'd better not force end users to adopt the
> changes twice.
>
> On Fri, May 8, 2020 at 11:22 PM Wenchen Fan  wrote:
>
>> Hi all,
>>
>> I'd like to bring this up again to share the status and get more
>> feedback. Currently, we all agree to unify the CREATE TABLE syntax by
>> merging the native and Hive-style syntaxes.
>>
>> The unified CREATE TABLE syntax will become the native syntax and there
>> is no Hive-style syntax anymore. This brings several changes:
>> 1. support PARTITIONED BY (col type, ...). This can't co-exist with
>> PARTITIONED BY (col, ...), and simply adds partition columns to the end.
>> 2. support SKEWED BY, which just fails
>> 3. support STORED AS/BY, which can't co-exist with USING provider
>> 4. support EXTERNAL as well
>>
>> All the behaviors will remain the same as before, for the builtin
>> catalog. However, the native CREATE TABLE syntax needs to support the v2
>> CreateTable command and we need to translate the new syntax changes to
>> catalog plugin API calls, and we are still working on reaching an agreement
>> about how to do it.
>>
>> To unblock 3.0, I think there are two choices:
>> 1. Turn on spark.sql.legacy.createHiveTableByDefault.enabled by default,
>> which effectively reverts SPARK-30098. The CREATE TABLE syntax is still
>> confusing, but it's the same as 2.4.
>> 2. Do not support the v2 CreateTable command if STORED AS/BY or EXTERNAL is
>> specified. This gives us more time to think about how to do it in 3.1.
>>
>> If you have other ideas, please reply to this thread.
>>
>> Thanks,
>> Wenchen
>>
>> On Thu, Mar 26, 2020 at 7:28 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Thanks, filed SPARK-31257. Thanks again for
>>> looking into this - I'll take a look as soon as I get time.
>>>
>>> On Thu, Mar 26, 2020 at 8:06 AM Ryan Blue  wrote:
>>>
 Feel free to open another issue, I just used that one since it
 describes this and doesn't appear to be done.

 On Wed, Mar 25, 2020 at 4:03 PM Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:

> UPDATE: Sorry, I just missed the PR (
> https://github.com/apache/spark/pull/28026). I still think it'd be
> nice to avoid recycling a JIRA issue that was resolved before. Shall we
> file a new JIRA issue linked to SPARK-30098, and set a proper
> priority?
>
> On Thu, Mar 26, 2020 at 7:59 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Would it be better to prioritize this to make sure the change is
>> included in Spark 3.0? (Maybe by filing an issue and setting it as a blocker.)
>>
>> It looks like there's consensus that SPARK-30098 brought an ambiguity
>> that should be fixed (though opinions on its severity seem to
>> differ), and once we've noticed the issue it would be really odd to
>> publish it as is and try to fix it later (the fix may not even be
>> included in 3.0.x, as it might bring a behavioral change).
>>
>> On Tue, Mar 24, 2020 at 3:37 PM Wenchen Fan 
>> wrote:
>>
>>> Hi Ryan,
>>>
>>> It's great to hear that you are cleaning up this long-standing mess.
>>> Please let me know if you hit any problems that I can help with.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan 
 wrote:

> 2. PARTITIONED BY colTypeList: I think we can support it in the
> unified syntax. Just make sure it doesn't appear together with 
> PARTITIONED
> BY transformList.
>

 Another side note: Perhaps as part of (or after) unifying the
>