unsubscribe

2020-05-12 Thread 刘杰



Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-12 Thread Jungtaek Lim
Never mind, forget about the dead code. It turns out that reverting manually
is quite easy: remove the config and update the tests, remove the field, then
remove the doc changes. It was only considered non-trivial because we were
thinking in terms of "git revert", but there's no strict rule that we must
revert that way.

Please take a look at this PR, https://github.com/apache/spark/pull/28517

On Wed, May 13, 2020 at 3:10 AM Russell Spitzer 
wrote:

> I think that the dead code approach, while a bit unpalatable and worse
> than reverting, is probably better than leaving the parameter (even if it
> is hidden)
>
> On Tue, May 12, 2020 at 12:46 PM Ryan Blue  wrote:
>
>> +1 for the approach Jungtaek suggests. That will avoid needing to support
>> behavior that is not well understood with minimal changes.
>>
>> On Tue, May 12, 2020 at 1:45 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Before I forget, we'd better not forget to change the doc, as create
>>> table doc looks to represent current syntax which will be incorrect later.
>>>
>>> On Tue, May 12, 2020 at 5:32 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 It's not only for end users, but also for us. Spark itself uses the
 config "true" and "false" in tests and it still brings confusion. We still
 have to deal with both situations.

 I'm wondering how long days it would be needed to revert it cleanly,
 but if we worry about the amount of code change just around the new RC,
 what about make the code dirty (should be fixed soon) but less headache via
 applying traditional (and bad) way?

 Let's just remove the config so that the config cannot be used in any
 way (even in Spark codebase), and set corresponding field in parser to the
 constant value so that no one can modify in any way. This would make the
 dead code by intention which should be cleaned it up later, so let's add
 FIXME comment there so that anyone can take it up for cleaning up the code
 later. (If no one volunteers then I'll probably pick up.)

 That is a bad pattern, but still better as we prevent end users (even
 early adopters) go through the undocumented path in any way, and that will
 be explicitly marked as "should be fixed". This is different from retaining
 config - I don't expect unified create table syntax will be landed in
 bugfix version, so even unified create table syntax can be landed in 3.1.0
 (this is also not guaranteed) the config will live in 3.0.x in any way. If
 we temporarily go dirty way then we can clean up the code in any version,
 even from bugfix version, maybe within a couple of weeks just after 3.0.0
 is released.

 Does it sound valid?

 On Tue, May 12, 2020 at 2:35 PM Wenchen Fan 
 wrote:

> SPARK-30098 was merged about 6 months ago. It's not a clean revert and
> we may need to spend quite a bit of time to resolve conflicts and fix 
> tests.
>
> I don't see why it's still a problem if a feature is disabled and
> hidden from end-users (it's undocumented, the config is internal). The
> related code will be replaced in the master branch sooner or later, when 
> we
> unify the syntaxes.
>
>
>
> On Tue, May 12, 2020 at 6:16 AM Ryan Blue 
> wrote:
>
>> I'm all for getting the unified syntax into master. The only issue
>> appears to be whether or not to pass the presence of the EXTERNAL keyword
>> through to a catalog in v2. Maybe it's time to start a discuss thread for
>> that issue so we're not stuck for another 6 weeks on it.
>>
>> On Mon, May 11, 2020 at 3:13 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Btw, another thing I wonder: is it good to retain the flag on
>>> master as an intermediate step? Wouldn't it be better for us to
>>> start the "unified create table syntax" from scratch?
>>>
>>>
>>> On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 I'm sorry, but I have to agree with Ryan and Russell. I chose option 1
 because it's less bad than option 2, but that doesn't mean I fully agree
 with it.

 Let's make the things below clear if we really go with option 1;
 otherwise, please consider reverting it.

 * Can you fully identify "all" the paths where the second create table
 syntax is taken?
 * Could you explain "why" to end users without any confusion? Do you
 think end users will understand it easily?
 * Do you have actual end users to guide to turn this on? Or do you have
 a plan to turn this on for your team/customers and deal with the
 ambiguity?
 * Could you please document how things will change if the flag is
 turned on?

 I guess option 1 is to leave the flag as an "undocumented" one and
 forget about the code path it turns on, but I think that would make the
 feature a "broken window" that we aren't even able to touch.
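To make the ambiguity under discussion concrete, here is a toy sketch (a minimal Python model with illustrative names and logic, not Spark's actual parser) of how a single legacy flag routes a bare CREATE TABLE statement down one of two paths:

```python
# Toy model of the debated behavior -- NOT Spark's real parser. The point:
# with two CREATE TABLE rules, a bare statement's meaning hinges on one flag.
def route_create_table(statement: str, create_hive_table_by_default: bool) -> str:
    """Return which create path a statement would take under this toy model."""
    if " USING " in statement.upper():
        # An explicit USING clause always selects the datasource path.
        return "datasource"
    # A bare CREATE TABLE is ambiguous: the flag alone decides the outcome.
    return "hive" if create_hive_table_by_default else "datasource"
```

Under this model, the same bare statement produces different kinds of tables depending on the flag, which is exactly the "flow chart" behavior being objected to in this thread.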

Clarification on Spark code comments

2020-05-12 Thread Neerav Kumar
Hi

I am new to the community, so pardon me if my question is framed incorrectly. I
was going through the Spark code base on GitHub and am confused by a comment.
In the file
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala
the comment says:
Example usage:
* {{{
* val (rdd1, rdd2, rdd3, ...) = ...
* val cp = new PeriodicRDDCheckpointer(2, sc)
* cp.update(rdd1)
* rdd1.count();
* // persisted: rdd1
* cp.update(rdd2)
* rdd2.count();
* // persisted: rdd1, rdd2
* // checkpointed: rdd2
* cp.update(rdd3)
* rdd3.count();
* // persisted: rdd1, rdd2, rdd3
* // checkpointed: rdd2 rdd3
* cp.update(rdd4)
* rdd4.count();
* // persisted: rdd2, rdd3, rdd4
* // checkpointed: rdd4
* cp.update(rdd5)
* rdd5.count();
* // persisted: rdd3, rdd4, rdd5
* // checkpointed: rdd4 rdd5
* }}}

The checkpointed values after rdd3.count() and rdd5.count() do not make sense
to me. I have crossed out the existing value and included the one I think makes
sense (the strikethrough was lost in plain text: in "rdd2 rdd3" above, "rdd2"
is the existing value and "rdd3" is my suggestion; likewise for "rdd4 rdd5").
Is my understanding incorrect, or is it a bug in the documentation?

Regards
Neerav
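For what it's worth, the cadence in that comment can be modeled with a small simulation. This is a toy model read off the example itself, not the real PeriodicRDDCheckpointer: it assumes a sliding window of 3 persisted RDDs, a checkpoint on every 2nd update, and only the most recent checkpoint retained.

```python
def simulate(n_updates: int, interval: int = 2) -> tuple[list[str], list[str]]:
    """Toy model of the doc example: returns (persisted, checkpointed) RDD names."""
    persisted: list[str] = []
    checkpointed: list[str] = []
    for i in range(1, n_updates + 1):
        persisted.append(f"rdd{i}")
        if len(persisted) > 3:          # keep a sliding window of 3 persisted RDDs
            persisted.pop(0)
        if i % interval == 0:           # checkpoint on every `interval`-th update
            checkpointed.append(f"rdd{i}")
            if len(checkpointed) > 1:   # retain only the most recent checkpoint
                checkpointed.pop(0)
    return persisted, checkpointed
```

Under this model, after rdd3.count() the checkpointed set is still [rdd2] and after rdd5.count() it is [rdd4], i.e. the values in the original comment are self-consistent: rdd3 and rdd5 are odd-numbered updates and never trigger a checkpoint.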

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-12 Thread Jungtaek Lim
It's a problem not only for end users, but also for us. Spark itself sets the
config to both "true" and "false" in tests, and that already brings confusion.
We still have to deal with both behaviors.

I'm not sure how many days a clean revert would take, but if we're worried
about the amount of code change right before the new RC, what about making the
code dirty (to be fixed soon) but less of a headache, via a traditional (and
bad) approach?

Let's just remove the config so that it cannot be used in any way (even in the
Spark codebase), and set the corresponding parser field to a constant value so
that no one can modify it in any way. This creates dead code by intention,
which should be cleaned up later, so let's add a FIXME comment there so that
anyone can take up cleaning the code later. (If no one volunteers, then I'll
probably pick it up.)

That is a bad pattern, but still better, because we prevent end users (even
early adopters) from going through the undocumented path in any way, and it
will be explicitly marked as "should be fixed". This is different from
retaining the config: I don't expect the unified create table syntax to land
in a bugfix version, so even if it lands in 3.1.0 (which is also not
guaranteed), the config would live on throughout 3.0.x. If we temporarily go
the dirty way, we can clean up the code in any version, even a bugfix version,
maybe within a couple of weeks after 3.0.0 is released.

Does that sound valid?
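The "pin the field to a constant" idea can be sketched as follows. This is a minimal Python sketch of the pattern with hypothetical names, not Spark's actual code:

```python
# Sketch of the proposed pattern (hypothetical names, not Spark's code):
# the config is gone, and the parser field is pinned to a constant, so no
# code path -- not even a test -- can enable the second CREATE TABLE syntax.
# FIXME: remove this intentional dead code once the unified syntax lands.
LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT = True  # formerly a removable config entry

class ToyParser:
    def __init__(self) -> None:
        # Previously read from the (now removed) config; a constant means the
        # legacy behavior is the only reachable path.
        self.create_hive_table_by_default = LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT

    def bare_create_table_path(self) -> str:
        return "hive" if self.create_hive_table_by_default else "datasource"
```

The point of the pattern: unlike a hidden config, there is no runtime knob left to flip, so the undocumented path is unreachable everywhere, and the FIXME marks the dead code for removal.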

On Tue, May 12, 2020 at 2:35 PM Wenchen Fan  wrote:

> SPARK-30098 was merged about 6 months ago. It's not a clean revert and we
> may need to spend quite a bit of time to resolve conflicts and fix tests.
>
> I don't see why it's still a problem if a feature is disabled and hidden
> from end-users (it's undocumented, the config is internal). The related
> code will be replaced in the master branch sooner or later, when we unify
> the syntaxes.
>
>
>
> On Tue, May 12, 2020 at 6:16 AM Ryan Blue 
> wrote:
>
>> I'm all for getting the unified syntax into master. The only issue
>> appears to be whether or not to pass the presence of the EXTERNAL keyword
>> through to a catalog in v2. Maybe it's time to start a discuss thread for
>> that issue so we're not stuck for another 6 weeks on it.
>>
>> On Mon, May 11, 2020 at 3:13 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Btw another wondering here is, is it good to retain the flag on master
>>> as an intermediate step? Wouldn't it be better for us to start "unified
>>> create table syntax" from scratch?
>>>
>>>
>>> On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 I'm sorry, but I have to agree with Ryan and Russell. I chose the
 option 1 because it's less worse than option 2, but it doesn't mean I fully
 agree with option 1.

 Let's make below things clear if we really go with option 1, otherwise
 please consider reverting it.

 * Do you fully indicate about "all" the paths where the second create
 table syntax is taken?
 * Could you explain "why" to end users without any confusion? Do you
 think end users will understand it easily?
 * Do you have an actual end users to guide to turn this on? Or do you
 have a plan to turn this on for your team/customers and deal with
 the ambiguity?
 * Could you please document about how things will change if the flag is
 turned on?

 I guess the option 1 is to leave a flag as "undocumented" one and
 forget about the path to turn on, but I think that would lead to make the
 feature be "broken window" even we are not able to touch.

 On Tue, May 12, 2020 at 6:45 AM Russell Spitzer <
 russell.spit...@gmail.com> wrote:

> I think reverting 30098 is the right decision here if we want to
> unblock 3.0. We shouldn't ship with features which we know do not function
> in the way we intend, regardless of how little exposure most users have to
> them. Even if it's off by default, we should probably work to avoid
> switches that cause things to behave unpredictably or require a flow chart
> to actually determine what will happen.
>
> On Mon, May 11, 2020 at 3:07 PM Ryan Blue 
> wrote:
>
>> I'm all for fixing behavior in master by turning this off as an
>> intermediate step, but I don't think that Spark 3.0 can safely include
>> SPARK-30098.
>>
>> The problem is that SPARK-30098 introduces strange behavior, as
>> Jungtaek pointed out. And that behavior is not fully understood. While
>> working on a unified CREATE TABLE syntax, I hit additional test
>> failures
>> 
>> where the wrong create path was being used.
>>
>> Unless we plan to NOT support the behavior
>> when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we
>> should not ship Spark 3.0 with 

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-12 Thread Jungtaek Lim
Before I forget: we should also update the docs, since the create table
doc appears to describe the current syntax, which will become incorrect later.

On Tue, May 12, 2020 at 5:32 PM Jungtaek Lim 
wrote:

> It's not only for end users, but also for us. Spark itself uses the config
> "true" and "false" in tests and it still brings confusion. We still have to
> deal with both situations.
>
> I'm wondering how long days it would be needed to revert it cleanly, but
> if we worry about the amount of code change just around the new RC, what
> about make the code dirty (should be fixed soon) but less headache via
> applying traditional (and bad) way?
>
> Let's just remove the config so that the config cannot be used in any way
> (even in Spark codebase), and set corresponding field in parser to the
> constant value so that no one can modify in any way. This would make the
> dead code by intention which should be cleaned it up later, so let's add
> FIXME comment there so that anyone can take it up for cleaning up the code
> later. (If no one volunteers then I'll probably pick up.)
>
> That is a bad pattern, but still better as we prevent end users (even
> early adopters) go through the undocumented path in any way, and that will
> be explicitly marked as "should be fixed". This is different from retaining
> config - I don't expect unified create table syntax will be landed in
> bugfix version, so even unified create table syntax can be landed in 3.1.0
> (this is also not guaranteed) the config will live in 3.0.x in any way. If
> we temporarily go dirty way then we can clean up the code in any version,
> even from bugfix version, maybe within a couple of weeks just after 3.0.0
> is released.
>
> Does it sound valid?
>
> On Tue, May 12, 2020 at 2:35 PM Wenchen Fan  wrote:
>
>> SPARK-30098 was merged about 6 months ago. It's not a clean revert and we
>> may need to spend quite a bit of time to resolve conflicts and fix tests.
>>
>> I don't see why it's still a problem if a feature is disabled and hidden
>> from end-users (it's undocumented, the config is internal). The related
>> code will be replaced in the master branch sooner or later, when we unify
>> the syntaxes.
>>
>>
>>
>> On Tue, May 12, 2020 at 6:16 AM Ryan Blue 
>> wrote:
>>
>>> I'm all for getting the unified syntax into master. The only issue
>>> appears to be whether or not to pass the presence of the EXTERNAL keyword
>>> through to a catalog in v2. Maybe it's time to start a discuss thread for
>>> that issue so we're not stuck for another 6 weeks on it.
>>>
>>> On Mon, May 11, 2020 at 3:13 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 Btw another wondering here is, is it good to retain the flag on master
 as an intermediate step? Wouldn't it be better for us to start "unified
 create table syntax" from scratch?


 On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:

> I'm sorry, but I have to agree with Ryan and Russell. I chose the
> option 1 because it's less worse than option 2, but it doesn't mean I 
> fully
> agree with option 1.
>
> Let's make below things clear if we really go with option 1, otherwise
> please consider reverting it.
>
> * Do you fully indicate about "all" the paths where the second create
> table syntax is taken?
> * Could you explain "why" to end users without any confusion? Do you
> think end users will understand it easily?
> * Do you have an actual end users to guide to turn this on? Or do you
> have a plan to turn this on for your team/customers and deal with
> the ambiguity?
> * Could you please document about how things will change if the flag
> is turned on?
>
> I guess the option 1 is to leave a flag as "undocumented" one and
> forget about the path to turn on, but I think that would lead to make the
> feature be "broken window" even we are not able to touch.
>
> On Tue, May 12, 2020 at 6:45 AM Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> I think reverting 30098 is the right decision here if we want to
>> unblock 3.0. We shouldn't ship with features which we know do not 
>> function
>> in the way we intend, regardless of how little exposure most users have 
>> to
>> them. Even if it's off by default, we should probably work to avoid
>> switches that cause things to behave unpredictably or require a flow 
>> chart
>> to actually determine what will happen.
>>
>> On Mon, May 11, 2020 at 3:07 PM Ryan Blue 
>> wrote:
>>
>>> I'm all for fixing behavior in master by turning this off as an
>>> intermediate step, but I don't think that Spark 3.0 can safely include
>>> SPARK-30098.
>>>
>>> The problem is that SPARK-30098 introduces strange behavior, as
>>> Jungtaek pointed out. And that behavior is not fully understood. While
>>> working on a unified CREATE TABLE syntax, I hit additional test failures
>>> where the wrong create path was being used.

unsubscribe

2020-05-12 Thread Kiran B
Thank you,
Kiran,


Re: [VOTE] Release Spark 2.4.6 (RC1)

2020-05-12 Thread Yuanjian Li
Thanks Holden and Dongjoon for the help!
The bugfix for SPARK-31663 is ready for review; I hope it can be picked up in
2.4.7 if possible.
https://github.com/apache/spark/pull/28501

Best,
Yuanjian

Takeshi Yamamuro wrote on Monday, May 11, 2020 at 9:03 AM:

> I checked on my MacOS env; all the tests
> with `-Pyarn -Phadoop-2.7 -Pdocker-integration-tests -Phive
> -Phive-thriftserver -Pmesos -Pkubernetes -Psparkr`
> passed and I couldn't find any issue;
>
> maropu@~:$java -version
> java version "1.8.0_181"
> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>
> Bests,
> Takeshi
>
>
> On Sun, May 10, 2020 at 2:50 AM Holden Karau  wrote:
>
>> Thanks Dongjoon :)
>> So it’s not a regression, but if it won’t be a large delay I think
>> holding for the correctness fix would be good (and we can pick up the two
>> issues fixed in 2.4.7). What does everyone think?
>>
>> On Fri, May 8, 2020 at 11:40 AM Dongjoon Hyun 
>> wrote:
>>
>>> I confirmed and updated the JIRA. SPARK-31663 is a correctness issue
>>> since Apache Spark 2.4.0.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Fri, May 8, 2020 at 10:26 AM Holden Karau 
>>> wrote:
>>>
 Can you provide a bit more context (is it a regression?)

 On Fri, May 8, 2020 at 9:33 AM Yuanjian Li 
 wrote:

> Hi Holden,
>
> I'm working on the bugfix of SPARK-31663
> , let me post it
> here since it's a correctness bug and also affects 2.4.6.
>
> Best,
> Yuanjian
>
Sean Owen wrote on Friday, May 8, 2020 at 11:42 PM:
>
>> +1 from me. The usual: sigs OK, license looks as intended, tests pass
>> from a source build for me.
>>
>> On Thu, May 7, 2020 at 1:29 PM Holden Karau 
>> wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 2.4.6.
>> >
>> > The vote is open until February 5th 11PM PST and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.4.6
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see
>> http://spark.apache.org/
>> >
>> > There are currently no issues targeting 2.4.6 (try project = SPARK
>> AND "Target Version/s" = "2.4.6" AND status in (Open, Reopened, "In
>> Progress"))
>> >
>> > We _may_ want to hold the 2.4.6 release for something targeted to
>> 2.4.7 ( project = SPARK AND "Target Version/s" = "2.4.7") , currently,
>> SPARK-24266 & SPARK-26908 and I believe there is some discussion on if we
>> should include SPARK-31399 in this release.
>> >
>> > The tag to be voted on is v2.4.5-rc2 (commit
>> a3cffc997035d11e1f6c092c1186e943f2f63544):
>> > https://github.com/apache/spark/tree/v2.4.6-rc1
>> >
>> > The release files, including signatures, digests, etc. can be found
>> at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> >
>> https://repository.apache.org/content/repositories/orgapachespark-1340/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-docs/
>> >
>> > The list of bug fixes going into 2.4.6 can be found at the
>> following URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12346781
>> >
>> > This release is using the release script of the tag v2.4.6-rc1.
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate,
>> then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and
>> install
>> > the current RC and see if anything important breaks, in the
>> Java/Scala
>> > you can add the staging repository to your projects resolvers and
>> test
>> > with the RC (make sure to clean up the artifact cache before/after
>> so
>> > you don't end up building with an out-of-date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 2.4.6?
>> > ===
>> >
>> > The current list of open tickets targeted at 2.4.5 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for
>> "Target Version/s" = 2.4.6
>> >
>> > Committers should look at those and 

Re: [VOTE] Release Spark 2.4.6 (RC1)

2020-05-12 Thread Holden Karau
Thanks. The 2.4.6 RC1 vote fails because we don’t have enough binding +1s.
I’ll start a new RC once SPARK-31663 is merged or next week, whichever comes first.

On Tue, May 12, 2020 at 7:28 AM Yuanjian Li  wrote:

> Thanks Holden and Dongjoon for the help!
> The bugfix for SPARK-31663 is ready for review, hope it can be picked up
> in 2.4.7 if possible.
> https://github.com/apache/spark/pull/28501
>
> Best,
> Yuanjian
>
> Takeshi Yamamuro wrote on Monday, May 11, 2020 at 9:03 AM:
>
>> I checked on my MacOS env; all the tests
>> with `-Pyarn -Phadoop-2.7 -Pdocker-integration-tests -Phive
>> -Phive-thriftserver -Pmesos -Pkubernetes -Psparkr`
>> passed and I couldn't find any issue;
>>
>> maropu@~:$java -version
>> java version "1.8.0_181"
>> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>>
>> Bests,
>> Takeshi
>>
>>
>> On Sun, May 10, 2020 at 2:50 AM Holden Karau 
>> wrote:
>>
>>> Thanks Dongjoon :)
>>> So it’s not a regression, but if it won’t be a large delay I think
>>> holding for the correctness fix would be good (and we can pick up the two
>>> issues fixed in 2.4.7). What does everyone think?
>>>
>>> On Fri, May 8, 2020 at 11:40 AM Dongjoon Hyun 
>>> wrote:
>>>
 I confirmed and updated the JIRA. SPARK-31663 is a correctness issue
 since Apache Spark 2.4.0.

 Bests,
 Dongjoon.

 On Fri, May 8, 2020 at 10:26 AM Holden Karau 
 wrote:

> Can you provide a bit more context (is it a regression?)
>
> On Fri, May 8, 2020 at 9:33 AM Yuanjian Li 
> wrote:
>
>> Hi Holden,
>>
>> I'm working on the bugfix of SPARK-31663
>> , let me post it
>> here since it's a correctness bug and also affects 2.4.6.
>>
>> Best,
>> Yuanjian
>>
>> Sean Owen wrote on Friday, May 8, 2020 at 11:42 PM:
>>
>>> +1 from me. The usual: sigs OK, license looks as intended, tests pass
>>> from a source build for me.
>>>
>>> On Thu, May 7, 2020 at 1:29 PM Holden Karau 
>>> wrote:
>>> >
>>> > Please vote on releasing the following candidate as Apache Spark
>>> version 2.4.6.
>>> >
>>> > The vote is open until February 5th 11PM PST and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 2.4.6
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>> >
>>> > There are currently no issues targeting 2.4.6 (try project = SPARK
>>> AND "Target Version/s" = "2.4.6" AND status in (Open, Reopened, "In
>>> Progress"))
>>> >
>>> > We _may_ want to hold the 2.4.6 release for something targeted to
>>> 2.4.7 ( project = SPARK AND "Target Version/s" = "2.4.7") , currently,
>>> SPARK-24266 & SPARK-26908 and I believe there is some discussion on if 
>>> we
>>> should include SPARK-31399 in this release.
>>> >
>>> > The tag to be voted on is v2.4.5-rc2 (commit
>>> a3cffc997035d11e1f6c092c1186e943f2f63544):
>>> > https://github.com/apache/spark/tree/v2.4.6-rc1
>>> >
>>> > The release files, including signatures, digests, etc. can be
>>> found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-bin/
>>> >
>>> > Signatures used for Spark RCs can be found in this file:
>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >
>>> > The staging repository for this release can be found at:
>>> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1340/
>>> >
>>> > The documentation corresponding to this release can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-docs/
>>> >
>>> > The list of bug fixes going into 2.4.6 can be found at the
>>> following URL:
>>> > https://issues.apache.org/jira/projects/SPARK/versions/12346781
>>> >
>>> > This release is using the release script of the tag v2.4.6-rc1.
>>> >
>>> > FAQ
>>> >
>>> > =
>>> > How can I help test this release?
>>> > =
>>> >
>>> > If you are a Spark user, you can help us test this release by
>>> taking
>>> > an existing Spark workload and running on this release candidate,
>>> then
>>> > reporting any regressions.
>>> >
>>> > If you're working in PySpark you can set up a virtual env and
>>> install
>>> > the current RC and see if anything important breaks, in the
>>> Java/Scala
>>> > you can add the staging repository to your projects resolvers and
>>> test
>>> > with the RC (make sure to clean up the artifact cache before/after
>>> so
>>> > you don't end up building with an out-of-date RC going forward).
>>> >
>>> > ===
>>> 

Re: [VOTE] Release Spark 2.4.6 (RC1)

2020-05-12 Thread Prashant Sharma
Hi Holden,

I am +1 on this release; the fix for SPARK-31663 can make it into the next
release as well.

Thanks,

On Tue, May 12, 2020 at 8:09 PM Holden Karau  wrote:

> Thanks. The 2.4.6 RC1 vote fails because we don’t have enough binding +1s,
> I’ll start a new RC once 31663 is merged or next week whichever is first.
>
> On Tue, May 12, 2020 at 7:28 AM Yuanjian Li 
> wrote:
>
>> Thanks Holden and Dongjoon for the help!
>> The bugfix for SPARK-31663 is ready for review, hope it can be picked up
>> in 2.4.7 if possible.
>> https://github.com/apache/spark/pull/28501
>>
>> Best,
>> Yuanjian
>>
>> Takeshi Yamamuro wrote on Monday, May 11, 2020 at 9:03 AM:
>>
>>> I checked on my MacOS env; all the tests
>>> with `-Pyarn -Phadoop-2.7 -Pdocker-integration-tests -Phive
>>> -Phive-thriftserver -Pmesos -Pkubernetes -Psparkr`
>>> passed and I couldn't find any issue;
>>>
>>> maropu@~:$java -version
>>> java version "1.8.0_181"
>>> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>>> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>>>
>>> Bests,
>>> Takeshi
>>>
>>>
>>> On Sun, May 10, 2020 at 2:50 AM Holden Karau 
>>> wrote:
>>>
 Thanks Dongjoon :)
 So it’s not a regression, but if it won’t be a large delay I think
 holding for the correctness fix would be good (and we can pick up the two
 issues fixed in 2.4.7). What does everyone think?

 On Fri, May 8, 2020 at 11:40 AM Dongjoon Hyun 
 wrote:

> I confirmed and updated the JIRA. SPARK-31663 is a correctness issue
> since Apache Spark 2.4.0.
>
> Bests,
> Dongjoon.
>
> On Fri, May 8, 2020 at 10:26 AM Holden Karau 
> wrote:
>
>> Can you provide a bit more context (is it a regression?)
>>
>> On Fri, May 8, 2020 at 9:33 AM Yuanjian Li 
>> wrote:
>>
>>> Hi Holden,
>>>
>>> I'm working on the bugfix of SPARK-31663
>>> , let me post it
>>> here since it's a correctness bug and also affects 2.4.6.
>>>
>>> Best,
>>> Yuanjian
>>>
>>> Sean Owen wrote on Friday, May 8, 2020 at 11:42 PM:
>>>
 +1 from me. The usual: sigs OK, license looks as intended, tests
 pass
 from a source build for me.

 On Thu, May 7, 2020 at 1:29 PM Holden Karau 
 wrote:
 >
 > Please vote on releasing the following candidate as Apache Spark
 version 2.4.6.
 >
 > The vote is open until February 5th 11PM PST and passes if a
 majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
 >
 > [ ] +1 Release this package as Apache Spark 2.4.6
 > [ ] -1 Do not release this package because ...
 >
 > To learn more about Apache Spark, please see
 http://spark.apache.org/
 >
 > There are currently no issues targeting 2.4.6 (try project =
 SPARK AND "Target Version/s" = "2.4.6" AND status in (Open, Reopened, 
 "In
 Progress"))
 >
 > We _may_ want to hold the 2.4.6 release for something targeted
 to 2.4.7 ( project = SPARK AND "Target Version/s" = "2.4.7") , 
 currently,
 SPARK-24266 & SPARK-26908 and I believe there is some discussion on if 
 we
 should include SPARK-31399 in this release.
 >
 > The tag to be voted on is v2.4.5-rc2 (commit
 a3cffc997035d11e1f6c092c1186e943f2f63544):
 > https://github.com/apache/spark/tree/v2.4.6-rc1
 >
 > The release files, including signatures, digests, etc. can be
 found at:
 > https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-bin/
 >
 > Signatures used for Spark RCs can be found in this file:
 > https://dist.apache.org/repos/dist/dev/spark/KEYS
 >
 > The staging repository for this release can be found at:
 >
 https://repository.apache.org/content/repositories/orgapachespark-1340/
 >
 > The documentation corresponding to this release can be found at:
 > https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-docs/
 >
 > The list of bug fixes going into 2.4.6 can be found at the
 following URL:
 > https://issues.apache.org/jira/projects/SPARK/versions/12346781
 >
 > This release is using the release script of the tag v2.4.6-rc1.
 >
 > FAQ
 >
 > =
 > How can I help test this release?
 > =
 >
 > If you are a Spark user, you can help us test this release by
 taking
 > an existing Spark workload and running on this release candidate,
 then
 > reporting any regressions.
 >
 > If you're working in PySpark you can set up a virtual env and
 install
 > the current RC and see if anything important breaks, in the
 

[Datasource V2] Exception Handling for Catalogs - Naming Suggestions

2020-05-12 Thread Russell Spitzer
Currently, for some actions, we receive an error during the analysis phase.
For example, "SELECT * FROM non_existent_table" returns an analysis
exception, as the NoSuchTableException is caught and replaced.

Other actions, like "ShowNamespaceExec", call catalog methods directly and
directly throw a NoSuchNamespaceException.

While I don't think the difference here is a big deal, it would be nice if
we could have just one set of behaviors. My interest here is being able to
throw custom NoSuchTable and NoSuchNamespace exceptions
which contain naming suggestions.

For example

*SELECT * from  catalog.keyspace.tuble*

could optionally return an analysis exception which suggests other names of
tables (or keyspaces) that are similar, e.g.:

"Could not find keyspace.tuble, found a near hit: keyspace.table"

While I already have the code to do this internally in my own catalog (from
previous functionality), I can no longer surface this information to the user
in DSv2 because of the way the exceptions are handled.


Given this, I'm wondering if we could wrap and rethrow the NoSuchExceptions,
or whether it would be better for catalogs to support an interface like:

trait SupportsTableNamingSuggestions extends TableCatalog {
  val matchThreshold = 0.7
  // JaroWinkler, or some other string comparison algorithm, scoring in [0, 1]
  val matcher: (String, String) => Double = JaroWinkler

  /** Given an identifier, return a list of identifiers it may have
   *  been mistaken for. */
  def tableSuggestions(missingIdent: Identifier): Seq[Identifier] = {
    for {
      table <- listTables(missingIdent.namespace).toSeq
      if matcher(missingIdent.name, table.name) > matchThreshold
    } yield table
  }
}
// And another trait for namespaces
// And another for namespaces

Then, on analysis failures, if the catalog supports this trait, we could
supplement the analysis failure with some helpful info. Leaving it as an
opt-in trait spares those implementations for which listing tables or
namespaces is costly.
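As a rough illustration of the suggestion matching itself, here is a self-contained sketch using Python's stdlib difflib ratio as a stand-in for Jaro-Winkler; the 0.7 threshold and the scorer choice are assumptions carried over from the trait sketch above.

```python
from difflib import SequenceMatcher

def table_suggestions(missing: str, existing: list[str],
                      threshold: float = 0.7) -> list[str]:
    """Return existing table names similar enough to the missing one.

    SequenceMatcher.ratio() stands in for Jaro-Winkler; both score in [0, 1].
    """
    return [
        name for name in existing
        if SequenceMatcher(None, missing, name).ratio() > threshold
    ]
```

For the example in this thread, a near-miss like "tuble" against "table" scores above the threshold and would be offered as the "near hit", while unrelated names are filtered out.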


Thanks for your consideration. This is the kind of feature that I think is
very useful to end users and that we can add at pretty limited cost.
Russ


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-12 Thread Ryan Blue
+1 for the approach Jungtaek suggests. That will avoid, with minimal changes,
needing to support behavior that is not well understood.

On Tue, May 12, 2020 at 1:45 AM Jungtaek Lim 
wrote:

> Before I forget, we'd better not forget to change the doc, as create table
> doc looks to represent current syntax which will be incorrect later.
>
> On Tue, May 12, 2020 at 5:32 PM Jungtaek Lim 
> wrote:
>
>> It's not only for end users, but also for us. Spark itself uses the
>> config "true" and "false" in tests and it still brings confusion. We still
>> have to deal with both situations.
>>
>> I'm wondering how many days it would take to revert it cleanly, but
>> if we're worried about the amount of code change this close to the new RC,
>> what about making the code dirty (to be fixed soon) but with less
>> headache, by applying the traditional (and bad) way?
>>
>> Let's just remove the config so that it cannot be used in any way
>> (even in the Spark codebase), and set the corresponding parser field to a
>> constant value so that no one can modify it in any way. This would create
>> dead code by intention, to be cleaned up later, so let's add a FIXME
>> comment there so that anyone can take it up for cleaning up the code
>> later. (If no one volunteers then I'll probably pick it up.)
>>
>> That is a bad pattern, but still better, as it prevents end users (even
>> early adopters) from going down the undocumented path in any way, and that
>> path will be explicitly marked as "should be fixed". This is different from
>> retaining the config - I don't expect the unified create table syntax to
>> land in a bugfix version, so even if it lands in 3.1.0 (which is also not
>> guaranteed), the config would live on in 3.0.x either way. If we
>> temporarily go the dirty way, we can clean up the code in any version,
>> even a bugfix version, maybe within a couple of weeks after 3.0.0 is
>> released.
>>
>> Does it sound valid?
>>
>> On Tue, May 12, 2020 at 2:35 PM Wenchen Fan  wrote:
>>
>>> SPARK-30098 was merged about 6 months ago. It's not a clean revert and
>>> we may need to spend quite a bit of time to resolve conflicts and fix tests.
>>>
>>> I don't see why it's still a problem if a feature is disabled and hidden
>>> from end-users (it's undocumented, the config is internal). The related
>>> code will be replaced in the master branch sooner or later, when we unify
>>> the syntaxes.
>>>
>>>
>>>
>>> On Tue, May 12, 2020 at 6:16 AM Ryan Blue 
>>> wrote:
>>>
 I'm all for getting the unified syntax into master. The only issue
 appears to be whether or not to pass the presence of the EXTERNAL keyword
 through to a catalog in v2. Maybe it's time to start a discuss thread for
 that issue so we're not stuck for another 6 weeks on it.

 On Mon, May 11, 2020 at 3:13 PM Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:

> By the way, another question: is it good to retain the flag on master
> as an intermediate step? Wouldn't it be better for us to start the "unified
> create table syntax" from scratch?
>
>
> On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> I'm sorry, but I have to agree with Ryan and Russell. I chose
>> option 1 because it's less bad than option 2, but that doesn't mean I
>> fully agree with option 1.
>>
>> Let's make the things below clear if we really go with option 1;
>> otherwise, please consider reverting it.
>>
>> * Have you fully identified "all" the paths where the second create
>> table syntax is taken?
>> * Could you explain "why" to end users without any confusion? Do you
>> think end users will understand it easily?
>> * Do you have actual end users to guide through turning this on? Or do you
>> have a plan to turn this on for your team/customers and deal with
>> the ambiguity?
>> * Could you please document how things will change when the flag
>> is turned on?
>>
>> I guess option 1 means leaving the flag "undocumented" and forgetting
>> about the path that turns it on, but I think that would make the
>> feature a "broken window" that we are not able to touch.
>>
>> On Tue, May 12, 2020 at 6:45 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> I think reverting 30098 is the right decision here if we want to
>>> unblock 3.0. We shouldn't ship with features which we know do not
>>> function in the way we intend, regardless of how little exposure most
>>> users have to them. Even if it's off by default, we should probably work
>>> to avoid switches that cause things to behave unpredictably or require a
>>> flow chart to actually determine what will happen.
>>>
>>> On Mon, May 11, 2020 at 3:07 PM Ryan Blue 
>>> wrote:
>>>
 I'm all for fixing behavior in master by turning this off as an

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-12 Thread Russell Spitzer
I think that the dead code approach, while a bit unpalatable and worse than
reverting, is probably better than leaving the parameter (even if it is
hidden).

On Tue, May 12, 2020 at 12:46 PM Ryan Blue  wrote:

> +1 for the approach Jungtaek suggests. That will avoid, with minimal
> changes, needing to support behavior that is not well understood.
>
> On Tue, May 12, 2020 at 1:45 AM Jungtaek Lim 
> wrote:
>
>> Before I forget: we should remember to change the doc, as the create
>> table doc appears to describe the current syntax, which will become incorrect later.
>>
>> On Tue, May 12, 2020 at 5:32 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> It's not only about end users, but also about us. Spark itself sets the
>>> config to "true" and "false" in tests, and it still brings confusion: we
>>> still have to deal with both situations.
>>>
>>> I'm wondering how many days it would take to revert it cleanly, but
>>> if we're worried about the amount of code change this close to the new RC,
>>> what about making the code dirty (to be fixed soon) but with less
>>> headache, by applying the traditional (and bad) way?
>>>
>>> Let's just remove the config so that it cannot be used in any way
>>> (even in the Spark codebase), and set the corresponding parser field to a
>>> constant value so that no one can modify it in any way. This would create
>>> dead code by intention, to be cleaned up later, so let's add a FIXME
>>> comment there so that anyone can take it up for cleaning up the code
>>> later. (If no one volunteers then I'll probably pick it up.)
>>>
>>> That is a bad pattern, but still better, as it prevents end users (even
>>> early adopters) from going down the undocumented path in any way, and that
>>> path will be explicitly marked as "should be fixed". This is different from
>>> retaining the config - I don't expect the unified create table syntax to
>>> land in a bugfix version, so even if it lands in 3.1.0 (which is also not
>>> guaranteed), the config would live on in 3.0.x either way. If we
>>> temporarily go the dirty way, we can clean up the code in any version,
>>> even a bugfix version, maybe within a couple of weeks after 3.0.0 is
>>> released.
>>>
>>> Does it sound valid?
>>>
>>> On Tue, May 12, 2020 at 2:35 PM Wenchen Fan  wrote:
>>>
 SPARK-30098 was merged about 6 months ago. It's not a clean revert and
 we may need to spend quite a bit of time to resolve conflicts and fix 
 tests.

 I don't see why it's still a problem if a feature is disabled and
 hidden from end-users (it's undocumented, the config is internal). The
 related code will be replaced in the master branch sooner or later, when we
 unify the syntaxes.



 On Tue, May 12, 2020 at 6:16 AM Ryan Blue 
 wrote:

> I'm all for getting the unified syntax into master. The only issue
> appears to be whether or not to pass the presence of the EXTERNAL keyword
> through to a catalog in v2. Maybe it's time to start a discuss thread for
> that issue so we're not stuck for another 6 weeks on it.
>
> On Mon, May 11, 2020 at 3:13 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> By the way, another question: is it good to retain the flag on
>> master as an intermediate step? Wouldn't it be better for us to
>> start the "unified create table syntax" from scratch?
>>
>>
>> On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> I'm sorry, but I have to agree with Ryan and Russell. I chose
>>> option 1 because it's less bad than option 2, but that doesn't mean I
>>> fully agree with option 1.
>>>
>>> Let's make the things below clear if we really go with option 1;
>>> otherwise, please consider reverting it.
>>>
>>> * Have you fully identified "all" the paths where the second
>>> create table syntax is taken?
>>> * Could you explain "why" to end users without any confusion? Do you
>>> think end users will understand it easily?
>>> * Do you have actual end users to guide through turning this on? Or do
>>> you have a plan to turn this on for your team/customers and deal with
>>> the ambiguity?
>>> * Could you please document how things will change when the flag
>>> is turned on?
>>>
>>> I guess option 1 means leaving the flag "undocumented" and forgetting
>>> about the path that turns it on, but I think that would make the
>>> feature a "broken window" that we are not able to touch.
>>>
>>> On Tue, May 12, 2020 at 6:45 AM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
 I think reverting 30098 is the right decision here if we want to
 unblock 3.0. We shouldn't ship with features which we know do not
 function in the way we intend, regardless of how little exposure most
 users have to them.