Spark lacks fault tolerance with dynamic partition overwrite

2020-04-02 Thread Koert Kuipers
I wanted to highlight the issue we are facing with dynamic partition
overwrite.

It seems that any task that writes to disk using this feature and then
needs to be retried fails upon retry, leading to a failure for the entire
job.

We have seen this issue show up with preemption (the task gets killed by
preemption, and when it gets rescheduled it fails consistently). It can
also show up if a hardware issue causes your task to fail, or if you have
speculative execution enabled.

The relevant JIRAs are SPARK-30320 and SPARK-29302.

This affects Spark 2.4.x and Spark 3.0.0-SNAPSHOT.
Writing to Hive does not seem to be affected.
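
For anyone trying to reproduce this, a minimal sketch of the write pattern
in question (the path and toy data below are hypothetical; the failure
itself only shows up when a task attempt is retried):

    import org.apache.spark.sql.SparkSession

    // Enable dynamic partition overwrite: only the partitions present in
    // the written DataFrame get replaced, rather than the whole table.
    val spark = SparkSession.builder()
      .appName("dynamic-partition-overwrite")
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "2020-04-01"), (2, "2020-04-02")).toDF("id", "date")

    // Per the JIRAs above, a retried or speculative attempt of a task
    // performing this write fails consistently, failing the whole job.
    df.write
      .mode("overwrite")
      .partitionBy("date")
      .parquet("/tmp/events")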

Best,
Koert


Fwd: Automatic PR labeling

2020-04-02 Thread Hyukjin Kwon
It seems this email missed cc'ing the mailing list; forwarding it for
trackability.

-- Forwarded message -
From: Ismaël Mejía 
Date: Thu, Apr 2, 2020 at 4:46 PM
Subject: Re: Automatic PR labeling
To: Hyukjin Kwon 


+1

Just for reference, there is a really simple GitHub App for this:
https://github.com/mithro/autolabeler

You just have to configure a simple YAML file with the paths to match and
the labels to apply; as an example, this is the one we are using for
Apache Avro:
https://github.com/apache/avro/blob/master/.github/autolabeler.yml
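
For illustration, the mapping in such a file looks roughly like the sketch
below; the label names and path globs here are invented, and the Avro file
above shows the schema the app actually expects:

    # Hypothetical autolabeler.yml: label name -> path globs to match
    SQL:
      - "sql/**"
    PYTHON:
      - "python/**"
    DOCS:
      - "docs/**"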

Then someone (ideally from the PMC) should file a ticket with INFRA to
have it installed for the project; for reference:
https://issues.apache.org/jira/browse/INFRA-17367


Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-02 Thread Takeshi Yamamuro
Also, I think the 3.0 release had better include all the SQL documentation
updates:
https://issues.apache.org/jira/browse/SPARK-28588

On Fri, Apr 3, 2020 at 12:36 AM Sean Owen  wrote:

> (If it wasn't stated explicitly, yeah I think we knew there were a few
> important unresolved issues and that this RC was going to fail. Let's
> all please test anyway of course, to flush out any additional issues,
> rather than wait. Pipelining and all that.)
>
> On Thu, Apr 2, 2020 at 10:31 AM Maxim Gekk 
> wrote:
> >
> > -1 (non-binding)
> >
> > The problem of compatibility with Spark 2.4 in reading/writing
> > dates/timestamps hasn't been completely solved yet. In particular, the
> > sub-task https://issues.apache.org/jira/browse/SPARK-31328 hasn't been
> > resolved yet.
> >
> > Maxim Gekk
> >
> > Software Engineer
> >
> > Databricks, Inc.
> >
> >
> >
> > On Wed, Apr 1, 2020 at 7:09 PM Ryan Blue 
> wrote:
> >>
> >> -1 (non-binding)
> >>
> >> I agree with Jungtaek. The change to create datasource tables instead
> of Hive tables by default (no USING or STORED AS clauses) has created
> confusing behavior and should either be rolled back or fixed before 3.0.

-- 
---
Takeshi Yamamuro


Re: Automatic PR labeling

2020-04-02 Thread Hyukjin Kwon
Awesome!

On Fri, Apr 3, 2020 at 7:13 AM, Nicholas Chammas wrote:

> SPARK-31330: Automatically label PRs based on the paths they touch
>
> On Wed, Apr 1, 2020 at 11:34 PM Hyukjin Kwon  wrote:
>
>> @Nicholas Chammas  Would you be interested
>> in taking a look? I would love for this to be done.
>>
>> On Wed, Mar 25, 2020 at 10:30 AM, Hyukjin Kwon wrote:
>>
>>> That would be cool. There was a bit of discussion about which account
>>> should do the labeling. If we can replace it, I think that sounds great!
>>>
>>> On Wed, Mar 25, 2020 at 5:08 AM, Nicholas Chammas wrote:
>>>
 Public Service Announcement: There is a GitHub action that lets you
 automatically label PRs based on what paths they modify.

 https://github.com/actions/labeler

 If we set this up, perhaps down the line we can update the PR dashboard
 and PR merge script to use the tags.

 cc @Dongjoon Hyun , who may be interested in
 this.

 Nick

>>>
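
For reference, actions/labeler is driven by a similar mapping from label
names to path globs, plus a small workflow that runs the action. A minimal
sketch, with label names and paths invented for illustration (Spark's
actual setup is whatever SPARK-31330 lands on):

    # .github/labeler.yml -- illustrative label-to-glob mapping
    CORE:
      - core/**/*
    SQL:
      - sql/**/*
    PYTHON:
      - python/**/*

    # .github/workflows/labeler.yml -- minimal workflow running the action
    name: Label PRs
    on: pull_request
    jobs:
      label:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/labeler@v2
            with:
              repo-token: "${{ secrets.GITHUB_TOKEN }}"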


Re: [DISCUSS] filling affected versions on JIRA issue

2020-04-02 Thread Jungtaek Lim
On Fri, Apr 3, 2020 at 12:31 AM Sean Owen  wrote:

> On Wed, Apr 1, 2020 at 10:28 PM Jungtaek Lim
>  wrote:
> > The definition of "latest version" matters, especially when we are
> > preparing a minor+ version release.
> >
> > For example, lots of people (even including committers) filed an
> > "improvement" issue with the fix version set to 3.0, which is NOT
> > incorrect in terms of the "release", but is incorrect in terms of the
> > "master branch" version. If we call it the "latest" version, maybe those
> > should not even be set to 3.0. It still seems to confuse people; if we
> > really want to require the field, we need to make clear which version it
> > should be, and document that.
>
> OK shall we simply say, tag it with whatever version you were using
> when you found the bug? If the reporter or anyone else knows it
> affects other versions, sure, add that. Point being, I don't think we
> should ask people to investigate which N versions it affects, unless
> it's particularly vital to the nature of the issue.
>

That paragraph was aimed at "new feature" / "improvement" issues, not at
bugs. I guess we already have consensus for bugs.


> For improvements, it matters less, and simply saying 'the latest
> version' is a fine default.
>

I meant that 'the latest version' is itself a confusing term. Someone not
closely involved in Spark development would read it as 'the latest released
version'. Even for someone who works on Spark development it causes
confusion: when we cut a branch for a minor+ version, the master branch
moves up to (unreleased version + 1), yet some of us may think of the
latest version as the minor+ version we are targeting for release.

We would avoid the confusion if we decided on a definition of 'the latest
version', defined a standard way to determine it, and documented that on
the contribution guide page. Otherwise we will keep having different
understandings, guiding people toward different versions, or even trying
to correct each other.


> Is there another desired standard out there that we're debating
> against? so far this sounds like existing practice.
>

The existing practice is not documented and causes confusion - 3.1.0 vs.
3.0.0 right now. Though I believe leaving it empty, or N/A if the field is
required, would be even better.


> > Also, I'm not in favor of bumping the affected version on existing
> > improvement issues when the minor+ version is bumped. As I said, I'm not
> > sure we get any benefit from that. Moreover, once batch updates are
> > executed, lots of notifications land on issue@ and those issues jump to
> > the top of the mail inbox, even though technically nothing has actually
> > changed. I'd rather say we should do the opposite: don't update it, so
> > we keep the context of which version it was considered against.
>
> What is this referring to - have there been batch updates of the affected
> version? That could be fine, but for what reason?
> You can disable sending email for bulk updates in JIRA, if that's the
> issue.
>

If you subscribe to issue@ then you will have noticed the bulk updates
after the branch for 3.0 was cut. (That's not something I can control,
short of unsubscribing.) I didn't do the bulk update myself - as I said,
I'm not in favor of updating the version.

And the bulk mail is just a side effect, of course, not the main issue.
The main point is the benefit: personally I don't see any value in doing
that. If we are on the same page, then why do we do it? And if someone
objects, isn't that exactly the thing we need to discuss?

> > I'm assuming we would require the affected version even for non-bug
> > issues, but yes, if possible I'd be in favor of leaving it empty. In any
> > case, let's document it explicitly.
>
> Agree. I think it's only required in JIRA because we can't make it
> required for Bugs but not for Improvements, though?
>

There's "N/A" which could simply play as a dummy marker which would avoid
confusion at any time. I'm not expert on JIRA automation (especially ASF
JIRA) but we might be able to ask ASF INFRA to do it automatically for the
some types. Even it can't be automated, still clearer to set the version.
WDYT?


Beginner PR against the Catalog API

2020-04-02 Thread Nicholas Chammas
I recently submitted my first Scala PR. It's very simple, though I don't
know if I've done things correctly since I'm not a regular Scala user.

SPARK-31000: Add ability to set table description in the catalog

https://github.com/apache/spark/pull/27908
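
For context, a hypothetical sketch of the kind of call the PR enables,
assuming an existing SparkSession named spark; the parameter names and
ordering here are illustrative only, and the PR itself defines the real
signature:

    import org.apache.spark.sql.types.StructType

    // Create a table and attach a human-readable description in one call.
    spark.catalog.createTable(
      tableName = "events",
      source = "parquet",
      schema = new StructType().add("id", "long").add("ts", "timestamp"),
      description = "Raw event log, ingested daily",
      options = Map("path" -> "/tmp/events")
    )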

Would someone be able to take a look at it and give me some feedback?

Nick


Re: Automatic PR labeling

2020-04-02 Thread Nicholas Chammas
SPARK-31330: Automatically label PRs based on the paths they touch

On Wed, Apr 1, 2020 at 11:34 PM Hyukjin Kwon  wrote:

> @Nicholas Chammas  Would you be interested in
> taking a look? I would love for this to be done.
>
> On Wed, Mar 25, 2020 at 10:30 AM, Hyukjin Kwon wrote:
>
>> That would be cool. There was a bit of discussion about which account
>> should do the labeling. If we can replace it, I think that sounds great!
>>
>> On Wed, Mar 25, 2020 at 5:08 AM, Nicholas Chammas wrote:
>>
>>> Public Service Announcement: There is a GitHub action that lets you
>>> automatically label PRs based on what paths they modify.
>>>
>>> https://github.com/actions/labeler
>>>
>>> If we set this up, perhaps down the line we can update the PR dashboard
>>> and PR merge script to use the tags.
>>>
>>> cc @Dongjoon Hyun , who may be interested in
>>> this.
>>>
>>> Nick
>>>
>>


Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-02 Thread Sean Owen
(If it wasn't stated explicitly, yeah I think we knew there were a few
important unresolved issues and that this RC was going to fail. Let's
all please test anyway of course, to flush out any additional issues,
rather than wait. Pipelining and all that.)

On Thu, Apr 2, 2020 at 10:31 AM Maxim Gekk  wrote:
>
> -1 (non-binding)
>
> The problem of compatibility with Spark 2.4 in reading/writing
> dates/timestamps hasn't been completely solved yet. In particular, the
> sub-task https://issues.apache.org/jira/browse/SPARK-31328 hasn't been
> resolved yet.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>
>
>
> On Wed, Apr 1, 2020 at 7:09 PM Ryan Blue  wrote:
>>
>> -1 (non-binding)
>>
>> I agree with Jungtaek. The change to create datasource tables instead of 
>> Hive tables by default (no USING or STORED AS clauses) has created confusing 
>> behavior and should either be rolled back or fixed before 3.0.
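
For readers following along: the statement at issue is a plain CREATE TABLE
with no USING or STORED AS clause. A sketch, assuming a SparkSession named
spark and a hypothetical table name:

    // No USING or STORED AS clause: with the 3.0 change this creates a
    // native datasource table (default provider, e.g. parquet), where
    // Spark 2.4 created a Hive text-format table.
    spark.sql("CREATE TABLE t (id INT)")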




Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-02 Thread Maxim Gekk
-1 (non-binding)

The problem of compatibility with Spark 2.4 in reading/writing
dates/timestamps hasn't been completely solved yet. In particular, the
sub-task https://issues.apache.org/jira/browse/SPARK-31328 hasn't been
resolved yet.
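
To make the issue concrete: Spark 2.4 writes dates on the hybrid
Julian/Gregorian calendar, while Spark 3.0 uses the proleptic Gregorian
calendar, so dates before the 1582 cutover can shift by several days across
versions. A minimal sketch of the round trip at stake, with a hypothetical
path and assuming a SparkSession named spark:

    import spark.implicits._

    // Write a pre-Gregorian date with one Spark version...
    Seq(java.sql.Date.valueOf("1000-01-01")).toDF("d")
      .write.mode("overwrite").parquet("/tmp/ancient-dates")

    // ...then read it back with the other version: the date can come
    // back shifted by several days.
    spark.read.parquet("/tmp/ancient-dates").show()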

Maxim Gekk

Software Engineer

Databricks, Inc.


On Wed, Apr 1, 2020 at 7:09 PM Ryan Blue  wrote:

> -1 (non-binding)
>
> I agree with Jungtaek. The change to create datasource tables instead of
> Hive tables by default (no USING or STORED AS clauses) has created
> confusing behavior and should either be rolled back or fixed before 3.0.
>
> On Wed, Apr 1, 2020 at 5:12 AM Sean Owen  wrote:
>
>> Those are not per se release blockers. They are (perhaps important)
>> improvements to functionality. I don't know who is active and able to
>> review that part of the code; I'd look for authors of changes in the
>> surrounding code. The question here isn't so much what one would like
>> to see in this release, but evaluating whether the release is sound
>> and free of show-stopper problems. There will always be potentially
>> important changes and fixes to come.
>>
>> On Wed, Apr 1, 2020 at 5:31 AM Dr. Kent Yao  wrote:
>> >
>> > -1
>> > Do not release this package, because v3.0.0 is the third major release
>> > since we added Spark on Kubernetes. Can we make it more
>> > production-ready, given that it has been experimental for more than 2
>> > years?
>> >
>> > The main practical adoption of Spark on Kubernetes is to take over the
>> > role of other cluster managers (mainly YARN), while the storage layer
>> > (mainly HDFS) would most likely be kept anyway. But Spark on Kubernetes
>> > with HDFS does not seem to work properly.
>> >
>> > e.g.
>> > This ticket and PR were submitted 7 months ago and never got reviewed:
>> > https://issues.apache.org/jira/browse/SPARK-29974
>> > https://issues.apache.org/jira/browse/SPARK-28992
>> > https://github.com/apache/spark/pull/25695
>> >
>> > And this:
>> > https://issues.apache.org/jira/browse/SPARK-28896
>> > https://github.com/apache/spark/pull/25609
>> >
>> > In terms of how often this module is updated, it seems stable. But in
>> > terms of how often PRs for this module are reviewed, it seems it will
>> > stay experimental for a long time.
>> >
>> > Thanks.
>> >
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [DISCUSS] filling affected versions on JIRA issue

2020-04-02 Thread Sean Owen
On Wed, Apr 1, 2020 at 10:28 PM Jungtaek Lim
 wrote:
> The definition of "latest version" matters, especially when we are
> preparing a minor+ version release.
>
> For example, lots of people (even including committers) filed an
> "improvement" issue with the fix version set to 3.0, which is NOT
> incorrect in terms of the "release", but is incorrect in terms of the
> "master branch" version. If we call it the "latest" version, maybe those
> should not even be set to 3.0. It still seems to confuse people; if we
> really want to require the field, we need to make clear which version it
> should be, and document that.

OK shall we simply say, tag it with whatever version you were using
when you found the bug? If the reporter or anyone else knows it
affects other versions, sure, add that. Point being, I don't think we
should ask people to investigate which N versions it affects, unless
it's particularly vital to the nature of the issue.

For improvements, it matters less, and simply saying 'the latest
version' is a fine default.

Is there another desired standard out there that we're debating
against? so far this sounds like existing practice.


> Also, I'm not in favor of bumping the affected version on existing
> improvement issues when the minor+ version is bumped. As I said, I'm not
> sure we get any benefit from that. Moreover, once batch updates are
> executed, lots of notifications land on issue@ and those issues jump to
> the top of the mail inbox, even though technically nothing has actually
> changed. I'd rather say we should do the opposite: don't update it, so we
> keep the context of which version it was considered against.

What is this referring to - have there been batch updates of the affected
version? That could be fine, but for what reason?
You can disable sending email for bulk updates in JIRA, if that's the issue.


> I'm assuming we would require the affected version even for non-bug
> issues, but yes, if possible I'd be in favor of leaving it empty. In any
> case, let's document it explicitly.

Agree. I think it's only required in JIRA because we can't make it
required for Bugs but not for Improvements, though?
