Re: Automatic PR labeling

2020-04-01 Thread Hyukjin Kwon
@Nicholas Chammas  Would you be interested in
taking a look? I would love for this to be done.
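
For anyone curious, the setup Nicholas describes below takes roughly two
files. A minimal sketch, with illustrative label names and path globs
rather than a concrete proposal for Spark's actual label set:

# .github/labeler.yml -- maps each label to the paths that trigger it
SQL:
  - sql/**/*
PYTHON:
  - python/**/*

# .github/workflows/labeler.yml -- runs the action on each pull request
name: Label PRs
on: pull_request
jobs:
  triage:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/labeler@v2
        with:
          repo-token: "${{ secrets.GITHUB_TOKEN }}"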

On Wed, Mar 25, 2020 at 10:30 AM, Hyukjin Kwon wrote:

> That would be cool. There was a bit of discussion about which account
> should do the labeling. If we can replace it, I think it sounds great!
>
> On Wed, Mar 25, 2020 at 5:08 AM, Nicholas Chammas wrote:
>
>> Public Service Announcement: There is a GitHub action that lets you
>> automatically label PRs based on what paths they modify.
>>
>> https://github.com/actions/labeler
>>
>> If we set this up, perhaps down the line we can update the PR dashboard
>> and PR merge script to use the tags.
>>
>> cc @Dongjoon Hyun , who may be interested in
>> this.
>>
>> Nick
>>
>


Re: [DISCUSS] filling affected versions on JIRA issue

2020-04-01 Thread Jungtaek Lim
I intentionally didn't point out the actual case, because I want to avoid
unnecessary debate and make sure we don't decide with bias. Note that the
context would include people.

I have seen these requests consistently (at least consistently for 1, and
I feel I've also seen 2 more than a couple of times), so let's just treat
them as a "trend" and discuss the matter in general in this mail thread. We
know there are exceptional cases, and in such cases we'd rather discuss it
there.

> It's not wrong to say it affects the latest version, at least.
> And I believe we require it in JIRA because we can't require an
> Affects Version for one type of issue but not another. So, just asking
> people to default to 'latest version' there is no burden.

The definition of "latest version" matters, especially when we are
preparing a minor+ version release.

For example, lots of people (even including committers) filed an
"improvement" issue with the fix version set to 3.0, which is NOT incorrect
from the point of view of the "release", but incorrect with respect to the
version of the "master branch". If we call it the "latest" version, maybe
it should not even be set to 3.0. It looks like this still confuses people;
we need to make clear which version it should be if we really want to
require it, and it should be documented.

Also, I'm not in favor of bumping the affects version on existing
improvement issues when we bump the minor+ version. As I said, I'm not sure
we get any benefit from that. What's more, once batch updates are executed,
lots of notifications land in issue@ and these issues jump to the top of
the mail inbox, even though technically they have no actual update. I'd
rather we do the opposite: don't update it, so that it preserves the
context of which version was considered.

I'm assuming that we have to require the affects version even for non-bug
issues, but yes, if possible I'd be in favor of leaving it empty. Either
way, let's document it explicitly.

For bugs I guess we are on the same page - there are some details to
settle, but in general we don't require checking old versions. I'd be OK
with checking against the "latest" active version branches to evaluate the
possibility of a backport, but that's all. Even for security/correctness
issues we may want to define how far back to check - do we want to check
these issues against EOL version lines? If yes, even 1.x?


On Thu, Apr 2, 2020 at 11:43 AM Nicholas Chammas wrote:

> Probably the discussion here about Improvement Jira tickets and the
> "Affects Version" field:
> https://github.com/apache/spark/pull/27534#issuecomment-588416416

Re: [DISCUSS] filling affected versions on JIRA issue

2020-04-01 Thread Nicholas Chammas
Probably the discussion here about Improvement Jira tickets and the
"Affects Version" field:
https://github.com/apache/spark/pull/27534#issuecomment-588416416

On Wed, Apr 1, 2020 at 9:59 PM Hyukjin Kwon  wrote:

> Yes, was there a particular case or context that motivated this thread?


Re: [DISCUSS] filling affected versions on JIRA issue

2020-04-01 Thread Hyukjin Kwon
> 2) check against older versions to fill in the affects version for bugs
I don't agree with this in general. To me it's usually "For the type of
bug, assign one valid version" instead.

> The only place where I can see some amount of investigation being
required would be for security issues or correctness issues.
Yes, I agree.

Yes, was there a particular case or context that motivated this thread?



Re: [DISCUSS] filling affected versions on JIRA issue

2020-04-01 Thread Mridul Muralidharan
I agree with what Sean detailed.
The only place where I can see some amount of investigation being required
would be for security issues or correctness issues.
Knowing the affected versions, particularly if an earlier supported version
does not have the bug, will help users understand the broken/insecure
versions.

Regards,
Mridul




Re: [DISCUSS] filling affected versions on JIRA issue

2020-04-01 Thread Sean Owen
I think we discussed this briefly on a PR.

It's not as clear what it means for an Improvement to 'affect a
version'. Certainly, an improvement to a feature introduced in 1.2.3
can't affect anything earlier, and implicitly affects everything
after. It's not wrong to say it affects the latest version, at least.
And I believe we require it in JIRA because we can't require an
Affects Version for one type of issue but not another. So, just asking
people to default to 'latest version' there is no burden.

I would not ask someone to figure out all the versions, or the earliest
version, that an Improvement applies to; it just isn't that useful. We
aren't generally going to back-port improvements anyway.

Even for bugs, we don't really need to know that a bug in master
affects 2.4.5, 2.4.4, 2.4.3, ... 2.3.6, 2.3.5, etc. It doesn't hurt to
at least say it affects the latest 2.4.x, 2.3.x releases, if known,
because it's possible it should be back-ported. Again, even where this is
significantly more useful, I'm not in favor of telling people they must
test the bug report against previous releases.

So, if you're asserting that the current guidance is OK, I generally agree.
Is there a particular context where this was questioned? Maybe we
should examine the particulars of that situation. As in all things,
context matters.

Sean




[DISCUSS] filling affected versions on JIRA issue

2020-04-01 Thread Jungtaek Lim
Hi devs,

I know we're busy getting Spark 3.0 out, but I think this topic is good to
discuss at any time, and it would actually be better to resolve it sooner
rather than later.

On the "Contributing to Spark" page, we describe the guidance for "affects
version" as: "For Bugs, assign at least one version that is known to
exhibit the problem or need the change".

For me, that sentence clearly describes the minimal requirement for the
affects version:

* For the type of bug, assign one valid version
* For other types, there's no requirement

but I'm seeing requests that go beyond that requirement, which makes me
think there might be different understandings of the sentence. Maybe there
are more, but to summarize such requests:

1) set the affects version to match the master branch for improvements/new
features
2) check against older versions to fill in the affects version for bugs

I don't see any point in doing 1). It might give some context if we didn't
update the affects version (so that it could tell us which version was
considered when the JIRA issue was filed), but we also update the affects
version when we bump the master branch version, so it is no longer
informational: the value would always just match the master branch.

I agree it's ideal to do 2), but I think the reason the guide doesn't
enforce it is that it takes quite a lot of effort to check against old
versions (sometimes even more than the original work).

Suppose the happy case: we have a UT to verify the bugfix, one which fails
without the patch and passes with it. To check against older versions we
have to check out the tag, apply the UT, rebuild, and run the UT to verify,
which is pretty time-consuming. And what if there's a conflict? That's
still the happy case; in the worse case (there's no such UT) we would have
to do manual E2E verification, which I would give up on.
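
For concreteness, the "happy case" UT above might look like the sketch
below (the ticket number, suite name, and tested behavior are hypothetical
placeholders, not a real Spark test). Checking an old release then means
checking out its tag, applying this test, rebuilding, and rerunning it:

import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession

// A regression test encoding the expected behavior: it fails on a tag that
// exhibits the (hypothetical) bug and passes once the patch is applied.
class BugfixRegressionSuite extends QueryTest with SharedSparkSession {
  test("SPARK-XXXXX: symptom that the patch fixes") {
    val df = spark.range(0, 10, 1, 20).toDF("id")
    checkAnswer(df.selectExpr("count(*)"), Seq(Row(10L)))
  }
}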

There should be some balance/threshold, and that balance should be
something the community has a consensus on.

Would like to hear everyone's voice on this.

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-01 Thread Ryan Blue
-1 (non-binding)

I agree with Jungtaek. The change to create datasource tables instead of
Hive tables by default (no USING or STORED AS clauses) has created
confusing behavior and should either be rolled back or fixed before 3.0.
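
For anyone catching up, a sketch of the ambiguity (table names are
arbitrary; the default for the first statement is exactly what is being
debated):

// With no USING or STORED AS clause, the 3.0 RC resolves this to a native
// datasource table, where Spark 2.x created a Hive table:
spark.sql("CREATE TABLE t (a INT, b STRING)")

// Explicit clauses remain unambiguous in both versions:
spark.sql("CREATE TABLE t_ds (a INT) USING parquet")
spark.sql("CREATE TABLE t_hive (a INT) STORED AS parquet")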


-- 
Ryan Blue
Software Engineer
Netflix


Re: Need to order iterator values in spark dataframe

2020-04-01 Thread Ranjan, Abhinav

Enrico,

The below solution works, but there is a little glitch.

It is working fine in spark-shell but failing for skewed keys when doing a
spark-submit.


While looking into the execution plan, I see the partitioning is the same
for both repartition and groupByKey and is driven by the value of
"spark.sql.shuffle.partitions",

like: Exchange hashpartitioning(value#143, 200)
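
(For anyone reproducing this, a quick sketch of how to compare the two
environments, using the ds from Enrico's suggestion below:

ds.repartition($"key").sortWithinPartitions($"code").explain()
spark.conf.get("spark.sql.shuffle.partitions")  // "200" unless overridden

Both runs should show the same Exchange hashpartitioning on the key.)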

Any ideas on why skewed keys give wrong output while the same code gives
correct results in spark-shell?



--Abhinav

On 26/03/20 10:54 pm, Enrico Minack wrote:


Abhinav,

you can repartition by your key, then sortWithinPartitions, and then
groupByKey. Since the data are already hash-partitioned by key, Spark
should not shuffle the data and hence should not change the sort within
each partition:


ds.repartition($"key").sortWithinPartitions($"code").groupBy($"key")

Enrico
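
For what it's worth, the classic secondary-sort pattern at the RDD level
makes the ordering an explicit part of the shuffle instead of relying on
groupBy to preserve it. A sketch against the key_class schema from the
original mail (the partitioner class and the partition count of 200 are
illustrative):

import org.apache.spark.Partitioner

// Partition by the grouping key only, while sorting each partition by the
// composite (key, code): all rows of a key land in the same partition,
// contiguous and already ordered by code.
class KeyOnlyPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(compositeKey: Any): Int = {
    val (key, _) = compositeKey.asInstanceOf[(String, String)]
    ((key.hashCode % numPartitions) + numPartitions) % numPartitions
  }
}

val sortedWithinKeys = df.as[key_class].rdd
  .map(r => ((r.key, r.code), r.code_value))  // composite key drives the sort
  .repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(200))

Each partition can then be streamed with mapPartitions: rows of one key are
adjacent and already ordered by code, so the per-key logic only has to
watch for the key changing between consecutive rows.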


On 26.03.20 at 17:53, Ranjan, Abhinav wrote:


Hi,

I have a dataframe which has data like:

key | code | code_value
1   | c1   | 11
1   | c2   | 12
1   | c2   | 9
1   | c3   | 12
1   | c2   | 13
1   | c2   | 14
1   | c4   | 12
1   | c2   | 15
1   | c1   | 12


I need to group the data based on key and then apply some custom logic to
each of the values I got by grouping. So I did this:


Let's suppose it is in a dataframe df.

case class key_class(key: String, code: String, code_value: String)


df.as[key_class]
  .groupByKey(_.key)
  .mapGroups { (key, groupedValues) =>
    val status = groupedValues.map { row =>
      // do some custom logic on row
      "SUCCESS"
    }.toList
    status  // return the per-group result so toDF has a value column
  }
  .toDF("status")


The issue with the above approach is that the values I get after applying
groupByKey are not sorted/ordered. I want the values to be sorted by the
column 'code'.


There are a few ways to do this:

1. Get them into a list and then sort ==> this will result in OOM if the
iterator is too big.


2. Somehow apply a secondary sort, but the problem with that approach is
that I have to keep track of the key change.


3. sortWithinPartitions cannot be applied because groupBy will mess 
up the order.


4. Another approach is:

df.as[key_class]
  .sort("key", "code")  // one compound sort; chaining .sort twice keeps only the last ordering
  .map {
    // do stuff here
  }

but here also I have to keep track of the key change within the map
function, and sometimes this also overflows if the keys are skewed.



So is there any way in which I can get the values sorted after grouping
them by a key?


Thanks,

Abhinav





Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-01 Thread Sean Owen
Those are not per se release blockers. They are (perhaps important)
improvements to functionality. I don't know who is active and able to
review that part of the code; I'd look for authors of changes in the
surrounding code. The question here isn't so much what one would like
to see in this release, but evaluating whether the release is sound
and free of show-stopper problems. There will always be potentially
important changes and fixes to come.




Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-01 Thread Dr. Kent Yao
-1
Do not release this package because v3.0.0 is the third major release
since we added Spark on Kubernetes. Can we make it more production-ready,
given it has been experimental for more than 2 years?

The main practical adoption of Spark on Kubernetes is to take on the role
of other cluster managers (mainly YARN). And the storage layer (mainly
HDFS) would most likely be kept anyway. But Spark on Kubernetes with HDFS
does not seem to work properly.

e.g., these tickets and the PR were submitted 7 months ago and never got
reviewed:
https://issues.apache.org/jira/browse/SPARK-29974
https://issues.apache.org/jira/browse/SPARK-28992
https://github.com/apache/spark/pull/25695

And this.
https://issues.apache.org/jira/browse/SPARK-28896
https://github.com/apache/spark/pull/25609

In terms of how often this module is updated, it seems to be stable. 
But in terms of how often PRs for this module are reviewed, it seems that it
will stay experimental for a long time.

Thanks.



