Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-18 Thread Hyukjin Kwon
I struggled hard to deal with this issue multiple times over the past year,
and thankfully we finally decided to use the official version of Hive 2.3.x
as well (thank you, Yuming, Alan, and everyone).
I think it is already huge progress that we have started to use the official
version of Hive.

I think we should allow at least one minor release cycle for users to test
out Spark with Hive 2.3.x before switching it to the default. My impression
is that this was the decision made earlier at:
http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Upgrade-built-in-Hive-to-2-3-4-td26153.html

How about we try to make it the default in Spark 3.1, using this thread as a
reference? Doing it right now seems too radical a change.


On Tue, Nov 19, 2019 at 2:11 PM, Dongjoon Hyun wrote:

> Hi, All.
>
> First of all, I want to put this as a policy issue instead of a technical
> issue.
> Also, this is orthogonal to the `hadoop` version discussion.
>
> The Apache Spark community has kept (not maintained) the forked Apache Hive
> 1.2.1 because there was no other option before. As we can see in
> SPARK-20202, it's not a desirable situation among Apache projects.
>
> https://issues.apache.org/jira/browse/SPARK-20202
>
> Also, please note that we say `kept`, not `maintained`, because we know
> it's not good.
> There have been several attempts to update that forked repository
> for various reasons (Hadoop 3 support is one example),
> but those attempts were also turned down.
>
> From Apache Spark 3.0, it seems that we have a new feasible option, the
> `hive-2.3` profile. What about moving further in this direction?
>
> For example, can we officially and completely remove the usage of the
> forked `hive` in Apache Spark 3.0? If someone still needs to use the forked
> `hive`, we can have a `hive-1.2` profile. Of course, it should not be the
> default profile in the community.
>
> I want to say this is a goal we should achieve someday.
> If we don't do anything, nothing will happen; at the very least, we need to
> prepare for this. Without any preparation, Spark 3.1+ will be the same.
>
> Shall we focus on what our actual problems with Hive 2.3.6 are?
> If the only concern is that we haven't used it before, we can release
> another `3.0.0-preview` for that.
>
> Bests,
> Dongjoon.
>


Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-18 Thread Dongjoon Hyun
Hi, All.

First of all, I want to put this as a policy issue instead of a technical
issue.
Also, this is orthogonal to the `hadoop` version discussion.

The Apache Spark community has kept (not maintained) the forked Apache Hive
1.2.1 because there was no other option before. As we can see in
SPARK-20202, it's not a desirable situation among Apache projects.

https://issues.apache.org/jira/browse/SPARK-20202

Also, please note that we say `kept`, not `maintained`, because we know it's
not good.
There have been several attempts to update that forked repository
for various reasons (Hadoop 3 support is one example),
but those attempts were also turned down.

From Apache Spark 3.0, it seems that we have a new feasible option, the
`hive-2.3` profile. What about moving further in this direction?

For example, can we officially and completely remove the usage of the forked
`hive` in Apache Spark 3.0? If someone still needs to use the forked `hive`,
we can have a `hive-1.2` profile. Of course, it should not be the default
profile in the community.

I want to say this is a goal we should achieve someday.
If we don't do anything, nothing will happen; at the very least, we need to
prepare for this. Without any preparation, Spark 3.1+ will be the same.

Shall we focus on what our actual problems with Hive 2.3.6 are?
If the only concern is that we haven't used it before, we can release another
`3.0.0-preview` for that.

Bests,
Dongjoon.


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-18 Thread Dongjoon Hyun
I also agree with Steve and Felix.

Let's have another thread to discuss the Hive issue,
because this thread was originally about the `hadoop` version.

And now we can have the `hive-2.3` profile for both the `hadoop-2.7` and
`hadoop-3.0` versions.

We don't need to mix the two.

Bests,
Dongjoon.


On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung 
wrote:

> 1000% with Steve: the org.spark-project hive 1.2 will need a solution. It
> is old and rather buggy, and it's been *years*.
>
> I think we should decouple the hive change from everything else if people
> are concerned?
>
> --
> *From:* Steve Loughran 
> *Sent:* Sunday, November 17, 2019 9:22:09 AM
> *To:* Cheng Lian 
> *Cc:* Sean Owen ; Wenchen Fan ;
> Dongjoon Hyun ; dev ;
> Yuming Wang 
> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>
> Can I take this moment to remind everyone that the version of hive which
> spark has historically bundled (the org.spark-project one) is an orphan
> project, put together to deal with Hive's shading issues, and a source of
> unhappiness in the Hive project. Whatever gets shipped should do its best
> to avoid including that file.
>
> Postponing a switch to hadoop 3.x until after spark 3.0 is probably the
> safest move from a risk-minimisation perspective. If something has broken,
> you can start with the assumption that it is in the o.a.s packages without
> having to debug o.a.hadoop and o.a.hive first. There is a cost: if there
> are problems with the hadoop / hive dependencies, those teams will
> inevitably ignore filed bug reports, for the same reason the spark team
> will probably close 1.6-related JIRAs as WONTFIX. WONTFIX responses for the
> Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that
> in mind: it's not been tested, it has dependencies on artifacts we know are
> incompatible, and as far as the Hadoop project is concerned, people should
> move to branch 3 if they want to run on a modern version of Java.
>
> It would be really, really good if the published spark maven artefacts (a)
> included the spark-hadoop-cloud JAR and (b) depended on hadoop 3.x. That
> way, people doing things with their own projects will get up-to-date
> dependencies and won't get WONTFIX responses themselves.
>
> -Steve
>
> PS: There is a discussion on hadoop-dev about making Hadoop 2.10 the
> official "last ever" branch-2 release and then declaring its predecessors
> EOL; 2.10 will be the transition release.
>
> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian  wrote:
>
> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
> seemed risky, and therefore we only introduced Hive 2.3 under the
> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
> here...
>
> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that
> Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
> about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11
> upgrade together looks too risky.
>
> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen  wrote:
>
> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
> than introduce yet another build combination. Does Hadoop 2 + Hive 2
> work and is there demand for it?
>
> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan  wrote:
> >
> > Do we have a limitation on the number of pre-built distributions? It
> > seems this time we need:
> > 1. hadoop 2.7 + hive 1.2
> > 2. hadoop 2.7 + hive 2.3
> > 3. hadoop 3 + hive 2.3
> >
> > AFAIK we always build with JDK 8 (but make it JDK 11 compatible), so we
> > don't need to add the JDK version to the combination.
> >
> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun 
> wrote:
> >>
> >> Thank you for the suggestion.
> >>
> >> Having a `hive-2.3` profile sounds good to me because it's orthogonal to
> >> Hadoop 3.
> >> IIRC, it was originally proposed that way, but we put it under
> >> `hadoop-3.2` to avoid adding new profiles at that time.
> >>
> >> And I'm wondering if you are considering additional pre-built
> >> distributions and Jenkins jobs.
> >>
> >> Bests,
> >> Dongjoon.
> >>
>
>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-18 Thread Felix Cheung
1000% with Steve: the org.spark-project hive 1.2 will need a solution. It is
old and rather buggy, and it's been *years*.

I think we should decouple the hive change from everything else if people are
concerned?


From: Steve Loughran 
Sent: Sunday, November 17, 2019 9:22:09 AM
To: Cheng Lian 
Cc: Sean Owen ; Wenchen Fan ; Dongjoon 
Hyun ; dev ; Yuming Wang 

Subject: Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Can I take this moment to remind everyone that the version of hive which
spark has historically bundled (the org.spark-project one) is an orphan
project, put together to deal with Hive's shading issues, and a source of
unhappiness in the Hive project. Whatever gets shipped should do its best to
avoid including that file.

Postponing a switch to hadoop 3.x until after spark 3.0 is probably the
safest move from a risk-minimisation perspective. If something has broken,
you can start with the assumption that it is in the o.a.s packages without
having to debug o.a.hadoop and o.a.hive first. There is a cost: if there are
problems with the hadoop / hive dependencies, those teams will inevitably
ignore filed bug reports, for the same reason the spark team will probably
close 1.6-related JIRAs as WONTFIX. WONTFIX responses for the Hadoop 2.x line
include any compatibility issues with Java 9+. Do bear that in mind: it's not
been tested, it has dependencies on artifacts we know are incompatible, and
as far as the Hadoop project is concerned, people should move to branch 3 if
they want to run on a modern version of Java.

It would be really, really good if the published spark maven artefacts (a)
included the spark-hadoop-cloud JAR and (b) depended on hadoop 3.x. That way,
people doing things with their own projects will get up-to-date dependencies
and won't get WONTFIX responses themselves.

-Steve

PS: There is a discussion on hadoop-dev about making Hadoop 2.10 the official
"last ever" branch-2 release and then declaring its predecessors EOL; 2.10
will be the transition release.

On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian wrote:
Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I thought 
the original proposal was to replace Hive 1.2 with Hive 2.3, which seemed 
risky, and therefore we only introduced Hive 2.3 under the hadoop-3.2 profile 
without removing Hive 1.2. But maybe I'm totally wrong here...

Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that Hadoop 
2 + Hive 2 + JDK 11 looks promising. My major motivation is not about demand, 
but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11 upgrade together 
looks too risky.

On Sat, Nov 16, 2019 at 4:03 AM Sean Owen wrote:
I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
than introduce yet another build combination. Does Hadoop 2 + Hive 2
work and is there demand for it?

On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan wrote:
>
> Do we have a limitation on the number of pre-built distributions? It seems
> this time we need:
> 1. hadoop 2.7 + hive 1.2
> 2. hadoop 2.7 + hive 2.3
> 3. hadoop 3 + hive 2.3
>
> AFAIK we always build with JDK 8 (but make it JDK 11 compatible), so we
> don't need to add the JDK version to the combination.
>
> On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun wrote:
>>
>> Thank you for the suggestion.
>>
>> Having a `hive-2.3` profile sounds good to me because it's orthogonal to
>> Hadoop 3.
>> IIRC, it was originally proposed that way, but we put it under
>> `hadoop-3.2` to avoid adding new profiles at that time.
>>
>> And I'm wondering if you are considering additional pre-built
>> distributions and Jenkins jobs.
>>
>> Bests,
>> Dongjoon.
>>


Re: Adding JIRA ID as the prefix for the test case name

2019-11-18 Thread Hyukjin Kwon
Let me document it as below in a few days, assuming everybody is happy:

1. For Python and Java, write a single comment that starts with the JIRA ID
and a short description, e.g., (SPARK-X: test blah blah).
2. For R, use the JIRA ID as a prefix for the test name.
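
For illustration, a minimal sketch of guideline (1) for a Java test (JUnit 4
style); the suite name, method name, and JIRA number below are made up:

    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    // Hypothetical suite: the guideline only concerns the JIRA-ID comment
    // inside the test body, not the class or method names.
    public class ExampleJavaSuite {
      @Test
      public void testReadAfterWrite() {
        // SPARK-12345: reads returned stale values after a concurrent write.
        assertEquals(2, 1 + 1);
      }
    }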

On Mon, Nov 18, 2019 at 11:36 AM, Hyukjin Kwon wrote:

> Actually, there are not so many Java test cases in Spark (because Scala
> runs on the JVM, as everybody knows) [1].
>
> Given that, I think we can avoid putting much effort into this for now. I
> don't mind if somebody wants to give it a shot, since it looks good anyway,
> but I personally wouldn't spend so much time on this.
>
> Let me just go ahead as I suggested, if you don't mind. Anyone can give
> Display Name a shot - I'm willing to actively review and help.
>
> [1]
> git ls-files '*Suite.java' | wc -l
>  172
> git ls-files '*Suite.scala' | wc -l
> 1161
>
On Mon, Nov 18, 2019 at 3:27 AM, Steve Loughran wrote:
>
>> Test reporters do often contain some assumptions about the characters in
>> the test methods. Historically, JUnit XML reporters have never sanitised
>> the method names, so XML injection attacks have been fairly trivial.
>> Haven't tried this for a while.
>>
>> That whole JUnit XML report "standard" was actually put together in the
>> Ant project, with the <junitreport> task doing the postprocessing of the
>> JUnit run. It was driven more by the team's XSL skills than by any
>> overarching strategic goal about how to present the results of tests which
>> could run for hours, and whose logs you would really want to aggregate
>> from multiple machines and processes and present in a way you can actually
>> navigate. With hindsight, a key failing is that we chose to store the test
>> summaries (test count, failure count...) as attributes on the root XML
>> node. Which is why the whole DOM gets built up in the JUnit runner. Which
>> is why, when that JUnit process crashes, you get no report at all.
>>
>> It'd be straightforward to fix, except too much relies on that file
>> now... important things will break. And the Maven runner has historically
>> never supported custom reporters to let you experiment with it.
>>
>> Maybe this is an opportunity to change things.
>>
>> On Sun, Nov 17, 2019 at 1:42 AM Hyukjin Kwon  wrote:
>>
>>> DisplayName looks good in general, but here I would first like to find
>>> an existing pattern to document in the guidelines, given the actual
>>> existing practice we are all used to. I'm trying to be very conservative
>>> since these guidelines affect everybody.
>>>
>>> I think it might be better to discuss separately whether we want to
>>> change what we have been used to.
>>>
>>> Also, using arbitrary names might not actually be free, due to bugs like
>>> https://github.com/apache/spark/pull/25630 . It will need some more
>>> effort to investigate as well.
>>>
>>> On Fri, 15 Nov 2019, 20:56 Steve Loughran, 
>>> wrote:
>>>
  Junit5: Display names.

 Goes all the way to the XML.


 https://junit.org/junit5/docs/current/user-guide/#writing-tests-display-names
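
 As a concrete illustration of the display-name feature linked above, here is
 a minimal sketch; the class, tests, and JIRA ID are made up, and
 junit-jupiter-api is assumed to be on the test classpath:

     import org.junit.jupiter.api.DisplayName;
     import org.junit.jupiter.api.Test;
     import static org.junit.jupiter.api.Assertions.assertEquals;

     // Hypothetical suite: display names replace class/method names in IDEs
     // and in reporters that support them, and may carry arbitrary text.
     @DisplayName("SPARK-12345: an arbitrary, human-readable suite name")
     class DisplayNameExampleTest {
       @Test
       @DisplayName("spaces and punctuation survive into the reported name")
       void addition() {
         assertEquals(2, 1 + 1);
       }
     }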

 On Thu, Nov 14, 2019 at 6:13 PM Shixiong(Ryan) Zhu <
 shixi...@databricks.com> wrote:

> Should we also add a guideline for non-Scala tests? Other languages
> (Java, Python, R) don't support using a string as a test name.
>
> Best Regards,
> Ryan
>
>
> On Thu, Nov 14, 2019 at 4:04 AM Hyukjin Kwon 
> wrote:
>
>> I opened a PR - https://github.com/apache/spark-website/pull/231
>>
>> On Wed, Nov 13, 2019 at 10:43 AM, Hyukjin Kwon wrote:
>>
>>> > In general a test should be self-descriptive and I don't think we
>>> > should be adding JIRA ticket references wholesale. Any action that the
>>> > reader has to take to understand why a test was introduced is one too
>>> > many. However, in some cases the thing we are trying to test is very
>>> > subtle, and in that case a reference to a JIRA ticket might be useful.
>>> > I do still feel that this should be a backstop and that properly
>>> > documenting your tests is a much better way of dealing with this.
>>>
>>> Yeah, the test should be self-descriptive. I don't think adding a JIRA
>>> prefix harms this point. Probably I should add this sentence to the
>>> guidelines as well.
>>> Adding a JIRA prefix just adds one extra hint to track down details. I
>>> think it's fine to stick to this practice and make it simpler and clearer
>>> to follow.
>>>
>>> > 1. What if multiple JIRA IDs relate to the same test? Do we just take
>>> > the very first JIRA ID?
>>> Ideally, one JIRA should describe one issue and one PR should fix one
>>> JIRA with a dedicated test.
>>> Yeah, I think I would take the very first JIRA ID.
>>>
>>> > 2. Are we going to do a full scan of all existing tests and attach a
>>> > JIRA ID to each of them?
>>> Yeah, let's not do that.
>>>
>>> > It's a nice-to-have, not super essential, just 

CR for adding bucket join support to V2 Datasources

2019-11-18 Thread Long, Andrew
Hey Friends,

I recently created a pull request to add optional support for bucket joins
to V2 Datasources, via a concrete class representing the Spark-style hash
partitioning. If anyone has some free time, I'd appreciate a code review.
This also adds a concrete implementation of the V2 ClusteredDistribution to
make specifying clustered distributions easier.

https://github.com/apache/spark/pull/26511

Cheers,
Andrew