Re: Ask for ARM CI for spark

2019-11-17 Thread Tianhua huang
We can talk about this later, but I have to update a few things :)

- It (largely) worked previously
  --- But no one was sure about this before the ARM testing, and it isn't
documented anywhere; stating it officially will make it clearer
- I think you're also saying you don't have 100% tests passing anyway,
though probably just small issues
  --- The Maven and Python tests are 100% passing; see
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ and
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/
- It does not seem to merit a special announcement from the PMC among
the 2000+ changes in Spark 3
  --- It's important to users; I believe it deserves one

On Mon, Nov 18, 2019 at 10:06 AM Sean Owen  wrote:

> Same response as before:
>
> - It is in the list of resolved JIRAs, of course
> - It (largely) worked previously
> - I think you're also saying you don't have 100% tests passing anyway,
> though probably just small issues
> - It does not seem to merit a special announcement from the PMC among
> the 2000+ changes in Spark 3
> - You are welcome to announce (on the project's user@ list if you
> like) whatever you want. Obviously, this is already well advertised on
> dev@
>
> I think you are asking for what borders on endorsement, and no, that
> doesn't sound appropriate. Please just announce whatever you like, as
> suggested.
>
> Sean
>
> On Sun, Nov 17, 2019 at 8:01 PM Tianhua huang 
> wrote:
> >
> > @Sean Owen,
> > I'm afraid I don't agree with you this time. I still remember that no one
> could tell me whether Spark supports ARM, or how well, when I first asked
> on dev@. You were very kind and told me to build and test on ARM locally,
> and, sorry, I think you were not very sure about this at that moment,
> right? Then my team and I worked with the community: we found and fixed
> several issues, integrated the ARM jobs into AMPLab Jenkins, and the daily
> jobs have been running stably for a few weeks... After these efforts, why
> not announce this officially in the Spark release notes? I believe that
> after this everyone will know Spark is fully tested on ARM on the community
> CI and that Spark basically supports ARM; it's amazing and will be very
> helpful. So what do you think? Or what are you worried about?
>


Re: Adding JIRA ID as the prefix for the test case name

2019-11-17 Thread Hyukjin Kwon
Actually, there are not that many Java test cases in Spark (since, as
everybody knows, Scala runs on the JVM) [1].

Given that, I think we can avoid putting effort into this for now. I don't
mind if somebody wants to give it a shot, since it looks good anyway, but I
wouldn't spend much time on this myself.

Let me just go ahead as I suggested, if you don't mind. Anyone can give
Display Name a shot - I'm willing to actively review and help.

[1]
git ls-files '*Suite.java' | wc -l
 172
git ls-files '*Suite.scala' | wc -l
1161
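
For what it's worth, a JUnit 5 Display Name test could look like the sketch
below (in Scala for brevity; the same annotation works on Java test methods.
It assumes junit-jupiter is on the classpath, and the suite and test names
are made up):

import org.junit.jupiter.api.{DisplayName, Test}

class ExampleDisplayNameSuite {
  // Reporters show the display name instead of the method name.
  @Test
  @DisplayName("SPARK-12345: empty input produces empty output")
  def emptyInput(): Unit = {
    assert(Seq.empty[Int].map(_ + 1).isEmpty)
  }
}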

On Mon, Nov 18, 2019 at 3:27 AM, Steve Loughran wrote:

> Test reporters do often contain some assumptions about the characters in
> the test method names. Historically, JUnit XML reporters have never sanitised
> the method names, so XML injection attacks have been fairly trivial. Haven't
> tried this for a while.
>
> That whole JUnit XML report "standard" was actually put together in the
> Ant project, with <junitreport> doing the postprocessing of the JUnit run.
> It was driven more by the team's XSL skills than by any overarching strategic
> goal about how to present the results of tests which could run for hours and
> whose logs you would really want to aggregate from multiple
> machines and processes and present in a way you can actually navigate. With
> hindsight, a key failing is that we chose to store the test summaries (test
> count, failure count...) as attributes on the root XML node. Which is why
> the whole DOM gets built up in the JUnit runner. Which is why, when that
> JUnit process crashes, you get no report at all.
>
> It'd be straightforward to fix - except too much relies on that file
> now... important things would break. And the Maven runner has historically
> never supported custom reporters, to let you experiment with it.
>
> Maybe this is an opportunity to change things.
>
> On Sun, Nov 17, 2019 at 1:42 AM Hyukjin Kwon  wrote:
>
>> DisplayName looks good in general, but here I would first like to
>> find an existing pattern to document in the guidelines, given the actual
>> existing practice we are all used to. I'm trying to be very conservative,
>> since these guidelines affect everybody.
>>
>> I think it might be better to discuss separately if we want to change
>> what we have been used to.
>>
>> Also, using arbitrary names might not actually be free, due to bugs
>> like https://github.com/apache/spark/pull/25630 . It will need some more
>> effort to investigate as well.
>>
>> On Fri, 15 Nov 2019, 20:56 Steve Loughran, 
>> wrote:
>>
>>>  Junit5: Display names.
>>>
>>> Goes all the way to the XML.
>>>
>>>
>>> https://junit.org/junit5/docs/current/user-guide/#writing-tests-display-names
>>>
>>> On Thu, Nov 14, 2019 at 6:13 PM Shixiong(Ryan) Zhu <
>>> shixi...@databricks.com> wrote:
>>>
 Should we also add a guideline for non-Scala tests? Other languages
 (Java, Python, R) don't support using a string as a test name.

 Best Regards,
 Ryan


 On Thu, Nov 14, 2019 at 4:04 AM Hyukjin Kwon 
 wrote:

> I opened a PR - https://github.com/apache/spark-website/pull/231
>
> 2019년 11월 13일 (수) 오전 10:43, Hyukjin Kwon 님이 작성:
>
>> > In general a test should be self descriptive and I don't think we
>> should be adding JIRA ticket references wholesale. Any action that the
>> reader has to take to understand why a test was introduced is one too 
>> many.
>> However in some cases the thing we are trying to test is very subtle and 
>> in
>> that case a reference to a JIRA ticket might be useful, I do still feel
>> that this should be a backstop and that properly documenting your tests 
>> is
>> a much better way of dealing with this.
>>
>> Yeah, the test should be self-descriptive. I don't think adding a
>> JIRA prefix harms this point. Probably I should add this sentence to the
>> guidelines as well.
>> Adding a JIRA prefix just adds one extra hint to track down details.
>> I think it's fine to stick to this practice and make it simpler and clearer
>> to follow.
>>
>> > 1. what if multiple JIRA IDs relating to the same test? we just
>> take the very first JIRA ID?
>> Ideally one JIRA should describe one issue and one PR should fix one
>> JIRA with a dedicated test.
>> Yeah, I think I would take the very first JIRA ID.
>>
>> > 2. are we going to have a full scan of all existing tests and
>> attach a JIRA ID to it?
>> Yeah, let's not do this.
>>
>> > It's a nice-to-have, not super essential, just because ...
>> It's been asked multiple times, and each committer seems to have a
>> different understanding of this.
>> It's not a biggie, but I wanted to make it clear and settle this.
>>
>> > I'd add this only when a test specifically targets a certain issue.
>> Yes, this one I am not sure about. From what I've heard, people add the
>> JIRA in the cases below:
>>
>> - Whenever the JIRA type is a bug
>> - 

Re: Ask for ARM CI for spark

2019-11-17 Thread Sean Owen
Same response as before:

- It is in the list of resolved JIRAs, of course
- It (largely) worked previously
- I think you're also saying you don't have 100% tests passing anyway,
though probably just small issues
- It does not seem to merit a special announcement from the PMC among
the 2000+ changes in Spark 3
- You are welcome to announce (on the project's user@ list if you
like) whatever you want. Obviously, this is already well advertised on
dev@

I think you are asking for what borders on endorsement, and no, that
doesn't sound appropriate. Please just announce whatever you like, as
suggested.

Sean

On Sun, Nov 17, 2019 at 8:01 PM Tianhua huang  wrote:
>
> @Sean Owen,
> I'm afraid I don't agree with you this time. I still remember that no one
> could tell me whether Spark supports ARM, or how well, when I first asked on
> dev@. You were very kind and told me to build and test on ARM locally, and,
> sorry, I think you were not very sure about this at that moment, right? Then
> my team and I worked with the community: we found and fixed several issues,
> integrated the ARM jobs into AMPLab Jenkins, and the daily jobs have been
> running stably for a few weeks... After these efforts, why not announce this
> officially in the Spark release notes? I believe that after this everyone
> will know Spark is fully tested on ARM on the community CI and that Spark
> basically supports ARM; it's amazing and will be very helpful. So what do
> you think? Or what are you worried about?




Re: Ask for ARM CI for spark

2019-11-17 Thread Tianhua huang
@Sean Owen,
I'm afraid I don't agree with you this time. I still remember that no one
could tell me whether Spark supports ARM, or how well, when I first asked on
dev@. You were very kind and told me to build and test on ARM locally, and,
sorry, I think you were not very sure about this at that moment, right? Then
my team and I worked with the community: we found and fixed several issues,
integrated the ARM jobs into AMPLab Jenkins, and the daily jobs have been
running stably for a few weeks... After these efforts, why not announce this
officially in the Spark release notes? I believe that after this everyone
will know Spark is fully tested on ARM on the community CI and that Spark
basically supports ARM; it's amazing and will be very helpful. So what do
you think? Or what are you worried about?

On Mon, Nov 18, 2019 at 2:28 AM Steve Loughran  wrote:

> The ASF PR team would like something like "Spark now supports ARM" in
> press releases. And don't forget: they do like to be involved in the
> launch of the final release.
>
> On Fri, Nov 15, 2019 at 9:46 AM bo zhaobo 
> wrote:
>
>> Hi @Sean Owen  ,
>>
>> Thanks for your idea.
>>
>> We may have used the wrong words to describe our request. It's true that we
>> cannot just say "Spark supports ARM from release 3.0.0", and we also cannot
>> say the past releases cannot run on ARM. But the reality is that the past
>> releases didn't get the full testing on ARM that the current testing does.
>> And it's true that the current CI system has no resources that can fit this
>> kind of request (testing on ARM).
>>
>> And please consider: if a user wants to run the latest Spark release on
>> ARM (or even an old release), but the community doesn't say that the
>> specific Spark release was tested on ARM, I think the user might see a risk
>> in running on ARM. If they have no choice and have to run Spark on ARM,
>> they will build a CI system by themselves. That's very expensive, right?
>> But now the community will do the same testing on ARM upstream, and this
>> will save users' resources. That's why an announcement by the community in
>> some form is official and best, such as "In release XXX, Spark is fully
>> tested on ARM" or "In release XXX, the Spark community integrated an ARM
>> CI system." Once users see that, they will be very comfortable using Spark
>> on ARM. ;-)
>>
>> Thanks for your patience; we're just discussing here. If I get something
>> wrong, please feel free to correct me and discuss. ;-)
>>
>> Thanks,
>>
>> BR
>>
>> ZhaoBo
>>
>>
>> On Fri, Nov 15, 2019 at 5:04 PM, Sean Owen wrote:
>>
>>> I don't think that's true either, not yet. Being JVM-based with no
>>> native code, I just don't even think it would be common to assume it
>>> doesn't work, and apparently it has worked. If you want to announce it,
>>> that's up to you.
>>>
>>> On Fri, Nov 15, 2019 at 3:01 AM Tianhua huang 
>>> wrote:
>>> >
>>> > @Sean Owen,
>>> > Thanks for your attention to this.
>>> > I agree with you; it's probably not very appropriate to say 'supports
>>> ARM from the 3.0 release'. How about changing the wording to "the Spark
>>> community fully tests on ARM from the 3.0 release"?
>>> > Let's try to think about it from the user's point of view rather than
>>> the developer's: users have to know exactly whether Spark supports ARM
>>> well and whether Spark is fully tested on ARM. If we state that Spark is
>>> fully tested on ARM, I believe users will have much more confidence
>>> running Spark on ARM.
>>> >
>>>
>>


Re: Ask for ARM CI for spark

2019-11-17 Thread Steve Loughran
The ASF PR team would like something like "Spark now supports ARM" in
press releases. And don't forget: they do like to be involved in the
launch of the final release.

On Fri, Nov 15, 2019 at 9:46 AM bo zhaobo 
wrote:

> Hi @Sean Owen  ,
>
> Thanks for your idea.
>
> We may have used the wrong words to describe our request. It's true that we
> cannot just say "Spark supports ARM from release 3.0.0", and we also cannot
> say the past releases cannot run on ARM. But the reality is that the past
> releases didn't get the full testing on ARM that the current testing does.
> And it's true that the current CI system has no resources that can fit this
> kind of request (testing on ARM).
>
> And please consider: if a user wants to run the latest Spark release on
> ARM (or even an old release), but the community doesn't say that the
> specific Spark release was tested on ARM, I think the user might see a risk
> in running on ARM. If they have no choice and have to run Spark on ARM,
> they will build a CI system by themselves. That's very expensive, right?
> But now the community will do the same testing on ARM upstream, and this
> will save users' resources. That's why an announcement by the community in
> some form is official and best, such as "In release XXX, Spark is fully
> tested on ARM" or "In release XXX, the Spark community integrated an ARM
> CI system." Once users see that, they will be very comfortable using Spark
> on ARM. ;-)
>
> Thanks for your patience; we're just discussing here. If I get something
> wrong, please feel free to correct me and discuss. ;-)
>
> Thanks,
>
> BR
>
> ZhaoBo
>
>
> On Fri, Nov 15, 2019 at 5:04 PM, Sean Owen wrote:
>
>> I don't think that's true either, not yet. Being JVM-based with no
>> native code, I just don't even think it would be common to assume it
>> doesn't work, and apparently it has worked. If you want to announce it,
>> that's up to you.
>>
>> On Fri, Nov 15, 2019 at 3:01 AM Tianhua huang 
>> wrote:
>> >
>> > @Sean Owen,
>> > Thanks for your attention to this.
>> > I agree with you; it's probably not very appropriate to say 'supports
>> ARM from the 3.0 release'. How about changing the wording to "the Spark
>> community fully tests on ARM from the 3.0 release"?
>> > Let's try to think about it from the user's point of view rather than
>> the developer's: users have to know exactly whether Spark supports ARM
>> well and whether Spark is fully tested on ARM. If we state that Spark is
>> fully tested on ARM, I believe users will have much more confidence
>> running Spark on ARM.
>> >
>>
>


Re: Adding JIRA ID as the prefix for the test case name

2019-11-17 Thread Steve Loughran
Test reporters do often contain some assumptions about the characters in
the test method names. Historically, JUnit XML reporters have never sanitised
the method names, so XML injection attacks have been fairly trivial. Haven't
tried this for a while.

That whole JUnit XML report "standard" was actually put together in the Ant
project, with <junitreport> doing the postprocessing of the JUnit run. It
was driven more by the team's XSL skills than by any overarching strategic
goal about how to present the results of tests which could run for hours and
whose logs you would really want to aggregate from multiple
machines and processes and present in a way you can actually navigate. With
hindsight, a key failing is that we chose to store the test summaries (test
count, failure count...) as attributes on the root XML node. Which is why
the whole DOM gets built up in the JUnit runner. Which is why, when that
JUnit process crashes, you get no report at all.
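
A minimal sketch of that shape (in Scala, assuming the scala-xml module; the
names are made up). Because the summary counts are attributes on the root
element, which must be written first, nothing can be streamed out until every
result is known - hence the whole DOM being held in the runner:

import scala.xml.Elem

case class Result(name: String, failed: Boolean)

// tests/failures live on the root <testsuite> element, so the writer has to
// see all results before it can emit the first byte of the report.
def report(results: Seq[Result]): Elem =
  <testsuite tests={results.size.toString}
             failures={results.count(_.failed).toString}>
    {results.map(r => <testcase name={r.name}/>)}
  </testsuite>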

It'd be straightforward to fix - except too much relies on that file
now... important things would break. And the Maven runner has historically
never supported custom reporters, to let you experiment with it.

Maybe this is an opportunity to change things.

On Sun, Nov 17, 2019 at 1:42 AM Hyukjin Kwon  wrote:

> DisplayName looks good in general, but here I would first like to
> find an existing pattern to document in the guidelines, given the actual
> existing practice we are all used to. I'm trying to be very conservative,
> since these guidelines affect everybody.
>
> I think it might be better to discuss separately if we want to change what
> we have been used to.
>
> Also, using arbitrary names might not actually be free, due to bugs
> like https://github.com/apache/spark/pull/25630 . It will need some more
> effort to investigate as well.
>
> On Fri, 15 Nov 2019, 20:56 Steve Loughran, 
> wrote:
>
>>  Junit5: Display names.
>>
>> Goes all the way to the XML.
>>
>>
>> https://junit.org/junit5/docs/current/user-guide/#writing-tests-display-names
>>
>> On Thu, Nov 14, 2019 at 6:13 PM Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>>> Should we also add a guideline for non-Scala tests? Other languages
>>> (Java, Python, R) don't support using a string as a test name.
>>>
>>> Best Regards,
>>> Ryan
>>>
>>>
>>> On Thu, Nov 14, 2019 at 4:04 AM Hyukjin Kwon 
>>> wrote:
>>>
 I opened a PR - https://github.com/apache/spark-website/pull/231

 On Wed, Nov 13, 2019 at 10:43 AM, Hyukjin Kwon wrote:

> > In general a test should be self descriptive and I don't think we
> should be adding JIRA ticket references wholesale. Any action that the
> reader has to take to understand why a test was introduced is one too 
> many.
> However in some cases the thing we are trying to test is very subtle and 
> in
> that case a reference to a JIRA ticket might be useful, I do still feel
> that this should be a backstop and that properly documenting your tests is
> a much better way of dealing with this.
>
> Yeah, the test should be self-descriptive. I don't think adding a JIRA
> prefix harms this point. Probably I should add this sentence to the
> guidelines as well.
> Adding a JIRA prefix just adds one extra hint to track down details. I
> think it's fine to stick to this practice and make it simpler and clearer
> to follow.
>
> > 1. what if multiple JIRA IDs relating to the same test? we just take
> the very first JIRA ID?
> Ideally one JIRA should describe one issue and one PR should fix one
> JIRA with a dedicated test.
> Yeah, I think I would take the very first JIRA ID.
>
> > 2. are we going to have a full scan of all existing tests and attach
> a JIRA ID to it?
> Yeah, let's not do this.
>
> > It's a nice-to-have, not super essential, just because ...
> It's been asked multiple times, and each committer seems to have a
> different understanding of this.
> It's not a biggie, but I wanted to make it clear and settle this.
>
> > I'd add this only when a test specifically targets a certain issue.
> Yes, this one I am not sure about. From what I've heard, people add the
> JIRA in the cases below:
>
> - Whenever the JIRA type is a bug
> - When a PR adds a couple of tests
> - Only when a test specifically targets a certain issue.
> - ...
>
> Which one do we prefer, and which is simpler to follow?
>
> Or I can combine them as below (I'm going to reword this when I actually
> document it):
> 1. In general, we should add a JIRA ID as the prefix of a test name when a
> PR targets a specific issue.
> In practice, this usually happens when the JIRA type is a bug or a PR
> adds a couple of tests.
> 2. Use the "SPARK-: test name" format.
>
> If there are no objections to ^, let me go with this.
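>
> For example, a test following this convention might look like the sketch
> below (using ScalaTest's FunSuite; the suite name, JIRA ID, and test body
> are made up):
>
> import org.scalatest.FunSuite
>
> class ExampleSuite extends FunSuite {
>   test("SPARK-12345: filter keeps only matching elements") {
>     assert(Seq(1, 2, 3).filter(_ > 1) == Seq(2, 3))
>   }
> }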
>
> On Wed, Nov 13, 2019 at 8:14 AM, Sean Owen wrote:
>
>> Let's suggest "SPARK-12345:" but not go back and change a 

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-17 Thread Steve Loughran
Can I take this moment to remind everyone that the version of Hive which
Spark has historically bundled (the org.spark-project one) is an orphan
project, put together to deal with Hive's shading issues and a source of
unhappiness in the Hive project. Whatever gets shipped should do its best
to avoid including that artifact.

Postponing the switch to Hadoop 3.x until after Spark 3.0 is probably the
safest move from a risk-minimisation perspective. If something breaks, you
can start with the assumption that it is in the o.a.s packages without
having to debug o.a.hadoop and o.a.hive first. There is a cost: if there
are problems with the Hadoop/Hive dependencies, those teams will inevitably
ignore the filed bug reports, for the same reason the Spark team will
probably close 1.6-related JIRAs as WONTFIX. WONTFIX responses for the
Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that
in mind: it has not been tested, it has dependencies on artifacts we know
are incompatible, and as far as the Hadoop project is concerned, people
should move to branch-3 if they want to run on a modern version of Java.

It would be really, really good if the published Spark Maven artefacts (a)
included the spark-hadoop-cloud JAR and (b) depended on Hadoop 3.x. That
way, people building their own projects will get up-to-date dependencies
and won't get WONTFIX responses themselves.
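
For illustration, a downstream sbt build picking those up could look like the
sketch below (the spark-hadoop-cloud coordinates and the version are
assumptions, not a statement of what is currently published):

// build.sbt - a sketch only
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"         % "3.0.0",
  "org.apache.spark" %% "spark-hadoop-cloud" % "3.0.0"
)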

-Steve

PS: There is a discussion on hadoop-dev about making Hadoop 2.10 the
official "last ever" branch-2 release and then declaring its predecessors
EOL; 2.10 will be the transition release.

On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian  wrote:

> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
> seemed risky, and therefore we only introduced Hive 2.3 under the
> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
> here...
>
> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that
> Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
> about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11
> upgrade together looks too risky.
>
> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen  wrote:
>
>> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
>> than introduce yet another build combination. Does Hadoop 2 + Hive 2
>> work and is there demand for it?
>>
>> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan  wrote:
>> >
>> > Do we have a limitation on the number of pre-built distributions? It seems
>> this time we need:
>> > 1. hadoop 2.7 + hive 1.2
>> > 2. hadoop 2.7 + hive 2.3
>> > 3. hadoop 3 + hive 2.3
>> >
>> > AFAIK we have always built with JDK 8 (but made it JDK 11 compatible), so we
>> don't need to add the JDK version to the combination.
>> >
>> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun 
>> wrote:
>> >>
>> >> Thank you for suggestion.
>> >>
>> >> Having a `hive-2.3` profile sounds good to me because it's orthogonal to
>> Hadoop 3.
>> >> IIRC, originally, it was proposed in that way, but we put it under
>> `hadoop-3.2` to avoid adding new profiles at that time.
>> >>
>> >> And, I'm wondering if you are considering additional pre-built
>> distributions and Jenkins jobs.
>> >>
>> >> Bests,
>> >> Dongjoon.
>> >>
>>
>