Re: Ask for ARM CI for spark

2019-07-30 Thread bo zhaobo
Hi, team.
I want to run the same tests on ARM that the existing CI runs on x86. Since
building and testing the whole Spark project takes too long, I plan to split
the work into multiple jobs to reduce the time cost. But I cannot see what
the existing CI[1] actually does (many private scripts are called), so could
any CI maintainers tell us how the jobs are split and what each of them does?
For example, PR titles containing [SQL], [INFRA], [ML], [DOC], [CORE],
[PYTHON], [k8s], [DSTREAMS], [MLlib], [SCHEDULER], [SS], [YARN], [BUILD],
etc. each seem to trigger a different CI job.

@shane knapp,
Oh, sorry to disturb you. I noticed your email address is from 'berkeley.edu';
are you the right person to ask for help with this? ;-)
If so, could you give us some help or advice? Thank you.

Thank you very much,

Best Regards,

ZhaoBo

[1] https://amplab.cs.berkeley.edu/jenkins





On Mon, Jul 29, 2019 at 9:38 AM Tianhua Huang wrote:

> @Sean Owen   Thank you very much. I saw your reply comment in
> https://issues.apache.org/jira/browse/SPARK-28519; I will test with the
> modification to see whether any other similar tests fail, and will address
> them together in one pull request.
>
> On Sat, Jul 27, 2019 at 9:04 PM Sean Owen  wrote:
>
>> Great thanks - we can take this to JIRAs now.
>> I think it's worth changing the implementation of atanh if the test value
>> just reflects what Spark does and there's evidence it is a little bit
>> inaccurate.
>> There's an equivalent formula which seems to have better accuracy.
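
For reference, a small, purely illustrative Scala sketch of two algebraically
equivalent ways to compute atanh (which formulation Spark should adopt is
decided in SPARK-28519, not here; this only shows that equivalent formulas can
disagree in the last digit due to floating-point rounding):

    // Illustrative only; not the code under discussion in SPARK-28519.
    def atanhViaRatio(x: Double): Double = 0.5 * math.log((1.0 + x) / (1.0 - x))
    def atanhViaDiff(x: Double): Double  = 0.5 * (math.log(1.0 + x) - math.log(1.0 - x))

    // For some inputs the two results differ in the final decimal digit,
    // which is the kind of last-digit discrepancy mentioned in this thread.
    println(atanhViaRatio(0.5))
    println(atanhViaDiff(0.5))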
>>
>> On Fri, Jul 26, 2019 at 10:02 PM Takeshi Yamamuro 
>> wrote:
>>
>>> Hi, all,
>>>
>>> FYI:
>>> >> @Yuming Wang the results in float8.sql are from PostgreSQL directly?
>>> >> Interesting if it also returns the same less accurate result, which
>>> >> might suggest it's more to do with underlying OS math libraries. You
>>> >> noted that these tests sometimes gave platform-dependent differences
>>> >> in the last digit, so wondering if the test value directly reflects
>>> >> PostgreSQL or just what we happen to return now.
>>>
>>> The results in float8.sql.out were recomputed in Spark/JVM.
>>> The expected output of the PostgreSQL test is here:
>>> https://github.com/postgres/postgres/blob/master/src/test/regress/expected/float8.out#L493
>>>
>>> As you can see in that file (float8.out), the results of functions other
>>> than atanh also differ between Spark/JVM and PostgreSQL.
>>> For example, the results of acosh are:
>>> -- PostgreSQL
>>>
>>> https://github.com/postgres/postgres/blob/master/src/test/regress/expected/float8.out#L487
>>> 1.31695789692482
>>>
>>> -- Spark/JVM
>>>
>>> https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/results/pgSQL/float8.sql.out#L523
>>> 1.3169578969248166
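
For reference, a minimal Scala sketch suggesting (this is an assumption, not
something stated in the thread) that the gap above is largely display
precision: the standard identity acosh(x) = ln(x + sqrt(x^2 - 1)) evaluated
on the JVM gives the Spark value, and rounding it to 14 decimal places gives
the PostgreSQL value.

    // Illustrative only: the standard acosh identity in plain Scala.
    def acosh(x: Double): Double = math.log(x + math.sqrt(x * x - 1.0))
    println(acosh(2.0))             // 1.3169578969248166 (the Spark/JVM output above)
    println(f"${acosh(2.0)}%.14f")  // 1.31695789692482 (the PostgreSQL output above)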
>>>
>>> btw, the PostgreSQL implementation for atanh just calls atanh in math.h:
>>>
>>> https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/float.c#L2606
>>>
>>> Bests,
>>> Takeshi
>>>
>>>


Re: [build system] upcoming jenkins downtime: august 3rd 2019

2019-07-30 Thread shane knapp
On Fri, Jun 14, 2019 at 9:13 AM shane knapp  wrote:

> the campus colo will be performing some electrical maintenance, which
> means that they'll be powering off the entire building.
>
> since the jenkins cluster is located in that colo, we are most definitely
> affected.  :)
>
> i'll be out of town that weekend, but will have one of my sysadmins bring
> everything back up on sunday, august 4th.  if they run into issues, i will
> jump in first thing monday, august 5th.
>
> as the time approaches, i will send reminders and updates.
>
hey everyone, just wanted to post a reminder about the upcoming jenkins
outage this weekend.

machines will be powered off friday night, and hopefully everything comes
back up on sunday.

if we have any problems, i will take care of things monday morning.



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


unsubscribe

2019-07-30 Thread Charan Gupta
unsubscribe


Re: Apache Training contribution for Spark - Feedback welcome

2019-07-30 Thread Lars Francke
On Mon, Jul 29, 2019 at 2:46 PM Sean Owen  wrote:

> TL;DR is: take the below as feedback to consider, and proceed as you
> see fit. Nobody's suggesting you can't do this.
>
> On Mon, Jul 29, 2019 at 2:58 AM Lars Francke 
> wrote:
> > The way I read your point is that anyone can publish material (which
> includes source code) under the ALv2 outside of the ASF so why should they
> donate anything to the ASF?
> > If that's what you meant why have Apache Spark or any other Apache
> project for that matter.
> >> I think your premise is that people will _collaborate_ on training
> >> materials if there's an ASF project around it. Maybe so but see below.
> > That's our hope, yes. Should we not do this because it _could_ fail?
>
> Yep this is the answer to your question. The ASF exists to facilitate
> collaboration, not just host. I think the dynamics around
> collaboration on open standard software vs training materials are
> materially different.
>

I don't see a big difference between the two things.
Content is already being collaborated on today (see documentation, websites
and the few instances of training that exist or Wikipedia for that matter).
I'm afraid we'll need to agree to disagree on this one.


> > We - as a company - have created material and sold it for years but
> every time I give a training I see something that I should have updated and
> it's become impossible to keep up. I see the same outdated material from
> other organizations, we've talked to half a dozen or so training companies
> and they all have the same problem. To create quality training material you
> really need someone with deep insider knowledge, and those people are hard
> to come by.
> > So we're trying to shift and collaborate on the material and then
> differentiate ourselves by the trainer itself.
>
> I think this hand-waves past a lot of the concern raised here, but OK
> it's an experiment.
> I don't think it's 'wrong' to try to get people to collaborate on
> slides, sure. It may work well. If it doesn't for reasons raised here,
> well, worse things have happened.
> Consider how you might mitigate possible problems:
> a) what happens when another company wants to donate its Spark content?
>

This has been decided at the ASF level already (allow competing projects,
e.g. Flink & Spark). At the Apache Training level we briefly talked about
that as well. I don't want to go into details of the process but the short
version is: We'd accept anything and would then try to incorporate it into
existing stuff.

> b) can you enshrine some best practices like making sure the content
> disclaims official association with the ASF? e.g. a trainer delivering
> it has to note the source but make clear it's not Apache training,
>

Yes.


> etc.
>


Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-30 Thread Hyukjin Kwon
From my look, +1 on the proposal, considering ANSI and other DBMSes in
general.

On Tue, Jul 30, 2019 at 3:21 PM Wenchen Fan wrote:

> We can add a config for a certain behavior if it makes sense, but the most
> important thing we need to reach agreement on here is: what should be the
> default behavior?
>
> Let's explore the solution space of table insertion behavior first:
> At compile time,
> 1. always add cast
> 2. add cast following the ANSI SQL store assignment rule (e.g. string to
> int is forbidden but long to int is allowed)
> 3. only add cast if it's 100% safe
> At runtime,
> 1. return null for invalid operations
> 2. throw exceptions at runtime for invalid operations
>
> The standards to evaluate a solution:
> 1. How robust the query execution is. For example, users usually don't
> want to see the query fail midway.
> 2. How tolerant it is to user queries. For example, a user would like to write
> long values to an int column as he knows all the long values won't exceed
> int range.
> 3. How clean the result is. For example, users usually don't want to see
> silently corrupted data (null values).
>
> The current Spark behavior for Data Source V1 tables: always add cast and
> return null for invalid operations. This maximizes standard 1 and 2, but
> the result is least clean and users are very likely to see silently
> corrupted data (null values).
>
> The current Spark behavior for Data Source V2 tables (new in Spark 3.0):
> only add cast if it's 100% safe. This maximizes standard 1 and 3, but many
> queries may fail to compile, even if these queries can run on other SQL
> systems. Note that, people can still see silently corrupted data because
> cast is not the only one that can return corrupted data. Simple operations
> like ADD can also return corrupted data if overflow happens. e.g. INSERT
> INTO t1 (intCol) SELECT anotherIntCol + 100 FROM t2
>
> The proposal here: add cast following ANSI SQL store assignment rule, and
> return null for invalid operations. This maximizes standard 1, and also
> fits standard 2 well: if a query can't compile in Spark, it usually can't
> compile in other mainstream databases either. I think that's tolerant
> enough. For standard 3, this proposal doesn't maximize it but can avoid
> many invalid operations already.
>
> Technically we can't make the result 100% clean at compile time; we have
> to handle things like overflow at runtime. I think the new proposal makes
> more sense as the default behavior.
>
>
> On Mon, Jul 29, 2019 at 8:31 PM Russell Spitzer 
> wrote:
>
>> I understand Spark is making the decisions; I'm saying the actual final
>> effect of the null decision would be different depending on the insertion
>> target if the target has different behaviors for null.
>>
>> On Mon, Jul 29, 2019 at 5:26 AM Wenchen Fan  wrote:
>>
>>> > I'm a big -1 on null values for invalid casts.
>>>
>>> This is why we want to introduce the ANSI mode, so that invalid cast
>>> fails at runtime. But we have to keep the null behavior for a while, to
>>> keep backward compatibility. Spark has returned null for invalid casts since
>>> the first day of Spark SQL; we can't just change it without a way to restore
>>> the old behavior.
>>>
>>> I'm OK with adding a strict mode for the upcast behavior in table
>>> insertion, but I don't agree with making it the default. The default
>>> behavior should be either the ANSI SQL behavior or the legacy Spark
>>> behavior.
>>>
>>> > other modes should be allowed only with a strict warning that the behavior
>>> will be determined by the underlying sink.
>>>
>>> Seems there is some misunderstanding. The table insertion behavior is
>>> fully controlled by Spark. Spark decides when to add casts and whether an
>>> invalid cast should return null or fail. The sink is only
>>> responsible for writing data, not the type coercion/cast stuff.
>>>
>>> On Sun, Jul 28, 2019 at 12:24 AM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
 I'm a big -1 on null values for invalid casts. This can lead to a lot
 of even more unexpected errors and runtime behavior since null is

 1. Not allowed in all schemas (Leading to a runtime error anyway)
 2. Is the same as delete in some systems (leading to data loss)

 And this would be dependent on the sink being used. Spark won't just be
 interacting with ANSI compliant sinks so I think it makes much more sense
 to be strict. I think Upcast mode is a sensible default and other modes
 should be allowed only with a strict warning that the behavior will be determined
 by the underlying sink.

 On Sat, Jul 27, 2019 at 8:05 AM Takeshi Yamamuro 
 wrote:

> Hi, all
>
> +1 for implementing this new store cast mode.
> From a viewpoint of DBMS users, this cast is pretty common for INSERTs
> and I think this functionality could
> promote migrations from existing DBMSs to Spark.
>
> The most important thing for DBMS users is that they could 

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-30 Thread Wenchen Fan
We can add a config for a certain behavior if it makes sense, but the most
important thing we need to reach agreement on here is: what should be the
default behavior?

Let's explore the solution space of table insertion behavior first:
At compile time,
1. always add cast
2. add cast following the ANSI SQL store assignment rule (e.g. string to
int is forbidden but long to int is allowed)
3. only add cast if it's 100% safe
At runtime,
1. return null for invalid operations
2. throw exceptions at runtime for invalid operations
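
For reference, a minimal plain-Scala sketch of the two runtime choices above
(this is not Spark's internal cast code, just an illustration of the trade-off
for an invalid long -> int store):

    // Option 1: degrade the invalid value to null (modeled here with Option)...
    def castLongToIntNullable(v: Long): Option[Int] =
      if (v >= Int.MinValue && v <= Int.MaxValue) Some(v.toInt) else None

    // ...or Option 2: fail loudly at runtime.
    def castLongToIntStrict(v: Long): Int =
      if (v >= Int.MinValue && v <= Int.MaxValue) v.toInt
      else throw new ArithmeticException(s"$v is out of int range")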

The standards to evaluate a solution:
1. How robust the query execution is. For example, users usually don't want
to see the query fail midway.
2. How tolerant it is to user queries. For example, a user would like to write
long values to an int column as he knows all the long values won't exceed
int range.
3. How clean the result is. For example, users usually don't want to see
silently corrupted data (null values).

The current Spark behavior for Data Source V1 tables: always add cast and
return null for invalid operations. This maximizes standard 1 and 2, but
the result is least clean and users are very likely to see silently
corrupted data (null values).
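
For reference, a small example of the V1 behavior described above, assuming a
spark-shell session with a SparkSession named `spark` and the default (legacy)
cast behavior:

    // Illustrative only: an un-castable string silently becomes NULL
    // instead of failing the query.
    spark.sql("SELECT CAST('abc' AS INT)").show()
    // the single result row contains NULL rather than an error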

The current Spark behavior for Data Source V2 tables (new in Spark 3.0):
only add cast if it's 100% safe. This maximizes standard 1 and 3, but many
queries may fail to compile, even if these queries can run on other SQL
systems. Note that, people can still see silently corrupted data because
cast is not the only one that can return corrupted data. Simple operations
like ADD can also return corrupted data if overflow happens. e.g. INSERT
INTO t1 (intCol) SELECT anotherIntCol + 100 FROM t2
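
For reference, a tiny illustration of the ADD overflow case just mentioned
(plain JVM arithmetic, not Spark itself):

    // Illustrative only: int addition silently wraps around on overflow,
    // producing corrupted data with no error.
    val anotherIntCol: Int = Int.MaxValue
    println(anotherIntCol + 100)   // prints -2147483549, not an exception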

The proposal here: add cast following ANSI SQL store assignment rule, and
return null for invalid operations. This maximizes standard 1, and also
fits standard 2 well: if a query can't compile in Spark, it usually can't
compile in other mainstream databases either. I think that's tolerant
enough. For standard 3, this proposal doesn't maximize it but can avoid
many invalid operations already.

Technically we can't make the result 100% clean at compile time; we have to
handle things like overflow at runtime. I think the new proposal makes more
sense as the default behavior.


On Mon, Jul 29, 2019 at 8:31 PM Russell Spitzer 
wrote:

> I understand Spark is making the decisions; I'm saying the actual final
> effect of the null decision would be different depending on the insertion
> target if the target has different behaviors for null.
>
> On Mon, Jul 29, 2019 at 5:26 AM Wenchen Fan  wrote:
>
>> > I'm a big -1 on null values for invalid casts.
>>
>> This is why we want to introduce the ANSI mode, so that invalid cast
>> fails at runtime. But we have to keep the null behavior for a while, to
>> keep backward compatibility. Spark has returned null for invalid casts since
>> the first day of Spark SQL; we can't just change it without a way to restore
>> the old behavior.
>>
>> I'm OK with adding a strict mode for the upcast behavior in table
>> insertion, but I don't agree with making it the default. The default
>> behavior should be either the ANSI SQL behavior or the legacy Spark
>> behavior.
>>
>> > other modes should be allowed only with a strict warning that the behavior
>> will be determined by the underlying sink.
>>
>> Seems there is some misunderstanding. The table insertion behavior is
>> fully controlled by Spark. Spark decides when to add casts and whether an
>> invalid cast should return null or fail. The sink is only
>> responsible for writing data, not the type coercion/cast stuff.
>>
>> On Sun, Jul 28, 2019 at 12:24 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> I'm a big -1 on null values for invalid casts. This can lead to a lot of
>>> even more unexpected errors and runtime behavior since null is
>>>
>>> 1. Not allowed in all schemas (Leading to a runtime error anyway)
>>> 2. Is the same as delete in some systems (leading to data loss)
>>>
>>> And this would be dependent on the sink being used. Spark won't just be
>>> interacting with ANSI compliant sinks so I think it makes much more sense
>>> to be strict. I think Upcast mode is a sensible default and other modes
>>> should be allowed only with a strict warning that the behavior will be determined
>>> by the underlying sink.
>>>
>>> On Sat, Jul 27, 2019 at 8:05 AM Takeshi Yamamuro 
>>> wrote:
>>>
 Hi, all

 +1 for implementing this new store cast mode.
 From a viewpoint of DBMS users, this cast is pretty common for INSERTs
 and I think this functionality could
 promote migrations from existing DBMSs to Spark.

 The most important thing for DBMS users is that they could optionally
 choose this mode when inserting data.
 Therefore, I think it might be okay that the two modes (the current
 upcast mode and the proposed store cast mode)
 co-exist for INSERTs. (There is room to discuss which mode is
 enabled by default though...)