Re: Ask for ARM CI for spark

2019-07-26 Thread Takeshi Yamamuro
Hi, all,

FYI:
>> @Yuming Wang the results in float8.sql are from PostgreSQL directly?
>> Interesting if it also returns the same less accurate result, which
>> might suggest it's more to do with underlying OS math libraries. You
>> noted that these tests sometimes gave platform-dependent differences
>> in the last digit, so wondering if the test value directly reflects
>> PostgreSQL or just what we happen to return now.

The results in float8.sql.out were recomputed in Spark/JVM.
The expected output of the PostgreSQL test is here:
https://github.com/postgres/postgres/blob/master/src/test/regress/expected/float8.out#L493

As you can see in that file (float8.out), results for functions other than atanh also
differ between Spark/JVM and PostgreSQL.
For example, the results for acosh are:
-- PostgreSQL
https://github.com/postgres/postgres/blob/master/src/test/regress/expected/float8.out#L487
1.31695789692482

-- Spark/JVM
https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/results/pgSQL/float8.sql.out#L523
1.3169578969248166

btw, the PostgreSQL implementation for atanh just calls atanh in math.h:
https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/float.c#L2606
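
For reference, the acosh digits above can be checked from a Scala REPL. A minimal
sketch, assuming the test input was 2 (which matches the digits quoted) and using the
textbook log identity rather than whatever Spark uses internally:

val x = 2.0

// one common formulation: acosh(x) = ln(x + sqrt(x^2 - 1))
// the value quoted above from Spark/JVM is 1.3169578969248166
val viaMath       = math.log(x + math.sqrt(x * x - 1))
// same identity via StrictMath, which is platform-independent (fdlibm)
val viaStrictMath = StrictMath.log(x + StrictMath.sqrt(x * x - 1))

println(s"Math: $viaMath, StrictMath: $viaStrictMath")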

Bests,
Takeshi

On Sat, Jul 27, 2019 at 10:35 AM bo zhaobo 
wrote:

> Hi all,
>
> Thanks for your concern. Yes, it's worth testing against the backend
> database as well. But note that this issue was hit in Spark SQL itself; we only
> test it with Spark and do not integrate other databases.
>
> Best Regards,
>
> ZhaoBo
>
>
>
>
> On Fri, Jul 26, 2019 at 5:46 PM, Sean Owen wrote:
>
>> Interesting. I don't think log(3) is special, it's just that some
>> differences in how it's implemented and floating-point values on
>> aarch64 vs x86, or in the JVM, manifest at some values like this. It's
>> still a little surprising! BTW Wolfram Alpha suggests that the correct
>> value is more like ...810969..., right between the two. java.lang.Math
>> doesn't guarantee strict IEEE floating-point behavior, but
>> java.lang.StrictMath is supposed to, at the potential cost of speed,
>> and it gives ...81096, in agreement with aarch64.
>>
>> @Yuming Wang the results in float8.sql are from PostgreSQL directly?
>> Interesting if it also returns the same less accurate result, which
>> might suggest it's more to do with underlying OS math libraries. You
>> noted that these tests sometimes gave platform-dependent differences
>> in the last digit, so wondering if the test value directly reflects
>> PostgreSQL or just what we happen to return now.
>>
>> One option is to use StrictMath in special cases like computing atanh.
>> That gives a value that agrees with aarch64.
>> I also note that 0.5 * (math.log(1 + x) - math.log(1 - x)) gives the
>> more accurate answer too, and makes the result agree with, say,
>> Wolfram Alpha for atanh(0.5).
>> (Actually if we do that, better still is 0.5 * (math.log1p(x) -
>> math.log1p(-x)) for best accuracy near 0)
>> Commons Math also has implementations of sinh, cosh, atanh that we
>> could call. It claims it's possibly more accurate and faster. I
>> haven't tested its result here.
>>
>> FWIW the "log1p" version appears, from some informal testing, to be
>> most accurate (in agreement with Wolfram) and using StrictMath doesn't
>> matter. If we change something, I'd use that version above.
>> The only issue is if this causes the result to disagree with
>> PostgreSQL, but then again it's more correct and maybe the DB is
>> wrong.
>>
>>
>> The rest may be a test vs PostgreSQL issue; see
>> https://issues.apache.org/jira/browse/SPARK-28316
>>
>>
>> On Fri, Jul 26, 2019 at 2:32 AM Tianhua huang 
>> wrote:
>> >
>> > Hi, all
>> >
>> >
>> > Sorry to disturb again, there are several sql tests failed on arm64
>> instance:
>> >
>> > pgSQL/float8.sql *** FAILED ***
>> > Expected "0.549306144334054[9]", but got "0.549306144334054[8]" Result
>> did not match for query #56
>> > SELECT atanh(double('0.5')) (SQLQueryTestSuite.scala:362)
>> > pgSQL/numeric.sql *** FAILED ***
>> > Expected "2 2247902679199174[72 224790267919917955.1326161858
>> > 4 7405685069595001 7405685069594999.0773399947
>> > 5 5068226527.321263 5068226527.3212726541
>> > 6 281839893606.99365 281839893606.9937234336
>> > 7 1716699575118595840 1716699575118597095.4233081991
>> > 8 167361463828.0749 167361463828.0749132007
>> > 9 107511333880051856] 107511333880052007", but got "2
>> 2247902679199174[40224790267919917955.1326161858
>> > 4 7405685069595001 7405685069594999.0773399947
>> > 5 5068226527.321263 5068226527.3212726541
>> > 6 281839893606.99365 281839893606.9937234336
>> > 7 1716699575118595580 1716699575118597095.4233081991
>> > 8 167361463828.0749 167361463828.0749132007
>> > 9 107511333880051872] 

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-26 Thread Wenchen Fan
I don't agree with handling literal values specially. Although Postgres
does it, I can't find anything about it in the SQL standard. And it
introduces inconsistent behaviors which may be strange to users:
* What about something like "INSERT INTO t SELECT float_col + 1.1"?
* The same insert with a decimal column as input will fail even when a
decimal literal would succeed
* Similar insert queries with "literal" inputs can be constructed through
layers of indirection via views, inline views, CTEs, unions, etc. Would
those decimals be treated as columns and fail or would we attempt to make
them succeed as well? Would users find this behavior surprising?

Silently corrupting data is bad, but this is the decision we made at the
beginning when designing Spark's behavior. Whenever an error occurs, Spark
attempts to return null instead of throwing a runtime exception. Recently we
provided configs to make Spark fail at runtime for overflow, but that's another
story. Silently corrupting data is bad, runtime exceptions are bad, and
forbidding all table insertions that may fail (even with very low
probability) is also bad. We have to make trade-offs. The trade-offs we
made in this proposal are:
* forbid table insertions that are very likely to fail, at compile time
(things like writing string values to an int column).
* allow table insertions that are not that likely to fail. If the data is
wrong, don't fail; insert null.
* provide a config to fail the insertion at runtime if the data is wrong
(a small sketch of these three outcomes follows).
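
To make these three outcomes concrete, here is a minimal, Spark-free Scala sketch
for a long value that does not fit an int column; the value and the helper
function are made up for illustration:

val v: Long = 3000000000L                      // too large for an Int

// silent corruption: plain JVM narrowing simply wraps around
val wrapped: Int = v.toInt                     // -1294967296

// the proposal's default: out-of-range data becomes null (None stands in for SQL NULL)
val asNull: Option[Int] =
  if (v >= Int.MinValue && v <= Int.MaxValue) Some(v.toInt) else None

// the opt-in config behavior: fail at runtime instead of inserting null
def checkedToInt(x: Long): Int = Math.toIntExact(x)   // throws ArithmeticException on overflow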

>  But the new behavior is only applied in DataSourceV2, so it won’t affect
existing jobs until sources move to v2 and break other behavior anyway.
When users write SQL queries, they don't care whether a table is backed by Data
Source V1 or V2. We should make sure the table insertion behavior is
consistent and reasonable. Furthermore, users may not even care whether the SQL
queries are run in Spark or another RDBMS; it's better to follow the SQL standard
than to introduce a Spark-specific behavior.

We are not talking about a small use case like allowing a decimal literal to be
written to a float column; we are talking about the bigger goal of making Spark
compliant with the SQL standard, w.r.t.
https://issues.apache.org/jira/browse/SPARK-26217 . This proposal is a
sub-task of that effort: making the table insertion behavior follow the SQL standard.

On Sat, Jul 27, 2019 at 1:35 AM Ryan Blue  wrote:

> I don’t think this is a good idea. Following the ANSI standard is usually
> fine, but here it would *silently corrupt data*.
>
> From your proposal doc, ANSI allows implicitly casting from long to int
> (any numeric type to any other numeric type) and inserts NULL when a value
> overflows. That would drop data values and is not safe.
>
> Fixing the silent corruption by adding a runtime exception is not a good
> option, either. That puts off the problem until much of the job has
> completed, instead of catching the error at analysis time. It is better to
> catch this earlier during analysis than to run most of a job and then fail.
>
> In addition, part of the justification for using the ANSI standard is to
> avoid breaking existing jobs. But the new behavior is only applied in
> DataSourceV2, so it won’t affect existing jobs until sources move to v2 and
> break other behavior anyway.
>
> I think that the correct solution is to go with the existing validation
> rules that require explicit casts to truncate values.
>
> That still leaves the use case that motivated this proposal, which is that
> floating point literals are parsed as decimals and fail simple insert
> statements. We already came up with two alternatives to fix that problem in
> the DSv2 sync and I think it is a better idea to go with one of those
> instead of “fixing” Spark in a way that will corrupt data or cause runtime
> failures.
>
> On Thu, Jul 25, 2019 at 9:11 AM Wenchen Fan  wrote:
>
>> I have heard many complaints about the old table insertion
>> behavior. Blindly casting everything leaks user mistakes to a late
>> stage of the data pipeline and makes them very hard to debug. When a user
>> writes string values to an int column, it's probably a mistake and the
>> columns are misordered in the INSERT statement. We should fail the query
>> earlier and ask users to fix the mistake.
>>
>> In the meanwhile, I agree that the new table insertion behavior we
>> introduced for Data Source V2 is too strict. It may fail valid queries
>> unexpectedly.
>>
>> In general, I support the direction of following the ANSI SQL standard.
>> But I'd like to do it with 2 steps:
>> 1. only add cast when the assignment rule is satisfied. This should be
>> the default behavior and we should provide a legacy config to restore to
>> the old behavior.
>> 2. fail the cast operation at runtime if overflow happens. AFAIK Marco
>> Gaido is working on it already. This will have a config as well and by
>> default we still return null.
>>
>> After doing this, the default behavior will be slightly different from
>> the SQL standard (cast can return 

Re: Ask for ARM CI for spark

2019-07-26 Thread bo zhaobo
Hi all,

Thanks for your concern. Yes, it's worth testing against the backend
database as well. But note that this issue was hit in Spark SQL itself; we only
test it with Spark and do not integrate other databases.

Best Regards,

ZhaoBo




On Fri, Jul 26, 2019 at 5:46 PM, Sean Owen wrote:

> Interesting. I don't think log(3) is special, it's just that some
> differences in how it's implemented and floating-point values on
> aarch64 vs x86, or in the JVM, manifest at some values like this. It's
> still a little surprising! BTW Wolfram Alpha suggests that the correct
> value is more like ...810969..., right between the two. java.lang.Math
> doesn't guarantee strict IEEE floating-point behavior, but
> java.lang.StrictMath is supposed to, at the potential cost of speed,
> and it gives ...81096, in agreement with aarch64.
>
> @Yuming Wang the results in float8.sql are from PostgreSQL directly?
> Interesting if it also returns the same less accurate result, which
> might suggest it's more to do with underlying OS math libraries. You
> noted that these tests sometimes gave platform-dependent differences
> in the last digit, so wondering if the test value directly reflects
> PostgreSQL or just what we happen to return now.
>
> One option is to use StrictMath in special cases like computing atanh.
> That gives a value that agrees with aarch64.
> I also note that 0.5 * (math.log(1 + x) - math.log(1 - x)) gives the
> more accurate answer too, and makes the result agree with, say,
> Wolfram Alpha for atanh(0.5).
> (Actually if we do that, better still is 0.5 * (math.log1p(x) -
> math.log1p(-x)) for best accuracy near 0)
> Commons Math also has implementations of sinh, cosh, atanh that we
> could call. It claims it's possibly more accurate and faster. I
> haven't tested its result here.
>
> FWIW the "log1p" version appears, from some informal testing, to be
> most accurate (in agreement with Wolfram) and using StrictMath doesn't
> matter. If we change something, I'd use that version above.
> The only issue is if this causes the result to disagree with
> PostgreSQL, but then again it's more correct and maybe the DB is
> wrong.
>
>
> The rest may be a test vs PostgreSQL issue; see
> https://issues.apache.org/jira/browse/SPARK-28316
>
>
> On Fri, Jul 26, 2019 at 2:32 AM Tianhua huang 
> wrote:
> >
> > Hi, all
> >
> >
> > Sorry to disturb again, there are several sql tests failed on arm64
> instance:
> >
> > pgSQL/float8.sql *** FAILED ***
> > Expected "0.549306144334054[9]", but got "0.549306144334054[8]" Result
> did not match for query #56
> > SELECT atanh(double('0.5')) (SQLQueryTestSuite.scala:362)
> > pgSQL/numeric.sql *** FAILED ***
> > Expected "2 2247902679199174[72 224790267919917955.1326161858
> > 4 7405685069595001 7405685069594999.0773399947
> > 5 5068226527.321263 5068226527.3212726541
> > 6 281839893606.99365 281839893606.9937234336
> > 7 1716699575118595840 1716699575118597095.4233081991
> > 8 167361463828.0749 167361463828.0749132007
> > 9 107511333880051856] 107511333880052007", but got "2
> 2247902679199174[40224790267919917955.1326161858
> > 4 7405685069595001 7405685069594999.0773399947
> > 5 5068226527.321263 5068226527.3212726541
> > 6 281839893606.99365 281839893606.9937234336
> > 7 1716699575118595580 1716699575118597095.4233081991
> > 8 167361463828.0749 167361463828.0749132007
> > 9 107511333880051872] 107511333880052007" Result did not match for
> query #496
> > SELECT t1.id1, t1.result, t2.expected
> > FROM num_result t1, num_exp_power_10_ln t2
> > WHERE t1.id1 = t2.id
> > AND t1.result != t2.expected (SQLQueryTestSuite.scala:362)
> >
> > The first test failed, because the value of math.log(3.0) is different
> on aarch64:
> >
> > # on x86_64:
> >
> > scala> val a = 0.5
> > a: Double = 0.5
> >
> > scala> a * math.log((1.0 + a) / (1.0 - a))
> > res1: Double = 0.5493061443340549
> >
> > scala> math.log((1.0 + a) / (1.0 - a))
> > res2: Double = 1.0986122886681098
> >
> > # on aarch64:
> >
> > scala> val a = 0.5
> >
> > a: Double = 0.5
> >
> > scala> a * math.log((1.0 + a) / (1.0 - a))
> >
> > res20: Double = 0.5493061443340548
> >
> > scala> math.log((1.0 + a) / (1.0 - a))
> >
> > res21: Double = 1.0986122886681096
> >
> > And I tried other several numbers like math.log(4.0) and math.log(5.0)
> and they are same, I don't know why math.log(3.0) is so special? But the
> result is different indeed on aarch64. If you are interesting, please try
> it.
> >
> > The second test failed, because some values of pow(10, x) is different
> on aarch64, according to sql tests of spark, I took similar tests on
> aarch64 and x86_64, take '-83028485' as example:
> >
> > # on x86_64:
> > scala> import java.lang.Math._
> > import java.lang.Math._
> > scala> var a = -83028485
> > a: Int = 

Re: Apache Training contribution for Spark - Feedback welcome

2019-07-26 Thread Sean Owen
On Fri, Jul 26, 2019 at 4:01 PM Lars Francke  wrote:
> I understand why it might be seen that way and we need to make sure to point 
> out that we have no intention of becoming "The official Apache Spark 
> training" because that's not our intention at all.

Of course that's the intention; the problem is perception, and I think
that's a real problem no matter the intention.


> In this case, however, a company decided to donate their internal material - 
> they didn't create this from scratch for the Apache Training project.
> We want to encourage contributions and just because someone else has already 
> created material shouldn't stop us from accepting this.

This much doesn't seem like a compelling motive. Anyone can already
donate their materials to the public domain or publish under the ALv2.
The existence of an Apache project around it doesn't do anything...
except your point below maybe:


> Every company creates its own material as an asset to sell. There's very 
> little quality open-source material out there.

(Except the example I already gave, among many others! There's a lot
of free content)


> We did some research around training and especially open-source training 
> before we started the initiative and there are some projects out there that 
> do this but all we found were silos with a relatively narrow focus and no 
> greater community.

I think your premise is that people will _collaborate_ on training
materials if there's an ASF project around it. Maybe so but see below.


> Regarding your "outlines" comment: No, this is the "final" material (pending 
> review of course). With "Training" we mean training in the sense that 
> Cloudera, Databricks et. al. sell as well where an instructor-led course is 
> being given using slides. These slides can, but don't have to speak for 
> themselves. We're fine with the requirement that an experienced instructor 
> needs to give this training. But this is just this content. We're also happy 
> to accept other forms of content that are meant for a different way of 
> consumption (self-serve). We don't intend to write exhaustive or 
> authoritative documentation for projects.

Are we talking about the content attached at TRAINING-17? It doesn't
look nearly complete or comprehensive enough to endorse as Spark
training material, IMHO. Again compare to even Jacek's site and
content for an example of what I think that would look like. It's
orders of magnitude more complete. I speak for myself, but I would not
want to endorse that as Spark training with my Apache hat.

I know the premise is, I think, these are _slides_ that trainers can
deliver, but by themselves there is not enough content for trainers to
know what to train.

What is the need this solves -- is there really demand for 'open
source' training materials? My experience is that training is by
definition professional services, and has to be delivered by people as
a for-pay business, and they need to differentiate on the quality they
provide. It's just materially different from having open standard
software.




Re: Apache Training contribution for Spark - Feedback welcome

2019-07-26 Thread Lars Francke
Sean,

thanks for taking the time to comment.

We've discussed those issues during the proposal stage for the Incubator as
others brought them up as well. I can't remember all the details but let me
go through your points inline.

My reservation here is that as an Apache project, it might appear to
> 'bless' one set of materials as authoritative over all the others out
> there.


I understand why it might be seen that way and we need to make sure to
point out that we have no intention of becoming "The official Apache Spark
training" because that's not our intention at all.


> And there are already lots of good ones. For example, Jacek has
> long maintained a very comprehensive set of free Spark training
> materials at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> In comparison the slides I see proposed so far only seem like
> outlines?
>

Jacek is indeed doing a fantastic job (and I'm sure others as well).

In this case, however, a company decided to donate their internal material
- they didn't create this from scratch for the Apache Training project.
We want to encourage contributions and just because someone else has
already created material shouldn't stop us from accepting this.

The opposite in fact: There's very little collaboration - in general -
around training material.
Every company creates its own material as an asset to sell. There's very
little quality open-source material out there.
I'm not sure how many companies have created Spark training courses. I
wouldn't be surprised if it goes into the hundreds. And everyone draws the
same or very similar slides (what's an RDD, what's a DataFrame etc.)
We hope to change that and this contribution can be a first start.

We did some research around training and especially open-source training
before we started the initiative and there are some projects out there that
do this but all we found were silos with a relatively narrow focus and no
greater community.

Regarding your "outlines" comment: No, this is the "final" material
(pending review of course). With "Training" we mean training in the sense
that Cloudera, Databricks et. al. sell as well where an instructor-led
course is being given using slides. These slides can, but don't have to
speak for themselves. We're fine with the requirement that an experienced
instructor needs to give this training. But this is just this content.
We're also happy to accept other forms of content that are meant for a
different way of consumption (self-serve). We don't intend to write
exhaustive or authoritative documentation for projects.

It just frees people from having to do the tedious work of creating (and
updating) hundreds of slides.

It's also a separate project from Spark. We might have trouble
> ensuring the info is maintained and up to date, and sometimes outdated
> or incorrect info is worse than none - especially if it appears quasi
> official. The Spark project already maintains and updates its docs
> (which can always be better), so already has its hands full there.
>

Definitely. Outdated information is always a danger and I have no guarantee
that this isn't going to happen here.
The fact that this is hosted and governed by the ASF makes it less likely
to be completely abandoned though as there are clear processes in place for
collaboration that don't depend on a single person (which might be the case
with some of the other things that already exist).
We also hope that communities like Spark are interested in
collaborating; patches are always welcome, and so is creating a Jira to
point out outdated information.


> Personally, no strong objection here, but, what's the upside to
> running this as an ASF project vs just letting people continue to
> publish quality tutorials online?
>

Some points come to mind, this list is neither exhaustive nor do all points
apply equally to all the material that others have published:

- Clear and easy guidelines for collaboration
- Not a "bus factor" of one
- Everything is open-source with a friendly license and customizable
- We're still just getting started, but because we already have four or five
different contributions we can share one technology stack between all of
them, making it easier to collaborate ("everything looks familiar"), and
every piece of content benefits from improvements in the technical stack
- We hope to have non-tool focused sessions later as well (e.g. Ingesting
data from Kafka into Elasticsearch using Spark [okay, this would maybe be a
bit too specific for now but something along the lines of a "Data
Ingestion" training]) where we can mix and match from the content we have

I'd have to dig into the original discuss threads in the incubator to find
more but I hope this helps a bit?

Cheers,
Lars


>
>
> On Fri, Jul 26, 2019 at 9:00 AM Lars Francke 
> wrote:
> >
> > Hi Spark community,
> >
> > you may or may not have heard of a new-ish (February 2019) project at
> Apache: Apache Training (incubating). We aim to develop 

Re: [DISCUSS] New sections in Github Pull Request description template

2019-07-26 Thread Bryan Cutler
The k8s template is pretty good. Under the behavior change section, it
would be good to add instructions to also describe previous and new
behavior as Hyukjin proposed.

On Tue, Jul 23, 2019 at 10:07 PM Reynold Xin  wrote:

> I like the spirit, but not sure about the exact proposal. Take a look at
> k8s':
> https://raw.githubusercontent.com/kubernetes/kubernetes/master/.github/PULL_REQUEST_TEMPLATE.md
>
>
>
> On Tue, Jul 23, 2019 at 8:27 PM, Hyukjin Kwon  wrote:
>
>> (Plus, it helps to track history too. Spark's commit logs are growing and
>> now it's pretty difficult to track the history and see what change
>> introduced a specific behaviour)
>>
>> On Wed, Jul 24, 2019 at 12:20 PM, Hyukjin Kwon wrote:
>>
>> Hi all,
>>
>> I would like to discuss about some new sections under "## What changes
>> were proposed in this pull request?":
>>
>> ### Do the changes affect _any_ user/dev-facing input or output?
>>
>> (Please answer yes or no. If yes, answer the questions below)
>>
>> ### What was the previous behavior?
>>
>> (Please provide the console output, description and/or reproducer about the 
>> previous behavior)
>>
>> ### What is the behavior the changes propose?
>>
>> (Please provide the console output, description and/or reproducer about the 
>> behavior the changes propose)
>>
>> See
>> https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
>>  .
>>
>> From my experience so far in the Spark community, and judging from
>> interactions with other committers and contributors, it is pretty critical
>> to know the before/after behaviour of a change, even if it was a bug fix.
>> In addition, reviewers often request this information.
>>
>> The new sections will make the review process much easier, and we'll be able to
>> quickly judge how serious the changes are.
>> Given that the Spark community still suffers from open PRs queueing up
>> without review, I think this can help both reviewers and PR authors.
>>
>> I do describe them often when I think it's useful and possible.
>> For instance, see https://github.com/apache/spark/pull/24927 - I am sure
>> you guys have a clear idea of what the PR fixes.
>>
>> I cc'ed some guys I can currently think of for now FYI. Please let me
>> know if you guys have any thought on this!
>>
>>
>


Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-26 Thread Ryan Blue
I don’t think this is a good idea. Following the ANSI standard is usually
fine, but here it would *silently corrupt data*.

From your proposal doc, ANSI allows implicitly casting from long to int
(any numeric type to any other numeric type) and inserts NULL when a value
overflows. That would drop data values and is not safe.

Fixing the silent corruption by adding a runtime exception is not a good
option, either. That puts off the problem until much of the job has
completed, instead of catching the error at analysis time. It is better to
catch this earlier during analysis than to run most of a job and then fail.

In addition, part of the justification for using the ANSI standard is to
avoid breaking existing jobs. But the new behavior is only applied in
DataSourceV2, so it won’t affect existing jobs until sources move to v2 and
break other behavior anyway.

I think that the correct solution is to go with the existing validation
rules that require explicit casts to truncate values.
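
For concreteness, a hedged sketch of what that explicit-cast style looks like in
Spark SQL; the table and column names below are made up for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("explicit-cast-sketch").master("local[*]").getOrCreate()

spark.sql("CREATE TABLE source (id BIGINT) USING parquet")
spark.sql("CREATE TABLE target (id INT) USING parquet")

// The narrowing from BIGINT to INT is spelled out by the writer rather than
// inserted implicitly, so any truncation is a visible, deliberate choice.
spark.sql("INSERT INTO target SELECT CAST(id AS INT) FROM source")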

That still leaves the use case that motivated this proposal, which is that
floating point literals are parsed as decimals and fail simple insert
statements. We already came up with two alternatives to fix that problem in
the DSv2 sync and I think it is a better idea to go with one of those
instead of “fixing” Spark in a way that will corrupt data or cause runtime
failures.

On Thu, Jul 25, 2019 at 9:11 AM Wenchen Fan  wrote:

> I have heard many complaints about the old table insertion behavior.
> Blindly casting everything leaks user mistakes to a late stage of
> the data pipeline and makes them very hard to debug. When a user writes
> string values to an int column, it's probably a mistake and the columns are
> misordered in the INSERT statement. We should fail the query earlier and
> ask users to fix the mistake.
>
> In the meanwhile, I agree that the new table insertion behavior we
> introduced for Data Source V2 is too strict. It may fail valid queries
> unexpectedly.
>
> In general, I support the direction of following the ANSI SQL standard.
> But I'd like to do it with 2 steps:
> 1. only add cast when the assignment rule is satisfied. This should be the
> default behavior and we should provide a legacy config to restore to the
> old behavior.
> 2. fail the cast operation at runtime if overflow happens. AFAIK Marco
> Gaido is working on it already. This will have a config as well and by
> default we still return null.
>
> After doing this, the default behavior will be slightly different from the
> SQL standard (cast can return null), and users can turn on the ANSI mode to
> fully follow the SQL standard. This is much better than before and should
> prevent a lot of user mistakes. It's also a reasonable choice to me to not
> throw exceptions at runtime by default, as it's usually bad for
> long-running jobs.
>
> Thanks,
> Wenchen
>
> On Thu, Jul 25, 2019 at 11:37 PM Gengliang Wang <
> gengliang.w...@databricks.com> wrote:
>
>> Hi everyone,
>>
>> I would like to discuss the table insertion behavior of Spark. In the
>> current data source V2, only UpCast is allowed for table insertion. I think
>> following ANSI SQL is a better idea.
>> For more information, please read the Discuss: Follow ANSI SQL on table
>> insertion
>> 
>> Please let me know if you have any thoughts on this.
>>
>> Regards,
>> Gengliang
>>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: Apache Training contribution for Spark - Feedback welcome

2019-07-26 Thread Sean Owen
Generally speaking, I think we want to encourage more training and
tutorial content out there, for sure, so, the more the merrier.

My reservation here is that as an Apache project, it might appear to
'bless' one set of materials as authoritative over all the others out
there. And there are already lots of good ones. For example, Jacek has
long maintained a very comprehensive set of free Spark training
materials at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
In comparison the slides I see proposed so far only seem like
outlines?

It's also a separate project from Spark. We might have trouble
ensuring the info is maintained and up to date, and sometimes outdated
or incorrect info is worse than none - especially if it appears quasi
official. The Spark project already maintains and updates its docs
(which can always be better), so already has its hands full there.

Personally, no strong objection here, but, what's the upside to
running this as an ASF project vs just letting people continue to
publish quality tutorials online?



On Fri, Jul 26, 2019 at 9:00 AM Lars Francke  wrote:
>
> Hi Spark community,
>
> you may or may not have heard of a new-ish (February 2019) project at Apache: 
> Apache Training (incubating). We aim to develop training material about 
> various projects inside and outside the ASF: 
>
> One of our users wants to contribute material on Spark[1]
>
> We've done something similar for ZooKeeper[2] in the past and the ZooKeeper 
> community provided excellent feedback which helped make the product much 
> better[3].
>
> That's why I'd like to invite everyone here to provide any kind of feedback 
> on the content donation. It is currently in PowerPoint format which makes it 
> a bit harder to review so we're happy to accept feedback in any form.
>
> The idea is to convert the material to AsciiDoc at some point.
>
> Cheers,
> Lars
>
> (I didn't want to cross post to user@ as well but this is obviously not 
> limited to dev@ users)
>
> [1] 
> [2] 
> [3] You can see the content here 
> 




Re: New Spark Datasource for Hive ACID tables

2019-07-26 Thread Abhishek Somani
Hey Naresh,

Thanks for your question. Yes it will work!

Thanks,
Abhishek Somani

On Fri, Jul 26, 2019 at 7:08 PM naresh Goud 
wrote:

> Thanks Abhishek.
>
> Will it work on hive acid table which is not compacted ? i.e table having
> base and delta files?
>
> Let’s say hive acid table customer
>
> Create table customer(customer_id int, customer_name string,
> customer_email string) cluster by customer_id buckets 10 location
> ‘/test/customer’ tableproperties(transactional=true)
>
>
> And table hdfs path having below directories
>
> /test/customer/base_15234/
> /test/customer/delta_1234_456
>
>
> That means table having updates and major compaction not run.
>
> Will it spark reader works ?
>
>
> Thank you,
> Naresh
>
>
>
>
>
>
>
> On Fri, Jul 26, 2019 at 7:38 AM Abhishek Somani <
> abhisheksoman...@gmail.com> wrote:
>
>> Hi All,
>>
>> We at Qubole  have open sourced a datasource
>> that will enable users to work on their Hive ACID Transactional Tables
>> 
>> using Spark.
>>
>> Github: https://github.com/qubole/spark-acid
>>
>> Hive ACID tables allow users to work on their data transactionally, and
>> also gives them the ability to Delete, Update and Merge data efficiently
>> without having to rewrite all of their data in a table, partition or file.
>> We believe that being able to work on these tables from Spark is a much
>> desired value add, as is also apparent in
>> https://issues.apache.org/jira/browse/SPARK-15348 and
>> https://issues.apache.org/jira/browse/SPARK-16996 with multiple people
>> looking for it. Currently the datasource supports reading from these ACID
>> tables only, and we are working on adding the ability to write into these
>> tables via Spark as well.
>>
>> The datasource is also available as a spark package, and instructions on
>> how to use it are available on the Github page
>> .
>>
>> We welcome your feedback and suggestions.
>>
>> Thanks,
>> Abhishek Somani
>>
> --
> Thanks,
> Naresh
> www.linkedin.com/in/naresh-dulam
> http://hadoopandspark.blogspot.com/
>
>


Apache Training contribution for Spark - Feedback welcome

2019-07-26 Thread Lars Francke
Hi Spark community,

you may or may not have heard of a new-ish (February 2019) project at
Apache: Apache Training (incubating). We aim to develop training material
about various projects inside and outside the ASF: <http://training.apache.org/>

One of our users wants to contribute material on Spark[1]

We've done something similar for ZooKeeper[2] in the past and the ZooKeeper
community provided excellent feedback which helped make the product much
better[3].

That's why I'd like to invite everyone here to provide any kind of feedback
on the content donation. It is currently in PowerPoint format which makes
it a bit harder to review so we're happy to accept feedback in any form.

The idea is to convert the material to AsciiDoc at some point.

Cheers,
Lars

(I didn't want to cross post to user@ as well but this is obviously not
limited to dev@ users)

[1]
[2]
[3] You can see the content here: <https://github.com/apache/incubator-training/blob/master/content/ZooKeeper/src/main/asciidoc/index_en.adoc>


Re: New Spark Datasource for Hive ACID tables

2019-07-26 Thread naresh Goud
Thanks Abhishek.

Will it work on a Hive ACID table which is not compacted, i.e., a table having
base and delta files?

Let’s say hive acid table customer

CREATE TABLE customer (customer_id int, customer_name string, customer_email string)
CLUSTERED BY (customer_id) INTO 10 BUCKETS
LOCATION '/test/customer'
TBLPROPERTIES ('transactional'='true')


And table hdfs path having below directories

/test/customer/base_15234/
/test/customer/delta_1234_456


That means the table has updates and major compaction has not been run.

Will the Spark reader work?


Thank you,
Naresh







On Fri, Jul 26, 2019 at 7:38 AM Abhishek Somani 
wrote:

> Hi All,
>
> We at Qubole  have open sourced a datasource
> that will enable users to work on their Hive ACID Transactional Tables
> 
> using Spark.
>
> Github: https://github.com/qubole/spark-acid
>
> Hive ACID tables allow users to work on their data transactionally, and
> also gives them the ability to Delete, Update and Merge data efficiently
> without having to rewrite all of their data in a table, partition or file.
> We believe that being able to work on these tables from Spark is a much
> desired value add, as is also apparent in
> https://issues.apache.org/jira/browse/SPARK-15348 and
> https://issues.apache.org/jira/browse/SPARK-16996 with multiple people
> looking for it. Currently the datasource supports reading from these ACID
> tables only, and we are working on adding the ability to write into these
> tables via Spark as well.
>
> The datasource is also available as a spark package, and instructions on
> how to use it are available on the Github page
> .
>
> We welcome your feedback and suggestions.
>
> Thanks,
> Abhishek Somani
>
-- 
Thanks,
Naresh
www.linkedin.com/in/naresh-dulam
http://hadoopandspark.blogspot.com/


New Spark Datasource for Hive ACID tables

2019-07-26 Thread Abhishek Somani
Hi All,

We at Qubole  have open sourced a datasource that
will enable users to work on their Hive ACID Transactional Tables
 using
Spark.

Github: https://github.com/qubole/spark-acid

Hive ACID tables allow users to work on their data transactionally, and
also give them the ability to Delete, Update and Merge data efficiently
without having to rewrite all of their data in a table, partition or file.
We believe that being able to work on these tables from Spark is a much
desired value add, as is also apparent in
https://issues.apache.org/jira/browse/SPARK-15348 and
https://issues.apache.org/jira/browse/SPARK-16996 with multiple people
looking for it. Currently the datasource supports reading from these ACID
tables only, and we are working on adding the ability to write into these
tables via Spark as well.

The datasource is also available as a spark package, and instructions on
how to use it are available on the Github page
.
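
For a quick picture of how such a datasource is typically used, here is a
hypothetical sketch; the format name and option key are assumptions on my part,
so please refer to the Github README for the actual API:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-acid-read")
  .enableHiveSupport()
  .getOrCreate()

// Read an ACID transactional table through the external datasource
// ("HiveAcid" and the "table" option are assumed names, not confirmed here).
val df = spark.read
  .format("HiveAcid")
  .option("table", "default.acid_customer")
  .load()

df.show()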

We welcome your feedback and suggestions.

Thanks,
Abhishek Somani


Re: Ask for ARM CI for spark

2019-07-26 Thread Sean Owen
Interesting. I don't think log(3) is special, it's just that some
differences in how it's implemented and floating-point values on
aarch64 vs x86, or in the JVM, manifest at some values like this. It's
still a little surprising! BTW Wolfram Alpha suggests that the correct
value is more like ...810969..., right between the two. java.lang.Math
doesn't guarantee strict IEEE floating-point behavior, but
java.lang.StrictMath is supposed to, at the potential cost of speed,
and it gives ...81096, in agreement with aarch64.

@Yuming Wang the results in float8.sql are from PostgreSQL directly?
Interesting if it also returns the same less accurate result, which
might suggest it's more to do with underlying OS math libraries. You
noted that these tests sometimes gave platform-dependent differences
in the last digit, so wondering if the test value directly reflects
PostgreSQL or just what we happen to return now.

One option is to use StrictMath in special cases like computing atanh.
That gives a value that agrees with aarch64.
I also note that 0.5 * (math.log(1 + x) - math.log(1 - x)) gives the
more accurate answer too, and makes the result agree with, say,
Wolfram Alpha for atanh(0.5).
(Actually if we do that, better still is 0.5 * (math.log1p(x) -
math.log1p(-x)) for best accuracy near 0)
Commons Math also has implementations of sinh, cosh, atanh that we
could call. It claims it's possibly more accurate and faster. I
haven't tested its result here.

FWIW the "log1p" version appears, from some informal testing, to be
most accurate (in agreement with Wolfram) and using StrictMath doesn't
matter. If we change something, I'd use that version above.
The only issue is if this causes the result to disagree with
PostgreSQL, but then again it's more correct and maybe the DB is
wrong.
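
Putting the candidates above side by side, a quick Scala sketch (the last line
assumes commons-math3 on the classpath; the platform-specific outputs are the
values quoted in this thread, not asserted here):

val x = 0.5

// current-style formulation with java.lang.Math (last digit is platform-dependent per this thread)
val viaMath       = 0.5 * (math.log(1 + x) - math.log(1 - x))

// same formula via StrictMath, which is specified to give identical results on every platform
val viaStrictMath = 0.5 * (StrictMath.log(1 + x) - StrictMath.log(1 - x))

// the log1p-based version, most accurate near 0 in the informal testing described above
val viaLog1p      = 0.5 * (math.log1p(x) - math.log1p(-x))

// the Commons Math alternative mentioned above
val viaFastMath   = org.apache.commons.math3.util.FastMath.atanh(x)

println(Seq(viaMath, viaStrictMath, viaLog1p, viaFastMath).mkString(", "))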


The rest may be a test vs PostgreSQL issue; see
https://issues.apache.org/jira/browse/SPARK-28316


On Fri, Jul 26, 2019 at 2:32 AM Tianhua huang  wrote:
>
> Hi, all
>
>
> Sorry to disturb again, there are several sql tests failed on arm64 instance:
>
> pgSQL/float8.sql *** FAILED ***
> Expected "0.549306144334054[9]", but got "0.549306144334054[8]" Result did 
> not match for query #56
> SELECT atanh(double('0.5')) (SQLQueryTestSuite.scala:362)
> pgSQL/numeric.sql *** FAILED ***
> Expected "2 2247902679199174[72 224790267919917955.1326161858
> 4 7405685069595001 7405685069594999.0773399947
> 5 5068226527.321263 5068226527.3212726541
> 6 281839893606.99365 281839893606.9937234336
> 7 1716699575118595840 1716699575118597095.4233081991
> 8 167361463828.0749 167361463828.0749132007
> 9 107511333880051856] 107511333880052007", but got "2 
> 2247902679199174[40224790267919917955.1326161858
> 4 7405685069595001 7405685069594999.0773399947
> 5 5068226527.321263 5068226527.3212726541
> 6 281839893606.99365 281839893606.9937234336
> 7 1716699575118595580 1716699575118597095.4233081991
> 8 167361463828.0749 167361463828.0749132007
> 9 107511333880051872] 107511333880052007" Result did not match for query 
> #496
> SELECT t1.id1, t1.result, t2.expected
> FROM num_result t1, num_exp_power_10_ln t2
> WHERE t1.id1 = t2.id
> AND t1.result != t2.expected (SQLQueryTestSuite.scala:362)
>
> The first test failed, because the value of math.log(3.0) is different on 
> aarch64:
>
> # on x86_64:
>
> scala> val a = 0.5
> a: Double = 0.5
>
> scala> a * math.log((1.0 + a) / (1.0 - a))
> res1: Double = 0.5493061443340549
>
> scala> math.log((1.0 + a) / (1.0 - a))
> res2: Double = 1.0986122886681098
>
> # on aarch64:
>
> scala> val a = 0.5
>
> a: Double = 0.5
>
> scala> a * math.log((1.0 + a) / (1.0 - a))
>
> res20: Double = 0.5493061443340548
>
> scala> math.log((1.0 + a) / (1.0 - a))
>
> res21: Double = 1.0986122886681096
>
> I tried several other numbers like math.log(4.0) and math.log(5.0), and
> they are the same; I don't know why math.log(3.0) is so special. But the result
> is indeed different on aarch64. If you are interested, please try it.
>
> The second test failed because some values of pow(10, x) are different on
> aarch64. Following the SQL tests of Spark, I ran similar tests on aarch64 and
> x86_64; take '-83028485' as an example:
>
> # on x86_64:
> scala> import java.lang.Math._
> import java.lang.Math._
> scala> var a = -83028485
> a: Int = -83028485
> scala> abs(a)
> res4: Int = 83028485
> scala> math.log(abs(a))
> res5: Double = 18.234694299654787
> scala> pow(10, math.log(abs(a)))
> res6: Double = 1.71669957511859584E18
>
> # on aarch64:
>
> scala> var a = -83028485
> a: Int = -83028485
> scala> abs(a)
> res38: Int = 83028485
>
> scala> math.log(abs(a))
>
> res39: Double = 18.234694299654787
> scala> pow(10, math.log(abs(a)))
> res40: Double = 1.71669957511859558E18
>
> I sent an email to jdk-dev and hope someone can help. I also proposed this
> in JIRA: https://issues.apache.org/jira/browse/SPARK-28519. If you are
> interested, you are welcome to join the discussion. Thank you very much.
>
>
> On Thu, Jul 18, 2019 at 11:12 AM Tianhua huang  
> 

Re: Ask for ARM CI for spark

2019-07-26 Thread Tianhua huang
Hi, all


Sorry to disturb you again; several SQL tests failed on an arm64
instance:

   - pgSQL/float8.sql *** FAILED ***
   Expected "0.549306144334054[9]", but got "0.549306144334054[8]" Result
   did not match for query #56
   SELECT atanh(double('0.5')) (SQLQueryTestSuite.scala:362)
   - pgSQL/numeric.sql *** FAILED ***
   Expected "2 2247902679199174[72 224790267919917955.1326161858
   4 7405685069595001 7405685069594999.0773399947
   5 5068226527.321263 5068226527.3212726541
   6 281839893606.99365 281839893606.9937234336
   7 1716699575118595840 1716699575118597095.4233081991
   8 167361463828.0749 167361463828.0749132007
   9 107511333880051856] 107511333880052007", but got "2
   2247902679199174[40224790267919917955.1326161858
   4 7405685069595001 7405685069594999.0773399947
   5 5068226527.321263 5068226527.3212726541
   6 281839893606.99365 281839893606.9937234336
   7 1716699575118595580 1716699575118597095.4233081991
   8 167361463828.0749 167361463828.0749132007
   9 107511333880051872] 107511333880052007" Result did not match for
   query #496
   SELECT t1.id1, t1.result, t2.expected
   FROM num_result t1, num_exp_power_10_ln t2
   WHERE t1.id1 = t2.id
   AND t1.result != t2.expected (SQLQueryTestSuite.scala:362)

The first test failed because the value of math.log(3.0) is different on
aarch64:

# on x86_64:
scala> val a = 0.5
a: Double = 0.5

scala> a * math.log((1.0 + a) / (1.0 - a))
res1: Double = 0.5493061443340549

scala> math.log((1.0 + a) / (1.0 - a))
res2: Double = 1.0986122886681098

# on aarch64:

scala> val a = 0.5

a: Double = 0.5

scala> a * math.log((1.0 + a) / (1.0 - a))
res20: Double = 0.5493061443340548

scala> math.log((1.0 + a) / (1.0 - a))

res21: Double = 1.0986122886681096

I tried several other numbers like math.log(4.0) and math.log(5.0), and they
are the same; I don't know why math.log(3.0) is so special. But the result
is indeed different on aarch64. If you are interested, please try it.

The second test failed because some values of pow(10, x) are different on
aarch64. Following the SQL tests of Spark, I ran similar tests on aarch64
and x86_64; take '-83028485' as an example:

# on x86_64:
scala> import java.lang.Math._
import java.lang.Math._
scala> var a = -83028485
a: Int = -83028485
scala> abs(a)
res4: Int = 83028485
scala> math.log(abs(a))
res5: Double = 18.234694299654787
scala> pow(10, math.log(abs(a)))
res6: Double = 1.71669957511859584E18

# on aarch64:

scala> var a = -83028485
a: Int = -83028485
scala> abs(a)
res38: Int = 83028485

scala> math.log(abs(a))

res39: Double = 18.234694299654787
scala> pow(10, math.log(abs(a)))
res40: Double = 1.71669957511859558E18
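
As a side note, java.lang.StrictMath is specified to produce identical results
on every platform (it mirrors fdlibm), so it can be used to check whether the
difference comes from the platform's Math.pow; a small Scala sketch along the
lines of the REPL session above:

val a = -83028485
val ln = math.log(math.abs(a))                 // 18.234694299654787 on both platforms per the thread

val viaMath       = math.pow(10, ln)           // last bits differ between x86_64 and aarch64 per the thread
val viaStrictMath = StrictMath.pow(10, ln)     // specified to be bit-identical everywhere

println(s"Math.pow: $viaMath, StrictMath.pow: $viaStrictMath")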

I sent an email to jdk-dev and hope someone can help. I also proposed this
in JIRA: https://issues.apache.org/jira/browse/SPARK-28519. If you are
interested, you are welcome to join the discussion. Thank you very much.

On Thu, Jul 18, 2019 at 11:12 AM Tianhua huang 
wrote:

> Thanks for your reply.
>
> About the first problem, we didn't find any other reason in the log, just a
> timeout waiting for the executor to come up. After increasing the timeout from 1
> ms to 3 (even 2) ms,
> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/SparkContextSuite.scala#L764
>
> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/SparkContextSuite.scala#L792
> the tests passed, and there was more than one executor up. We are not sure whether
> it's related to the flavor of our aarch64 instance. The flavor of the
> instance is currently 8C8G. Maybe we will try a bigger flavor later. If anyone has
> other suggestions, please contact me, thank you.
>
> About the second problem, I proposed a pull request to apache/spark:
> https://github.com/apache/spark/pull/25186. If you have time, would you
> please help to review it? Thank you very much.
>
> On Wed, Jul 17, 2019 at 8:37 PM Sean Owen  wrote:
>
>> On Wed, Jul 17, 2019 at 6:28 AM Tianhua huang 
>> wrote:
>> > Two failed and the reason is 'Can't find 1 executors before 1
>> milliseconds elapsed', see below, then we try increase timeout the tests
>> passed, so wonder if we can increase the timeout? and here I have another
>> question about
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/TestUtils.scala#L285,
>> why is not >=? see the comment of the function, it should be >=?
>> >
>>
>> I think it's ">" because the driver is also an executor, but not 100%
>> sure. In any event it passes in general.
>> These errors typically mean "I didn't start successfully" for some
>> other reason that may be in the logs.
>>
>> > The other two failed and the reason is '2143289344 equaled 2143289344';
>> this is because the value of floatToRawIntBits(0.0f/0.0f) on the aarch64 platform
>> is 2143289344 and equals floatToRawIntBits(Float.NaN). About this I sent an
>> email to jdk-dev and proposed a topic on the scala community
>>