[RESULT] [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-18 Thread Mridul Muralidharan
Hi,

  The vote passed with 16 +1's (6 binding) and no -1's

+1s (* = binding):

Xingbo Jiang
Venkatakrishnan Sowrirajan
Tom Graves (*)
Chandni Singh
DB Tsai (*)
Xiao Li (*)
Angers Zhu
Joseph Torres
Kalyan
Dongjoon Hyun (*)
Wenchen Fan (*)
Yi Wu
叶先进 
郑瑞峰 
Takeshi Yamamuro
Mridul Muralidharan (*)

Thanks,
Mridul


Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-18 Thread Mridul Muralidharan
Adding my +1 as well, before closing the vote.

Regards,
Mridul

On Sun, Sep 13, 2020 at 9:59 PM Mridul Muralidharan 
wrote:

> Hi,
>
> I'd like to call for a vote on SPARK-30602 - SPIP: Support push-based
> shuffle to improve shuffle efficiency.
> Please take a look at:
>
>- SPIP jira: https://issues.apache.org/jira/browse/SPARK-30602
>- SPIP doc:
>
> https://docs.google.com/document/d/1mYzKVZllA5Flw8AtoX7JUcXBOnNIDADWRbJ7GI6Y71Q/edit
>- POC against master and results summary :
>
> https://docs.google.com/document/d/1Q5m7YAp0HyG_TNFL4p_bjQgzzw33ik5i49Vr86UNZgg/edit
>
> Active discussions on the jira and SPIP document have settled.
>
> I will leave the vote open until Friday (the 18th September 2020), 5pm
> CST.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
>
> Thanks,
> Mridul
>


Inconsistencies with how catalyst optimizer handles non-deterministic expressions

2020-09-18 Thread tanelk
Hello,

I believe that non-deterministic expressions are currently handled in two
conflicting ways in the catalyst optimizer.

The first approach is the one I have seen in recent pull request reviews:
the optimizer should never change the number of times a non-deterministic
expression is executed. A good example of this is `Canonicalize.scala`:
* In addition and multiplication we allow reordering non-deterministic
expressions, because both sides will be evaluated anyway.
* In boolean OR and AND we *do not* allow reordering non-deterministic
expressions, because the right side might not be evaluated.

Then there is another approach, where we allow reordering non-deterministic
expressions even in boolean OR and AND. A good example of this is the
`PushPredicateThroughJoin` rule, where we use the
`condition.partition(_.deterministic)` pattern. The partitioned expressions
are later concatenated back together, but this effectively changes the order
of execution and can leave some non-deterministic expressions unevaluated on
rows where they would otherwise have been evaluated.
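The partitioning pattern described above can be sketched with a toy model (the names here are illustrative, not Spark's actual Catalyst classes):

```scala
// Toy stand-in for a Catalyst expression, just enough to show the
// condition.partition(_.deterministic) pattern used in
// PushPredicateThroughJoin.
final case class Expr(sql: String, deterministic: Boolean)

object PartitionDemo extends App {
  // A conjunctive condition flattened into its conjuncts.
  val conjuncts = Seq(
    Expr("a > 1",        deterministic = true),
    Expr("rand() < 0.5", deterministic = false),
    Expr("b = c",        deterministic = true)
  )

  // Deterministic conjuncts can be pushed below the join; the rest stay put.
  val (pushable, stayed) = conjuncts.partition(_.deterministic)

  // Note that "b = c" has now moved ahead of "rand() < 0.5" relative to the
  // original order -- exactly the reordering the first approach forbids.
  println(pushable.map(_.sql).mkString(" AND "))
  println(stayed.map(_.sql).mkString(" AND "))
}
```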

Initially I was sure that the second approach was wrong and was about to
open a pull request to fix it. But then I found that this is not an
accidental mistake; it is done on purpose:
https://github.com/apache/spark/pull/20069

I'm sure that both of these approaches have good arguments for them. In my
eyes:
* The first one lets users be more certain of how their stateful expressions
are evaluated: the optimizer does not change the output.
* The second one allows Catalyst to optimize more aggressively.
But by mixing the two we get the worst of both worlds: users can't be sure
how their expressions are evaluated, and we don't get the most optimal
queries.

What is the community's stance on this issue?

Regards,
Tanel




-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



RE: Spark-Locality: Hinting Spark location of the executor does not take effect.

2020-09-18 Thread Nasrulla Khan Haris
I was providing an IP address instead of the FQDN. Providing the FQDN helped.

Thanks,

From: Nasrulla Khan Haris
Sent: Wednesday, September 16, 2020 4:11 PM
To: dev@spark.apache.org
Subject: Spark-Locality: Hinting Spark location of the executor does not take 
effect.

HI Spark developers,

I want to hint Spark to use a particular list of hosts to execute tasks on. I
see that getBlockLocations is used to get the list of hosts from HDFS:

https://github.com/apache/spark/blob/7955b3962ac46b89564e0613db7bea98a1478bf2/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L386


Hinting Spark with a custom getBlockLocations that returns an Array of
BlockLocations with host IP addresses doesn't help; Spark continues to
schedule the tasks on other executor hosts.
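A minimal sketch of what I am doing, assuming a custom Hadoop FileSystem that overrides getFileBlockLocations (host names and ports below are hypothetical; as noted later in this thread, the hosts must be FQDNs that match what the executors register as, not raw IP addresses, or the locality matching will not line up):

```scala
import org.apache.hadoop.fs.{BlockLocation, FileStatus, RawLocalFileSystem}

// Sketch: a FileSystem that reports preferred hosts for every file, so that
// Spark's DataSourceScanExec sees those hosts as the block locations.
class PreferredHostsFileSystem extends RawLocalFileSystem {

  // Hypothetical worker FQDNs -- must match the executors' registered names.
  private val preferredHosts =
    Array("worker-1.example.internal", "worker-2.example.internal")

  override def getFileBlockLocations(
      file: FileStatus, start: Long, len: Long): Array[BlockLocation] = {
    // names = host:port pairs, hosts = names used for locality matching.
    Array(new BlockLocation(
      preferredHosts.map(_ + ":9866"), preferredHosts, start, len))
  }
}
```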

Is there something I am doing wrong?

Test:
spark.read.csv()


Appreciate your inputs 

Thanks,
Nasrulla



Pre query execution hook for custom datasources

2020-09-18 Thread Shubham Chaurasia
Hi,

In our custom datasource implementation, we want to inject some query level
information.

For example -

scala> val df = spark.sql("some query")   // uses custom datasource under
the hood through Session Extensions.

scala> df.count //  here we want some kind of pre-execution hook just
before the query starts its execution

Is there a hook or some kind of callback that we can implement to achieve
this?

Or, similar to org.apache.spark.sql.util.QueryExecutionListener, which
provides onSuccess and onFailure callbacks when a query finishes, we want
something like beforeStart().

Any ideas on how to implement this?
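One possible approximation (not an official "beforeStart" hook, and only a sketch): since the datasource is already wired in through Session Extensions, a no-op optimizer rule can be injected through SparkSessionExtensions. Optimizer rules run when the query is planned, i.e. just before execution, so the rule body can perform the per-query setup. Caveat: the optimizer may run more than once for a single query, so the setup should be idempotent. The class name and the setup callback here are hypothetical:

```scala
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Registered via the config: spark.sql.extensions=com.example.PreExecutionHook
class PreExecutionHook extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectOptimizerRule { session =>
      new Rule[LogicalPlan] {
        override def apply(plan: LogicalPlan): LogicalPlan = {
          // Hypothetical per-query setup goes here; runs at planning time,
          // before the physical execution starts. Keep it idempotent.
          plan // return the plan unchanged
        }
      }
    }
  }
}
```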

Thanks,
Shubham