Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-15 Thread Xiao Li
-1

We have a correctness bug fix that was merged after the RC1 cut. It would
be nice to have it in the Spark 2.3.1 release.

https://issues.apache.org/jira/browse/SPARK-24259

Xiao


2018-05-15 14:00 GMT-07:00 Marcelo Vanzin :

> Please vote on releasing the following candidate as Apache Spark version
> 2.3.1.
>
> The vote is open until Friday, May 18, at 21:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.3.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.3.1-rc1 (commit cc93bc95):
> https://github.com/apache/spark/tree/v2.3.1-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1269/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-docs/
>
> The list of bug fixes going into 2.3.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342432
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC to see if anything important breaks. In Java/Scala, you can
> add the staging repository to your project's resolvers and test with the
> RC (make sure to clean up the artifact cache before/after so you don't
> end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.3.1?
> ===
>
> The current list of open tickets targeted at 2.3.1 can be found at:
> https://s.apache.org/Q3Uo
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-15 Thread Marcelo Vanzin
It's in. That link is only a list of the currently open bugs.

On Tue, May 15, 2018 at 2:02 PM, Justin Miller wrote:
> Did SPARK-24067 not make it in? I don’t see it in https://s.apache.org/Q3Uo.
>
> Thanks,
> Justin
>
> On May 15, 2018, at 3:00 PM, Marcelo Vanzin  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.3.1.
>
> The vote is open until Friday, May 18, at 21:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.3.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.3.1-rc1 (commit cc93bc95):
> https://github.com/apache/spark/tree/v2.3.1-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1269/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-docs/
>
> The list of bug fixes going into 2.3.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342432
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC to see if anything important breaks. In Java/Scala, you can
> add the staging repository to your project's resolvers and test with the
> RC (make sure to clean up the artifact cache before/after so you don't
> end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.3.1?
> ===
>
> The current list of open tickets targeted at 2.3.1 can be found at:
> https://s.apache.org/Q3Uo
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>



-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-15 Thread Justin Miller
Did SPARK-24067 not make it in? I don't see it in https://s.apache.org/Q3Uo.

Thanks,
Justin

> On May 15, 2018, at 3:00 PM, Marcelo Vanzin  wrote:
> 
> Please vote on releasing the following candidate as Apache Spark version 
> 2.3.1.
> 
> The vote is open until Friday, May 18, at 21:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Spark 2.3.1
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see http://spark.apache.org/
> 
> The tag to be voted on is v2.3.1-rc1 (commit cc93bc95):
> https://github.com/apache/spark/tree/v2.3.1-rc1
> 
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-bin/
> 
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1269/
> 
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-docs/
> 
> The list of bug fixes going into 2.3.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342432
> 
> FAQ
> 
> =
> How can I help test this release?
> =
> 
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC to see if anything important breaks. In Java/Scala, you can
> add the staging repository to your project's resolvers and test with the
> RC (make sure to clean up the artifact cache before/after so you don't
> end up building with an out-of-date RC going forward).
> 
> ===
> What should happen to JIRA tickets still targeting 2.3.1?
> ===
> 
> The current list of open tickets targeted at 2.3.1 can be found at:
> https://s.apache.org/Q3Uo
> 
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
> 
> ==
> But my bug isn't fixed?
> ==
> 
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
> 
> 
> -- 
> Marcelo
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 



Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-15 Thread Marcelo Vanzin
I'll start with my +1 (binding). I've run unit tests and a bunch of
integration tests on the hadoop-2.7 package.

Please note that there are still a few flaky tests. Please check jira
before you decide to send a -1 because of a flaky test.

Also, apologies for the delay in getting the RC ready; still learning
the ropes. If you plan on doing this in the future, *do not* do
"svn co" on the dist.apache.org repo. The ASF Infra folks will not be
very kind to you. I'll update our RM docs later.


On Tue, May 15, 2018 at 2:00 PM, Marcelo Vanzin  wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 2.3.1.
>
> The vote is open until Friday, May 18, at 21:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.3.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.3.1-rc1 (commit cc93bc95):
> https://github.com/apache/spark/tree/v2.3.1-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1269/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-docs/
>
> The list of bug fixes going into 2.3.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342432
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC to see if anything important breaks. In Java/Scala, you can
> add the staging repository to your project's resolvers and test with the
> RC (make sure to clean up the artifact cache before/after so you don't
> end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.3.1?
> ===
>
> The current list of open tickets targeted at 2.3.1 can be found at:
> https://s.apache.org/Q3Uo
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> --
> Marcelo



-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] Spark 2.3.1 (RC1)

2018-05-15 Thread Marcelo Vanzin
Please vote on releasing the following candidate as Apache Spark version 2.3.1.

The vote is open until Friday, May 18, at 21:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.3.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.3.1-rc1 (commit cc93bc95):
https://github.com/apache/spark/tree/v2.3.1-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1269/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-docs/

The list of bug fixes going into 2.3.1 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12342432

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running it on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC to see if anything important breaks. In Java/Scala, you can
add the staging repository to your project's resolvers and test with the
RC (make sure to clean up the artifact cache before/after so you don't
end up building with an out-of-date RC going forward).
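For Java/Scala, adding the staging repository typically looks something like
the following sbt snippet (a sketch only; adjust it for your own build tool
and the Spark modules you actually use):

// build.sbt sketch for building and testing against the 2.3.1 RC1 staging artifacts.
// The staging repository goes away once it is dropped or released, so don't rely on it long term.
resolvers += "Apache Spark 2.3.1 RC1 staging" at "https://repository.apache.org/content/repositories/orgapachespark-1269/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"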

===
What should happen to JIRA tickets still targeting 2.3.1?
===

The current list of open tickets targeted at 2.3.1 can be found at:
https://s.apache.org/Q3Uo

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Preventing predicate pushdown

2018-05-15 Thread Tomasz Gawęda
Thanks, filed https://issues.apache.org/jira/browse/SPARK-24288

Pozdrawiam / Best regards,

Tomek

On 2018-05-15 18:29, Wenchen Fan wrote:
Applying predicate pushdown is an optimization, and it makes sense to provide
configs to turn off certain optimizations. Feel free to create a JIRA.

Thanks,
Wenchen

On Tue, May 15, 2018 at 8:33 PM, Tomasz Gawęda wrote:
Hi,

while working with the JDBC datasource I saw that many "or" clauses with
non-equality operators cause huge performance degradation of the SQL query
sent to the database (DB2). For example:

val df = spark.read.format("jdbc")
  // ... url, dbtable and the options used to parallelize the load ...
  .load()

df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > 100)").show()
// in the real application these predicates appear many lines below, with many ANDs and ORs

If I use cache() before the where, there is no predicate pushdown of this
"where" clause. However, in a production system caching many sources is a
waste of memory (especially if the pipeline is long and I must cache many
times).


I asked on StackOverflow for better ideas:
https://stackoverflow.com/questions/50336355/how-to-prevent-predicate-pushdown

However, there are only workarounds. I can use those workarounds, but
maybe it would be better to add such functionality directly in the API?

For example: df.withAnalysisBarrier().where(...) ?

Please let me know if I should create a JIRA, or whether it's not a good
idea for some reason.


Pozdrawiam / Best regards,

Tomek Gawęda





Re: Preventing predicate pushdown

2018-05-15 Thread Wenchen Fan
Applying predicate pushdown is an optimization, and it makes sense to provide
configs to turn off certain optimizations. Feel free to create a JIRA.

Thanks,
Wenchen

On Tue, May 15, 2018 at 8:33 PM, Tomasz Gawęda wrote:

> Hi,
>
> while working with the JDBC datasource I saw that many "or" clauses with
> non-equality operators cause huge performance degradation of the SQL query
> sent to the database (DB2). For example:
>
> val df = spark.read.format("jdbc")
>   // ... url, dbtable and the options used to parallelize the load ...
>   .load()
>
> df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > 100)").show()
> // in the real application these predicates appear many lines below, with many ANDs and ORs
>
> If I use cache() before the where, there is no predicate pushdown of this
> "where" clause. However, in a production system caching many sources is a
> waste of memory (especially if the pipeline is long and I must cache many
> times).
>
>
> I asked on StackOverflow for better ideas:
> https://stackoverflow.com/questions/50336355/how-to-prevent-predicate-pushdown
>
> However, there are only workarounds. I can use those workarounds, but
> maybe it would be better to add such functionality directly in the API?
>
> For example: df.withAnalysisBarrier().where(...) ?
>
> Please let me know if I should create a JIRA, or whether it's not a good
> idea for some reason.
>
>
> Pozdrawiam / Best regards,
>
> Tomek Gawęda
>
>


Preventing predicate pushdown

2018-05-15 Thread Tomasz Gawęda
Hi,

while working with the JDBC datasource I saw that many "or" clauses with
non-equality operators cause huge performance degradation of the SQL query
sent to the database (DB2). For example:

val df = spark.read.format("jdbc")
  // ... url, dbtable and the options used to parallelize the load ...
  .load()

df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > 100)").show()
// in the real application these predicates appear many lines below, with many ANDs and ORs

If I use cache() before the where, there is no predicate pushdown of this
"where" clause. However, in a production system caching many sources is a
waste of memory (especially if the pipeline is long and I must cache many
times).


I asked on StackOverflow for better ideas: 
https://stackoverflow.com/questions/50336355/how-to-prevent-predicate-pushdown

However, there are only workarounds. I can use those workarounds, but 
maybe it would be better to add such functionality directly in the API?

For example: df.withAnalysisBarrier().where(...) ?
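
For comparison, one of those workarounds is to wrap the column in an identity
UDF, which stops the comparison from being translated into a data source
filter; a rough sketch, assuming date1 is a timestamp column and reusing df
and param1 from the snippet above (the noPush name is made up):

import org.apache.spark.sql.functions._

// Workaround sketch: a predicate containing a Scala UDF cannot be converted into a
// JDBC source filter, so Spark evaluates it itself instead of pushing it to the database.
// The trade-off: the database returns unfiltered rows and Spark does the filtering.
val noPush = udf((t: java.sql.Timestamp) => t)

df.where(noPush(col("date1")) > lit(param1)).show()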

Please let me know if I should create a JIRA, or whether it's not a good
idea for some reason.


Pozdrawiam / Best regards,

Tomek Gawęda



Re: Sort-merge join improvement

2018-05-15 Thread Petar Zecevic
Based on some reviews I put additional effort into fixing the case when 
wholestage codegen is turned off.


Sort-merge join with additional range conditions is now 10x faster (can 
be more or less, depending on exact use-case) in both cases - with 
wholestage turned off or on - compared to non-optimized SMJ.


Merging this would help us tremendously and I believe this can be useful 
in other applications, too.


Can you please review (https://github.com/apache/spark/pull/21109) and 
merge the patch?


Thank you,

Petar Zecevic


On 4/23/2018 at 6:28 PM, Petar Zecevic wrote:

Hi,

the PR tests completed successfully
(https://github.com/apache/spark/pull/21109).

Can you please review the patch and merge it upstream if you think it's OK?

Thanks,

Petar


On 4/18/2018 at 4:52 PM, Petar Zecevic wrote:

As instructed offline, I opened a JIRA for this:

https://issues.apache.org/jira/browse/SPARK-24020

I will create a pull request soon.


On 4/17/2018 at 6:21 PM, Petar Zecevic wrote:

Hello everybody

We (at the University of Zagreb and the University of Washington) have
implemented an optimization of Spark's sort-merge join (SMJ) which has
improved the performance of our jobs considerably, and we would like to know
if the Spark community thinks it would be useful to include this in the main
distribution.

The problem we are solving is the case where you have two big tables
partitioned by X column, but also sorted by Y column (within partitions)
and you need to calculate an expensive function on the joined rows.
During a sort-merge join, Spark will do cross-joins of all rows that
have the same X values and calculate the function's value on all of
them. If the two tables have a large number of rows per X, this can
result in a huge number of calculations.

Our optimization allows you to reduce the number of matching rows per X
using a range condition on Y columns of the two tables. Something like:

... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d
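
For concreteness, a small self-contained sketch of a query matching this
pattern (table, column and function names are made up for illustration and
are not from our actual jobs):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Illustrative sketch: equi-join on the X column plus a range condition on the Y column.
val spark = SparkSession.builder.appName("range-join-sketch").getOrCreate()
import spark.implicits._

val t1 = Seq((1, 100L, 1.0), (1, 160L, 2.0), (2, 100L, 3.0)).toDF("x", "y", "v")
val t2 = Seq((1, 120L, 4.0), (1, 400L, 5.0), (2, 90L, 6.0)).toDF("x", "y", "v")

// Stand-in for the expensive function evaluated on each joined pair.
val expensive = udf((a: Double, b: Double) => math.exp(-math.abs(a - b)))

val d = 30L
val joined = t1.as("a").join(t2.as("b"),
    $"a.x" === $"b.x" &&                      // equi-join on X
    $"a.y".between($"b.y" - d, $"b.y" + d))   // range condition on Y
  .select($"a.x", $"a.y", $"b.y", expensive($"a.v", $"b.v").as("score"))

joined.show()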

The way SMJ is currently implemented, these extra conditions have no
influence on the number of rows (per X) being checked because these
extra conditions are put in the same block with the function being
calculated.

Our optimization changes the sort-merge join so that, when these extra
conditions are specified, a queue is used instead of the
ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a
moving window across the values from the right relation as the left row
changes. You could call this a combination of an equi-join and a theta
join (we call it "sort-merge inner range join").
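
To make the moving-window idea concrete, here is a simplified, in-memory
sketch of the inner loop using plain Scala collections (this is not the code
from the pull request, which works on Spark's internal rows):

import scala.collection.mutable

// Both inputs are assumed sorted by (x, y). For each left row we keep a queue of right rows
// with the same x whose y lies in [left.y - d, left.y + d]; the queue slides forward as the
// left row advances instead of re-scanning every right row with the same x.
case class Rec(x: Int, y: Long)

def rangeJoin(left: Seq[Rec], right: Seq[Rec], d: Long): Seq[(Rec, Rec)] = {
  val out = mutable.ArrayBuffer.empty[(Rec, Rec)]
  val window = mutable.Queue.empty[Rec]
  var ri = 0
  for (l <- left) {
    // Evict right rows that can no longer match this or any later left row.
    while (window.nonEmpty && (window.head.x != l.x || window.head.y < l.y - d)) window.dequeue()
    // Admit right rows that may match the current left row.
    while (ri < right.length &&
           (right(ri).x < l.x || (right(ri).x == l.x && right(ri).y <= l.y + d))) {
      if (right(ri).x == l.x && right(ri).y >= l.y - d) window.enqueue(right(ri))
      ri += 1
    }
    window.foreach(r => out += ((l, r)))
  }
  out.toSeq
}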

Potential use-cases for this are joins based on spatial or temporal
distance calculations.

The optimization is triggered automatically when an equi-join expression
is present AND lower and upper range conditions on a secondary column
are specified. If the tables aren't sorted by both columns, appropriate
sorts will be added.


We have several questions:

1. Do you see any other way to optimize queries like these (eliminate
unnecessary calculations) without changing the sort-merge join algorithm?

2. We believe there is a more general pattern here and that this could
help in other similar situations where secondary sorting is available.
Would you agree?

3. Would you like us to open a JIRA ticket and create a pull request?

Thanks,

Petar Zecevic



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Integrating ML/DL frameworks with Spark

2018-05-15 Thread Bryan Cutler
Thanks for starting this discussion, I'd also like to see some improvements
in this area and glad to hear that the Pandas UDFs / Arrow functionality
might be useful.  I'm wondering if from your initial investigations you
found anything lacking from the Arrow format or possible improvements that
would simplify the data representation?  Also, while data could be handed
off in a UDF, would it make sense to also discuss a more formal way to
externalize the data in a way that would also work for the Scala API?

Thanks,
Bryan

On Wed, May 9, 2018 at 4:31 PM, Xiangrui Meng  wrote:

> Shivaram: Yes, we can call it "gang scheduling" or "barrier
> synchronization". Spark doesn't support it now. The proposal is to have
> proper support in Spark's job scheduler, so we can integrate well with
> MPI-like frameworks.
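>
> To make the model concrete, a rough sketch of what such an API could look
> like (the names below are purely illustrative; nothing like this exists in
> Spark today, and the eventual design may look quite different):
>
> // Hypothetical sketch: request all-or-nothing ("gang") scheduling for a stage and give
> // every task a barrier to synchronize on before starting the MPI-like phase.
> val data = sc.parallelize(0 until 50, numSlices = 50)
>
> data
>   .barrier()                            // hypothetical: run all 50 tasks at once or not at all
>   .mapPartitions { partition =>
>     val ctx = BarrierTaskContext.get()  // hypothetical handle for the barrier stage
>     ctx.barrier()                       // block until every task in the stage has launched
>     // ... hand the partition off to an MPI / xgboost / mxnet worker here ...
>     partition
>   }
>   .count()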
>
>
> On Tue, May 8, 2018 at 11:17 AM Nan Zhu  wrote:
>
>> ...how I skipped the last part
>>
>> On Tue, May 8, 2018 at 11:16 AM, Reynold Xin  wrote:
>>
>>> Yes, Nan, totally agree. To be on the same page, that's exactly what I
>>> wrote wasn't it?
>>>
>>> On Tue, May 8, 2018 at 11:14 AM Nan Zhu  wrote:
>>>
 besides that, one of the things which is needed by multiple frameworks
 is to schedule tasks in a single wave

 i.e.

 if some frameworks like xgboost/mxnet require 50 parallel workers,
 Spark should provide a capability to ensure that either we run 50
 tasks at once, or we quit the complete application/job after some
 timeout period

 Best,

 Nan

 On Tue, May 8, 2018 at 11:10 AM, Reynold Xin 
 wrote:

> I think that's what Xiangrui was referring to. Instead of retrying a
> single task, retry the entire stage, and the entire stage of tasks needs
> to be scheduled all at once.
>
>
> On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>>
>>>
 - Fault tolerance and execution model: Spark assumes fine-grained task
 recovery, i.e. if something fails, only that task is rerun. This doesn't
 match the execution model of distributed ML/DL frameworks that are
 typically MPI-based, and rerunning a single task would lead to the entire
 system hanging. A whole stage needs to be re-run.

>>> This is not only useful for integrating with 3rd-party frameworks,
>>> but also useful for scaling MLlib algorithms. One of my earliest attempts
>>> in Spark MLlib was to implement an All-Reduce primitive (SPARK-1485).
>>> But we ended up with some compromised solutions. With the new execution
>>> model, we can set up a hybrid cluster and do all-reduce properly.
>>>
>>>
>> Is there a particular new execution model you are referring to, or do we
>> plan to investigate a new execution model? For the MPI-like model, we also
>> need gang scheduling (i.e. schedule all tasks at once or none of them),
>> and I don't think we have support for that in the scheduler right now.
>>
>>>
 --
>>>
>>> Xiangrui Meng
>>>
>>> Software Engineer
>>>
>>> Databricks Inc. (http://databricks.com)
>>>
>>
>>

>> --
>
> Xiangrui Meng
>
> Software Engineer
>
> Databricks Inc. (http://databricks.com)
>