[
https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiao Li updated SPARK-24288:
----------------------------
Description:
Issue discussed on the dev mailing list:
[http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html]
While working with the JDBC data source I saw that many "or" clauses with
non-equality operators cause huge performance degradation of the SQL query
sent to the database (DB2). For example:
val df = spark.read
  .format("jdbc")
  // ... other options to parallelize the load ...
  .load()
df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > 100)").show()
// In the real application the pushed predicate spanned many more lines, with many ANDs and ORs.
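Whether the predicate actually reached the database can be checked in the
physical plan; a minimal sketch, reusing the placeholder columns and
parameters from above:

// Sketch: with pushdown active, the JDBC scan node in the plan lists
// the translated predicates under "PushedFilters: [...]".
df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > 100)")
  .explain()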
If I use cache() before the where, this "where" clause is not pushed down.
However, in a production system caching many sources is a waste of memory
(especially if the pipeline is long and I must cache many times). There are
also a few more workarounds, but it would be great if Spark supported letting
the user prevent predicate pushdown.
For example: df.withAnalysisBarrier().where(...) ?
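For comparison, the cache()-based workaround mentioned above looks roughly
like this; a minimal sketch, assuming the same df as before:

// Workaround sketch: caching materializes the source inside Spark, so the
// subsequent filter runs over the cached data instead of being pushed into
// the generated JDBC query. The price is memory for the whole table.
val cached = df.cache()
cached.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > 100)").show()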
Note that this should not be a global configuration option. If I read two
DataFrames, df1 and df2, I would like to specify that some predicates on df1
must not be pushed down (while others may be), whereas df2 should have all
predicates pushed down, even if the target query joins df1 and df2. As far as
I understand the Spark optimizer, if we used a function like
`withAnalysisBarrier` to place an AnalysisBarrier explicitly in the logical
plan, predicates would not be pushed down into that particular DataFrame,
while pushdown would still be possible on the second one.
Update: The solution is to add a JDBC option "pushDownPredicate" (default:
true) to allow/disallow predicate push-down in the JDBC data source.
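A minimal sketch of how that option keeps the behavior per-source rather than
global; the connection details and the join key are placeholders:

// df1: predicate pushdown disabled for this JDBC source only (Spark 2.4+).
val df1 = spark.read
  .format("jdbc")
  .option("url", "jdbc:db2://host:50000/SAMPLE")  // placeholder URL
  .option("dbtable", "SCHEMA.TABLE1")             // placeholder table
  .option("pushDownPredicate", "false")           // filters stay in Spark
  .load()

// df2: default behavior, predicates are still pushed to the database.
val df2 = spark.read
  .format("jdbc")
  .option("url", "jdbc:db2://host:50000/SAMPLE")
  .option("dbtable", "SCHEMA.TABLE2")
  .load()

// Even when the two sides are joined, each keeps its own pushdown setting:
// filters on df2 can still be pushed to DB2, filters on df1 cannot.
df1.join(df2, "id").where("x > 100").show()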
> Enable preventing predicate pushdown
> ------------------------------------
>
> Key: SPARK-24288
> URL: https://issues.apache.org/jira/browse/SPARK-24288
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Tomasz Gawęda
> Assignee: Maryann Xue
> Priority: Major
> Fix For: 2.4.0
>
> Attachments: SPARK-24288.simple.patch