[ 
https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24288:
----------------------------
    Description: 
Issue discussed on Mailing List: 
[http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html]

While working with JDBC datasource I saw that many "or" clauses with 
non-equality operators causes huge performance degradation of SQL query 
to database (DB2). For example: 

val df = spark.read.format("jdbc").(other options to parallelize 
load).load() 

df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x 
 > 100)").show() // in real application whose predicates were pushed 
many many lines below, many ANDs and ORs 

If I use cache() before where, there is no predicate pushdown of this 
"where" clause. However, in production system caching many sources is a 
waste of memory (especially is pipeline is long and I must do cache many 
times).There are also few more workarounds, but it would be great if Spark will 
support preventing predicate pushdown by user.

 

For example: df.withAnalysisBarrier().where(...) ?

 

Note, that this should not be a global configuration option. If I read 2 
DataFrames, df1 and df2, I would like to specify that df1 should not have some 
predicates pushed down, but some may be, but df2 should have all predicates 
pushed down, even if target query joins df1 and df2. As far as I understand 
Spark optimizer, if we use functions like `withAnalysisBarrier` and put 
AnalysisBarrier explicitly in logical plan, then predicates won't be pushed 
down on this particular DataFrames and PP will be still possible on the second 
one.


Update: The solution is to add a JDBC Option "pushDownPredicate" (default true) 
to allow/disallow predicate push-down in JDBC data source.


  was:
Issue discussed on Mailing List: 
[http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html]

While working with JDBC datasource I saw that many "or" clauses with 
non-equality operators causes huge performance degradation of SQL query 
to database (DB2). For example: 

val df = spark.read.format("jdbc").(other options to parallelize 
load).load() 

df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x 
 > 100)").show() // in real application whose predicates were pushed 
many many lines below, many ANDs and ORs 

If I use cache() before where, there is no predicate pushdown of this 
"where" clause. However, in production system caching many sources is a 
waste of memory (especially is pipeline is long and I must do cache many 
times).There are also few more workarounds, but it would be great if Spark will 
support preventing predicate pushdown by user.

 

For example: df.withAnalysisBarrier().where(...) ?

 

Note, that this should not be a global configuration option. If I read 2 
DataFrames, df1 and df2, I would like to specify that df1 should not have some 
predicates pushed down, but some may be, but df2 should have all predicates 
pushed down, even if target query joins df1 and df2. As far as I understand 
Spark optimizer, if we use functions like `withAnalysisBarrier` and put 
AnalysisBarrier explicitly in logical plan, then predicates won't be pushed 
down on this particular DataFrames and PP will be still possible on the second 
one.


> Enable preventing predicate pushdown
> ------------------------------------
>
>                 Key: SPARK-24288
>                 URL: https://issues.apache.org/jira/browse/SPARK-24288
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Tomasz Gawęda
>            Assignee: Maryann Xue
>            Priority: Major
>             Fix For: 2.4.0
>
>         Attachments: SPARK-24288.simple.patch
>
>
> Issue discussed on Mailing List: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html]
> While working with JDBC datasource I saw that many "or" clauses with 
> non-equality operators causes huge performance degradation of SQL query 
> to database (DB2). For example: 
> val df = spark.read.format("jdbc").(other options to parallelize 
> load).load() 
> df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x 
>  > 100)").show() // in real application whose predicates were pushed 
> many many lines below, many ANDs and ORs 
> If I use cache() before where, there is no predicate pushdown of this 
> "where" clause. However, in production system caching many sources is a 
> waste of memory (especially is pipeline is long and I must do cache many 
> times).There are also few more workarounds, but it would be great if Spark 
> will support preventing predicate pushdown by user.
>  
> For example: df.withAnalysisBarrier().where(...) ?
>  
> Note, that this should not be a global configuration option. If I read 2 
> DataFrames, df1 and df2, I would like to specify that df1 should not have 
> some predicates pushed down, but some may be, but df2 should have all 
> predicates pushed down, even if target query joins df1 and df2. As far as I 
> understand Spark optimizer, if we use functions like `withAnalysisBarrier` 
> and put AnalysisBarrier explicitly in logical plan, then predicates won't be 
> pushed down on this particular DataFrames and PP will be still possible on 
> the second one.
> Update: The solution is to add a JDBC Option "pushDownPredicate" (default 
> true) to allow/disallow predicate push-down in JDBC data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to