[ https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16476900#comment-16476900 ]

Maryann Xue commented on SPARK-24288:
-------------------------------------

After taking a closer look at the problem, I realized that we should probably 
define its scope first. The specific issue [~TomaszGaweda] encountered is 
predicates being pushed down to the data source; the more general topic is 
isolating plan optimization/transformation between operators, somewhat like 
the effect {{cache()}} achieves.

1) For the first, specific purpose, i.e., avoiding predicate push-down to the 
data source, we could add an option when initiating the {{DataSource}}, e.g., 
by calling {{spark.read.option()}} or {{spark.read.jdbc()}} with options, etc. 
This might not cover all cases of DataSource creation, but it would be a 
workable solution for users.
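
For illustration, a rough sketch of what the read-side option could look like; 
the option name {{pushDownPredicate}} below is just a placeholder I made up, 
not an existing API:

{code:scala}
// Sketch only: "pushDownPredicate" is a hypothetical option name, not an
// existing API. The idea is to let users opt a single source out of push-down.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:db2://host:50000/sample") // hypothetical connection details
  .option("dbtable", "events")
  .option("pushDownPredicate", "false")          // keep filters in Spark, not in the DB
  .load()
{code}

The nice property is that the scope is exactly one source, so other DataFrames 
in the same query would be unaffected.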

2) For the more general purpose (i.e., being able to stop push-down at any 
level of the plan), we could add a special logical operator (let's call it 
{{OptimizerBarrier}}) that serves as a barrier node during optimization. We 
could then simply drop this node at the time of physical plan transformation.
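
A minimal sketch of that idea (the names are mine; this is not committed 
code): since optimizer rules such as {{PushDownPredicate}} pattern-match on 
concrete child node types, an unrecognized pass-through node in between stops 
the rewrite, and the planner can elide it when producing the physical plan:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UnaryNode}

// Sketch only: a pass-through node exposing the same output as its child.
// Rules like PushDownPredicate match on Filter(_, <known child shape>), so a
// Filter sitting above OptimizerBarrier is left where it is.
case class OptimizerBarrier(child: LogicalPlan) extends UnaryNode {
  override def output: Seq[Attribute] = child.output
}

// At physical planning time the barrier would simply be dropped, e.g. in a
// strategy:  case OptimizerBarrier(child) => planLater(child) :: Nil
{code}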

I personally prefer the second solution, since it serves the more general 
purpose and may require fewer code changes (I did a quick experiment and it 
worked). 
Please let me know your thoughts, [~smilegator], [~TomaszGaweda].

> Enable preventing predicate pushdown
> ------------------------------------
>
>                 Key: SPARK-24288
>                 URL: https://issues.apache.org/jira/browse/SPARK-24288
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Tomasz Gawęda
>            Priority: Major
>
> Issue discussed on Mailing List: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html]
> While working with the JDBC data source I saw that many "or" clauses with 
> non-equality operators cause huge performance degradation of the SQL query 
> sent to the database (DB2). For example: 
> val df = spark.read.format("jdbc") /* other options to parallelize load */ .load() 
> df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > 100)").show() 
> // in the real application the pushed predicates continued many lines below, with many ANDs and ORs 
> If I use cache() before the where, there is no predicate pushdown of this 
> "where" clause. However, in a production system caching many sources is a 
> waste of memory (especially if the pipeline is long and I must cache many 
> times). There are also a few more workarounds, but it would be great if 
> Spark supported letting the user prevent predicate pushdown. 
>  
> For example: df.withAnalysisBarrier().where(...) ?
>  
> Note that this should not be a global configuration option. If I read 2 
> DataFrames, df1 and df2, I would like to specify that df1 should not have 
> certain predicates pushed down (while others may be), but df2 should have 
> all predicates pushed down, even if the target query joins df1 and df2. As 
> far as I understand the Spark optimizer, if we use functions like 
> `withAnalysisBarrier` and put an AnalysisBarrier explicitly in the logical 
> plan, then predicates won't be pushed down on that particular DataFrame, 
> while predicate pushdown will still be possible on the other one.


