[ https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16476900#comment-16476900 ]
Maryann Xue commented on SPARK-24288:
-------------------------------------

After taking a closer look at the problem, I realized that we should probably define the scope of the problem first. The issue [~TomaszGaweda] encountered is specifically predicates being pushed down to the data source, while a more general topic is the isolation of plan optimization/transformation between operators, somewhat like the effect {{cache()}} achieves.

1) For the first, specific purpose, i.e., avoiding predicate push-down to the data source, we could support an option when initiating the {{DataSource}}, e.g., by calling {{spark.read.option()}} or {{spark.jdbc()}} with options, etc. This might not cover all cases of DataSource creation, but it would be a workable solution for users.

2) For the more general purpose (i.e., stopping push-down at any level of the plan), we could add a special logical operator (let's call it {{OptimizerBarrier}} maybe) that serves as a barrier node when performing optimizations. We could then simply discard this node at the time of physical plan transformation.

I personally prefer the second solution: it serves a more general purpose and may require fewer code changes (I did a quick experiment and it worked). Please let me know your thoughts, [~smilegator], [~TomaszGaweda].

> Enable preventing predicate pushdown
> ------------------------------------
>
> Key: SPARK-24288
> URL: https://issues.apache.org/jira/browse/SPARK-24288
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Tomasz Gawęda
> Priority: Major
>
> Issue discussed on the mailing list:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html
>
> While working with the JDBC data source, I saw that many "or" clauses with
> non-equality operators cause huge performance degradation of the SQL query
> sent to the database (DB2).
> For example:
>
> val df = spark.read.format("jdbc")
>   .(other options to parallelize the load)
>   .load()
> df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > 100)").show()
> // in the real application the pushed-down predicates spanned many
> // lines, with many ANDs and ORs
>
> If I use cache() before the where, there is no predicate pushdown of this
> "where" clause. However, in a production system caching many sources is a
> waste of memory (especially if the pipeline is long and I must cache many
> times). There are also a few more workarounds, but it would be great if Spark
> supported preventing predicate pushdown at the user's request.
>
> For example: df.withAnalysisBarrier().where(...) ?
>
> Note that this should not be a global configuration option. If I read 2
> DataFrames, df1 and df2, I would like to specify that df1 should not have
> certain predicates pushed down (while others may be), but df2 should have all
> predicates pushed down, even if the target query joins df1 and df2. As far as I
> understand the Spark optimizer, if we use a function like `withAnalysisBarrier`
> to put an AnalysisBarrier explicitly in the logical plan, then predicates won't
> be pushed down to that particular DataFrame, while predicate pushdown will
> still be possible on the second one.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
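The barrier idea in proposal 2 can be illustrated with a toy model. This is not Spark's actual Catalyst code (which is Scala); it is a minimal sketch in plain Python, and all names here ({{Source}}, {{Filter}}, {{Barrier}}, {{push_down_filters}}, {{drop_barriers}}) are made up for illustration. The point it demonstrates: a pushdown rule merges a predicate into the source only when the source is its direct child, so a barrier node sitting between them blocks the pushdown; the barrier itself is discarded at "physical planning" time, after optimization.

```python
# Toy model of an OptimizerBarrier: illustrative only, not Spark internals.
from dataclasses import dataclass


@dataclass
class Source:
    """Leaf node: a data source that can accept pushed-down predicates."""
    pushed: tuple = ()


@dataclass
class Filter:
    """A predicate applied on top of a child plan."""
    predicate: str
    child: object


@dataclass
class Barrier:
    """Blocks optimization across it; dropped at physical planning time."""
    child: object


def push_down_filters(plan):
    """Optimizer rule: push Filter predicates into Source nodes.

    A predicate is merged into a Source only when the Source is the
    Filter's direct child, so a Barrier in between stops the pushdown.
    """
    if isinstance(plan, Filter):
        child = push_down_filters(plan.child)
        if isinstance(child, Source):
            return Source(child.pushed + (plan.predicate,))
        return Filter(plan.predicate, child)
    if isinstance(plan, Barrier):
        # Children below the barrier may still be optimized among
        # themselves; nothing crosses the barrier from above.
        return Barrier(push_down_filters(plan.child))
    return plan


def drop_barriers(plan):
    """Physical planning step: the barrier node is simply discarded."""
    if isinstance(plan, Filter):
        return Filter(plan.predicate, drop_barriers(plan.child))
    if isinstance(plan, Barrier):
        return drop_barriers(plan.child)
    return plan


# Without a barrier, the predicate reaches the source:
pushed = drop_barriers(push_down_filters(Filter("x > 100", Source())))
# With a barrier, the Filter stays above the source after the
# barrier is dropped, i.e., Spark (not the database) evaluates it:
blocked = drop_barriers(push_down_filters(Filter("x > 100", Barrier(Source()))))
```

Because the barrier is an explicit node in one DataFrame's plan rather than a global flag, this also gives the per-DataFrame behavior requested in the issue: a join of a barriered df1 with a plain df2 would block pushdown only on the df1 side.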