Hi, while working with the JDBC datasource I noticed that pushing down many "or" clauses with non-equality operators causes a huge performance degradation of the SQL query sent to the database (DB2). For example:
val df = spark.read.format("jdbc").(other options to parallelize load).load()
df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > 100)").show()
// in the real application the pushed predicates were many lines long, with many ANDs and ORs

If I use cache() before the where, the predicate of this "where" clause is not pushed down. However, in a production system caching many sources is a waste of memory (especially if the pipeline is long and I must cache many times).

I asked on StackOverflow for better ideas: https://stackoverflow.com/questions/50336355/how-to-prevent-predicate-pushdown However, there are only workarounds there. I can use those workarounds, but maybe it would be better to add such functionality directly to the API? For example:

df.withAnalysisBarrier().where(...) ?

Please let me know if I should create a JIRA, or whether it's not a good idea for some reason.

Pozdrawiam / Best regards,
Tomek Gawęda
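For reference, the kind of workaround discussed in the StackOverflow thread can be sketched as below. This is only a sketch, not a proposed API: the connection options and filter literals are hypothetical, and it assumes an active SparkSession named spark. Round-tripping through the RDD API starts a fresh logical plan, so Catalyst cannot push the later filter into the JDBC scan:

// Hypothetical JDBC options; assumes `spark` is an active SparkSession.
val df = spark.read.format("jdbc")
  .option("url", "jdbc:db2://host:50000/MYDB")
  .option("dbtable", "MYSCHEMA.MYTABLE")
  .load()

// createDataFrame(rdd, schema) breaks the plan lineage, acting as a barrier,
// so the filter below is evaluated by Spark rather than pushed to DB2.
val barrier = spark.createDataFrame(df.rdd, df.schema)

barrier.where("(date1 > '2018-01-01' and (date1 < '2018-06-01' or date1 is null)) or x > 100")
  .show()

The cost is that the full table scan result crosses the JDBC boundary, which is exactly why a first-class barrier in the API would be nicer than this trick.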