[ https://issues.apache.org/jira/browse/SPARK-25306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-25306:
------------------------------------

    Assignee: Apache Spark

> Use cache to speed up `createFilter` in ORC
> -------------------------------------------
>
>                 Key: SPARK-25306
>                 URL: https://issues.apache.org/jira/browse/SPARK-25306
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: SQL
>            Reporter: Dongjoon Hyun
>            Assignee: Apache Spark
>            Priority: Critical
>
> In the ORC data source, the `createFilter` function has exponential time complexity due to a lack of memoization, as shown below. This issue aims to improve it.
>
> *REPRODUCE*
> {code}
> // Create and read a 1-row table with 1,000 columns
> sql("set spark.sql.orc.filterPushdown=true")
> val selectExpr = (1 to 1000).map(i => s"id c$i")
> spark.range(1).selectExpr(selectExpr: _*).write.mode("overwrite").orc("/tmp/orc")
> print(s"With 0 filters, ")
> spark.time(spark.read.orc("/tmp/orc").count)
>
> // Increase the number of filters
> (20 to 30).foreach { width =>
>   val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ")
>   print(s"With $width filters, ")
>   spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count)
> }
> {code}
>
> *RESULT*
> {code}
> With 0 filters, Time taken: 653 ms
>
> With 20 filters, Time taken: 962 ms
> With 21 filters, Time taken: 1282 ms
> With 22 filters, Time taken: 1982 ms
> With 23 filters, Time taken: 3855 ms
> With 24 filters, Time taken: 6719 ms
> With 25 filters, Time taken: 12669 ms
> With 26 filters, Time taken: 25032 ms
> With 27 filters, Time taken: 49585 ms
> With 28 filters, Time taken: 98980 ms  // over 1 min 38 sec
> With 29 filters, Time taken: 198368 ms // over 3 min
> With 30 filters, Time taken: 393744 ms // over 6 min
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
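The issue title proposes fixing the exponential blow-up with a cache. As a minimal, hypothetical sketch (this is not Spark's actual `createFilter` or `Filter` API; the `Pred` tree and counters below are invented for illustration), memoizing a recursive conversion on structurally equal subtrees turns repeated work into a single lookup:

```scala
import scala.collection.mutable

object CreateFilterDemo {
  // Hypothetical predicate tree standing in for Spark's pushed-down filters.
  sealed trait Pred
  final case class IsNotNull(col: String) extends Pred
  final case class And(left: Pred, right: Pred) extends Pred

  var naiveCalls = 0 // counts every recursive visit
  var memoCalls  = 0 // counts only cache misses

  // Naive recursive conversion: a shared subtree is re-converted on every
  // visit, so n levels of sharing cost O(2^n) calls.
  def convertNaive(p: Pred): Int = {
    naiveCalls += 1
    p match {
      case IsNotNull(_) => 1
      case And(l, r)    => convertNaive(l) + convertNaive(r)
    }
  }

  // Memoized conversion: each structurally distinct subtree is converted
  // exactly once; later visits hit the cache.
  def convertMemo(
      p: Pred,
      cache: mutable.HashMap[Pred, Int] = mutable.HashMap.empty): Int =
    cache.get(p) match {
      case Some(v) => v
      case None =>
        memoCalls += 1
        val v = p match {
          case IsNotNull(_) => 1
          case And(l, r)    => convertMemo(l, cache) + convertMemo(r, cache)
        }
        cache.put(p, v)
        v
    }

  // A pathological input: And(t, t) nested n levels deep.
  def deep(n: Int): Pred =
    if (n == 0) IsNotNull("c0")
    else { val t = deep(n - 1); And(t, t) }
}
```

On `deep(10)` the naive version makes 2^11 - 1 = 2047 recursive calls while the memoized version converts only the 11 distinct subtrees, which mirrors the roughly-doubling times per added filter in the RESULT table above.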