[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin resolved SPARK-9814.
--------------------------------
       Resolution: Fixed
         Assignee: Hyukjin Kwon
    Fix Version/s: 1.5.0

> EqualNotNull not passing to data sources
> ----------------------------------------
>
>                 Key: SPARK-9814
>                 URL: https://issues.apache.org/jira/browse/SPARK-9814
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Minor
>             Fix For: 1.5.0
>
>
> When data sources (such as Parquet) try to filter data while reading from
> HDFS (not in memory), the physical planning phase passes the filter objects in
> {{org.apache.spark.sql.sources}}, which are appropriately built and picked up
> by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}.
> On the other hand, it does not pass the {{EqualNullSafe}} filter in
> {{org.apache.spark.sql.catalyst.expressions}}, even though it seems possible
> to pass it to data sources such as Parquet and JSON. In more detail, it
> does not pass {{EqualNullSafe}} to {{buildScan()}} (below) in
> {{PrunedFilteredScan}} and {{PrunedScan}},
> {code}
> def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
> {code}
> even though the binary capability issue is solved
> (https://issues.apache.org/jira/browse/SPARK-8747).
> I understand that {{CatalystScan}} can take all the raw expressions by
> accessing the query planner. However, it is experimental, it needs a different
> interface, and it is unstable (for reasons such as binary compatibility).
> In general, the problem below can happen.
> 1.
> {code:sql}
> SELECT * FROM table WHERE field = 1;
> {code}
>
> 2.
> {code:sql}
> SELECT * FROM table WHERE field <=> 1;
> {code}
> The second query can be hugely slow, although the functionality is almost
> identical, because of the potentially large network traffic (etc.) caused by
> unfiltered data coming from the source RDD.
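For illustration, the difference between the two queries above can be sketched without Spark. This is a minimal sketch, not the actual fix: the {{Filter}}, {{EqualTo}} and {{EqualNullSafe}} classes here are local stand-ins named after the ones in {{org.apache.spark.sql.sources}}, and {{PushdownSketch}}, {{matches}}, {{scan}} and {{table}} are hypothetical names introduced only for this example. It assumes a data source that applies whatever filters it is handed before returning rows.

```scala
// Local stand-ins for the filter classes in org.apache.spark.sql.sources,
// defined here so the sketch runs without Spark on the classpath.
sealed trait Filter
final case class EqualTo(attribute: String, value: Any) extends Filter
final case class EqualNullSafe(attribute: String, value: Any) extends Filter

object PushdownSketch {
  // A row is modelled as column name -> value; null stands for SQL NULL.
  type Row = Map[String, Any]

  // True only when the filter definitely holds for the row.
  def matches(row: Row, filter: Filter): Boolean = filter match {
    case EqualTo(attr, value) =>
      val v = row.getOrElse(attr, null)
      // `=`: a NULL on either side yields NULL, which a WHERE clause drops.
      v != null && value != null && v == value
    case EqualNullSafe(attr, value) =>
      val v = row.getOrElse(attr, null)
      // `<=>`: NULL <=> NULL is true, NULL <=> non-NULL is false.
      if (v == null || value == null) v == null && value == null
      else v == value
  }

  // Hypothetical scan: the source applies the pushed filters itself, so
  // only matching rows leave the source.
  def scan(rows: Seq[Row], pushed: Seq[Filter]): Seq[Row] =
    rows.filter(row => pushed.forall(f => matches(row, f)))

  // Hypothetical sample data: one matching row, one non-matching, one NULL.
  val table: Seq[Row] = Seq(
    Map("field" -> 1),
    Map("field" -> 2),
    Map("field" -> null)
  )
}
```

With these definitions, {{scan(table, Seq(EqualNullSafe("field", 1)))}} returns only the matching row, the same row that {{EqualTo("field", 1)}} selects, whereas {{scan(table, Nil)}} (no filter handed to the source, as described for {{<=>}} above) returns every row and leaves the filtering to the caller. The two filters differ only when the compared value is NULL.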