[ https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998101#comment-14998101 ]
Hyukjin Kwon edited comment on SPARK-10978 at 11/10/15 7:34 AM:
----------------------------------------------------------------

I think we should add another interface such as {{def partiallyHandledFilters}}. For example, in the case of an ORC file, it does not filter record by record but returns rough results, so the Spark-side filter should still be applied. I manually added some code to {{def unhandledFilters}} for the Parquet and ORC datasources, and I could reproduce the wrong results for ORC files.

I had been working on this because I thought ORC filters were not pushed down, but it looks like no filters are pushed down for any of the datasources, and I guess [~lian cheng] is working on this. Could I try to add this if it is an issue and if you are not already doing it? Thinking it was a bug, I unintentionally opened a separate issue for this: https://issues.apache.org/jira/browse/SPARK-11621
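To illustrate the point about ORC above: a minimal sketch, not Spark's actual API, of why a coarse-grained pushdown (e.g. ORC stripe-level min/max filtering) still needs a Spark-side re-filter. All types and helper names here ({{Row}}, {{coarseScan}}, {{exactFilter}}) are hypothetical stand-ins.

```scala
// Hypothetical row type; real ORC readers work on stripes with min/max stats.
case class Row(id: Int, value: Int)

// Coarse pushdown: keep any stripe whose max could satisfy `value >= min`.
// Non-matching rows inside a kept stripe leak through (false positives).
def coarseScan(stripes: Seq[Seq[Row]], min: Int): Seq[Row] =
  stripes.filter(stripe => stripe.map(_.value).max >= min).flatten

// Spark-side exact re-application of the same predicate.
def exactFilter(rows: Seq[Row], min: Int): Seq[Row] =
  rows.filter(_.value >= min)

val stripes = Seq(
  Seq(Row(1, 5), Row(2, 50)), // max = 50: stripe kept, Row(1, 5) leaks through
  Seq(Row(3, 1), Row(4, 2))   // max = 2: stripe skipped entirely
)

val coarse = coarseScan(stripes, min = 10) // still contains Row(1, 5)
val exact  = exactFilter(coarse, min = 10) // only Row(2, 50)
```

If a source like this reported the filter as fully handled (i.e. excluded it from {{unhandledFilters}}), Spark would skip the re-filter and return the leaked rows, which is exactly the wrong-result scenario described above.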
https://issues.apache.org/jira/browse/SPARK-10978

> Allow PrunedFilterScan to eliminate predicates from further evaluation
> ----------------------------------------------------------------------
>
>                 Key: SPARK-10978
>                 URL: https://issues.apache.org/jira/browse/SPARK-10978
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 1.3.0, 1.4.0, 1.5.0
>            Reporter: Russell Alexander Spitzer
>            Assignee: Cheng Lian
>            Priority: Critical
>             Fix For: 1.6.0
>
> Currently PrunedFilterScan allows implementors to push down predicates to an underlying datasource. This is done solely as an optimization, as the predicate will be reapplied on the Spark side as well. This allows for bloom-filter-like operations but ends up doing a redundant scan for those sources which can do accurate pushdowns.
> In addition, it makes it difficult for underlying sources to accept queries which reference non-existent columns to provide ancillary functions. In our case we allow a Solr query to be passed in via a non-existent solr_query column. Since this column is not returned when Spark does a filter on "solr_query", nothing passes.
> Suggestion on the ML from [~marmbrus]:
> {quote}
> We have to try and maintain binary compatibility here, so probably the easiest thing to do here would be to add a method to the class. Perhaps something like:
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
> By default, this could return all filters so behavior would remain the same, but specific implementations could override it. There is still a chance that this would conflict with existing methods, but hopefully that would not be a problem in practice.
> {quote}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
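The quoted proposal can be sketched as follows. This is an assumption-laden illustration, not Spark's real code: the {{Filter}} hierarchy and the relation trait are stand-ins (the real ones live in {{org.apache.spark.sql.sources}}), but the default-method idea is exactly the one quoted from [~marmbrus].

```scala
// Stand-in filter hierarchy; Spark's real one is in org.apache.spark.sql.sources.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class GreaterThan(attribute: String, value: Any) extends Filter

trait BaseRelationSketch {
  // Proposed default: report every filter as unhandled, so Spark re-evaluates
  // all of them and existing sources keep their current (safe) behavior while
  // staying binary compatible.
  def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
}

// A source that evaluates equality exactly hands only the rest back to Spark.
object ExactEqualityRelation extends BaseRelationSketch {
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(_.isInstanceOf[EqualTo])
}

val filters: Array[Filter] =
  Array(EqualTo("solr_query", "*:*"), GreaterThan("x", 1))

// Spark would re-apply only the filters the relation reports as unhandled.
val remaining = ExactEqualityRelation.unhandledFilters(filters)
```

This shape also addresses the solr_query case from the description: a relation could report the {{EqualTo("solr_query", ...)}} filter as handled, so Spark would not re-apply it against a column that never appears in the returned rows.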