[ 
https://issues.apache.org/jira/browse/FLINK-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880231#comment-15880231
 ] 

Fabian Hueske commented on FLINK-5859:
--------------------------------------

Hi [~ykt836],

My main motivation to treat partition pruning as filter push-down is to keep 
the complexity of the optimizer as small as possible.

You are right, the effort to determine whether a filter can be applied or not 
depends on the format of the source. However, I don't think that this 
necessarily means that partition pruning must be handled as a special case. In 
the end it depends on the TableSource how it determines which predicates apply 
and which don't. A partitionable table source would not need to scan all 
metadata. 

I see your point about the effort and complexity to implement a partitionable 
TableSource. 
What do you think of the following approach?
We implement a {{PartitionableTableSource}} as an abstract class that 
implements the {{FilterableTableSource}} interface. 
{{PartitionableTableSource}} would have abstract methods to list the 
partitioned fields (and maybe some more). Based on that information 
{{PartitionableTableSource}} implements 
{{FilterableTableSource.setPredicate()}} and 
{{FilterableTableSource.getPredicate()}}, i.e., the 
{{PartitionableTableSource}} automatically extracts the right filter 
expressions and returns everything it cannot deal with based on the provided 
partitioned fields.
TableSources that only support filter push-down via partition pruning extend 
{{PartitionableTableSource}}; they only have to specify the partition columns 
and do not have to deal with {{setPredicate()}}.
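
To make the proposal more concrete, here is a rough, self-contained sketch in 
plain Java. Everything in it is an assumption for illustration and not Flink 
code: {{Predicate}}, the method signatures of the {{FilterableTableSource}} 
stand-in, {{getPartitionFields()}}, {{getAllPartitions()}}, 
{{getPrunedPartitions()}} and {{DailyLogSource}} are hypothetical names. It 
also assumes that {{setPredicate()}} returns the conjuncts the source cannot 
handle, that {{getPredicate()}} returns the accepted ones, and that only 
equality predicates on partition fields are used for pruning.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one conjunct of a WHERE clause, e.g. dt = '2017-02-21'. */
final class Predicate {
    final String field;
    final String op;
    final String value;

    Predicate(String field, String op, String value) {
        this.field = field;
        this.op = op;
        this.value = value;
    }
}

/** Simplified stand-in for the interface discussed above; signatures are assumed. */
interface FilterableTableSource {
    /** Offers conjuncts to the source and returns the ones it cannot evaluate. */
    List<Predicate> setPredicate(List<Predicate> predicates);

    /** Returns the conjuncts the source accepted. */
    List<Predicate> getPredicate();
}

/** Base class that implements predicate handling from the partition fields alone. */
abstract class PartitionableTableSource implements FilterableTableSource {
    private final List<Predicate> accepted = new ArrayList<>();

    /** Names of the fields the table is partitioned by. */
    abstract List<String> getPartitionFields();

    /** All partitions, each described as a map from partition field to value. */
    abstract List<Map<String, String>> getAllPartitions();

    @Override
    public List<Predicate> setPredicate(List<Predicate> predicates) {
        accepted.clear();
        List<Predicate> remaining = new ArrayList<>();
        for (Predicate p : predicates) {
            // Keep only equality predicates on partition fields; hand back the rest.
            boolean prunable = "=".equals(p.op) && getPartitionFields().contains(p.field);
            (prunable ? accepted : remaining).add(p);
        }
        return remaining;
    }

    @Override
    public List<Predicate> getPredicate() {
        return Collections.unmodifiableList(accepted);
    }

    /** Partitions that satisfy every accepted predicate. */
    List<Map<String, String>> getPrunedPartitions() {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> partition : getAllPartitions()) {
            boolean keep = true;
            for (Predicate p : accepted) {
                keep &= p.value.equals(partition.get(p.field));
            }
            if (keep) {
                result.add(partition);
            }
        }
        return result;
    }
}

/** Example source: it only declares its partition layout, no predicate logic. */
final class DailyLogSource extends PartitionableTableSource {
    @Override
    List<String> getPartitionFields() {
        return Arrays.asList("dt");
    }

    @Override
    List<Map<String, String>> getAllPartitions() {
        return Arrays.asList(
            Collections.singletonMap("dt", "2017-02-20"),
            Collections.singletonMap("dt", "2017-02-21"));
    }
}
```

If such a source is offered the conjuncts dt = '2017-02-21' and 
level = 'ERROR', the base class keeps the first one, hands the second one back 
to the caller, and prunes the partition list to the single matching partition. 
The concrete source never touches {{setPredicate()}}.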

This solution would keep all partition pruning related logic out of the 
optimizer and table schemas. 

What do you think?

> support partition pruning on Table API & SQL
> --------------------------------------------
>
>                 Key: FLINK-5859
>                 URL: https://issues.apache.org/jira/browse/FLINK-5859
>             Project: Flink
>          Issue Type: New Feature
>          Components: Table API & SQL
>            Reporter: godfrey he
>            Assignee: godfrey he
>
> Many data sources are partitioned storage systems, e.g., HDFS and Druid, and 
> many queries only need to read a small subset of the total data. We can use 
> partition information to prune or skip files irrelevant to the user’s 
> queries. Both query optimization time and execution time can be reduced 
> significantly, especially for large partitioned tables.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
