[https://issues.apache.org/jira/browse/FLINK-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879991#comment-15879991]
Kurt Young commented on FLINK-5859:
-----------------------------------
Hi [~fhueske],
You raised a very good question, which is essentially "what's the difference
between filter push-down and partition pruning".
You are right that partition pruning is essentially a coarse-grained filter
push-down, but more importantly, we can view it as a more "static" or more
"predictable" filter. Here is an example to make this more explicit. Suppose
Flink supports a Parquet table source. Since Parquet files contain some
RowGroup-level statistics such as max/min values, we can use this information
to reduce the amount of data we need to read. But before we do anything, we
need to check whether the source files contain such statistics at all, which
means reading the metadata of every file. If we are facing thousands of files,
that will be really costly.
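To make the statistics-based case concrete, here is a minimal sketch of pruning with per-row-group max/min values. The {{RowGroupStats}} class and the predicate shape ("col > threshold") are illustrative stand-ins, not the real Parquet metadata API:

```java
import java.util.ArrayList;
import java.util.List;

public class RowGroupPruning {
    // Hypothetical stand-in for one row group's column statistics.
    static class RowGroupStats {
        final long min;
        final long max;
        RowGroupStats(long min, long max) { this.min = min; this.max = max; }
    }

    // For the predicate "col > threshold", a row group can be skipped
    // when even its maximum value fails the predicate.
    static boolean canSkip(RowGroupStats stats, long threshold) {
        return stats.max <= threshold;
    }

    // Keep only the row groups that might contain matching rows.
    static List<RowGroupStats> prune(List<RowGroupStats> groups, long threshold) {
        List<RowGroupStats> toRead = new ArrayList<>();
        for (RowGroupStats g : groups) {
            if (!canSkip(g, threshold)) {
                toRead.add(g);
            }
        }
        return toRead;
    }

    public static void main(String[] args) {
        List<RowGroupStats> groups = new ArrayList<>();
        groups.add(new RowGroupStats(0, 50));    // skipped for threshold 100
        groups.add(new RowGroupStats(80, 120));  // may contain matches
        groups.add(new RowGroupStats(150, 200)); // may contain matches
        System.out.println(prune(groups, 100).size()); // prints 2
    }
}
```

Note that this only works after the metadata of every file has been read, which is exactly the cost described above.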
The partitions, however, are more static and predictable. For example, if all
your source files are organized into time-based directories like
YYYY/mm/dd/1.file, and we have partition fields describing that time
information, it will be more efficient and easier to do the partition-level
filtering first.
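For contrast, here is a sketch of the partition-level filter: only directory names are inspected, no file is opened and no metadata is read. The paths and the single-day predicate are illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionPruning {
    // Extract the "YYYY/mm/dd" partition prefix from a path like "2017/02/23/1.file".
    static String partitionOf(String path) {
        int lastSlash = path.lastIndexOf('/');
        return path.substring(0, lastSlash);
    }

    // Keep only files whose partition matches the wanted day; decided
    // purely from the directory layout, without any file I/O.
    static List<String> prune(List<String> paths, String wantedDay) {
        List<String> kept = new ArrayList<>();
        for (String p : paths) {
            if (partitionOf(p).equals(wantedDay)) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
            "2017/02/22/1.file",
            "2017/02/23/1.file",
            "2017/02/23/2.file");
        System.out.println(prune(paths, "2017/02/23")); // prints [2017/02/23/1.file, 2017/02/23/2.file]
    }
}
```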
This doesn't necessarily mean we should have another trait like
{{PartitionableTableSource}}; either extending the under-review
{{FilterableTableSource}} or providing an explicit
{{PartitionableTableSource}} is fine with me. But we should at least make
"partition pruning" visible to users who may write their own
{{TableSource}}, instead of letting all the magic happen inside one method,
which is currently
{code}def setPredicate(predicate: Array[Expression]): Array[Expression]{code}
in the version of {{FilterableTableSource}} that is under review.
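To illustrate what a separate contract could look like, here is one possible shape (a hypothetical sketch, not the API under review): the source lists its partitions up front, and the planner hands back only the partitions that survive pruning, separately from ordinary filter push-down. The interface, method names, and the toy {{CsvSource}} are all assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionableSourceSketch {

    // Hypothetical contract: partition pruning is explicit, not hidden
    // inside a generic setPredicate-style method.
    interface PartitionableTableSource {
        List<String> getAllPartitions();
        PartitionableTableSource applyPartitionPruning(List<String> remaining);
    }

    // Toy implementation backed by an in-memory partition list.
    static class CsvSource implements PartitionableTableSource {
        final List<String> partitions;
        CsvSource(List<String> partitions) { this.partitions = partitions; }
        public List<String> getAllPartitions() { return partitions; }
        public PartitionableTableSource applyPartitionPruning(List<String> remaining) {
            return new CsvSource(remaining);
        }
    }

    // What the planner would do: filter on partition values only, then
    // hand the surviving partitions back to the source.
    static PartitionableTableSource prune(PartitionableTableSource source, String keep) {
        List<String> remaining = new ArrayList<>();
        for (String p : source.getAllPartitions()) {
            if (p.equals(keep)) {
                remaining.add(p);
            }
        }
        return source.applyPartitionPruning(remaining);
    }

    public static void main(String[] args) {
        PartitionableTableSource source =
            new CsvSource(Arrays.asList("day=2017-02-22", "day=2017-02-23"));
        System.out.println(prune(source, "day=2017-02-23").getAllPartitions()); // prints [day=2017-02-23]
    }
}
```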
Let me know if you have some thoughts about this.
> support partition pruning on Table API & SQL
> --------------------------------------------
>
> Key: FLINK-5859
> URL: https://issues.apache.org/jira/browse/FLINK-5859
> Project: Flink
> Issue Type: New Feature
> Components: Table API & SQL
> Reporter: godfrey he
> Assignee: godfrey he
>
> Many data sources are partitionable storage, e.g. HDFS, Druid. And many
> queries only need to read a small subset of the total data. We can use
> partition information to prune or skip over files irrelevant to the user's
> queries. Both query optimization time and execution time can be reduced
> significantly, especially for a large partitioned table.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)