Hi Jingsong,

Thanks for the feedback.


>One concern I have is that we read the footer for each file, and this may
>be a bit costly in some cases. Is it possible for us to have some
> hierarchical way
yes, if there are thousands of orc/parquet files, it may take a long time.
So we can introduce a config option to let the user choose the
granularity of the statistics.
But the SIZE will not be introduced, because the planner does not use
the file size statistics now.
We can introduce once file size statistics is introduce in the future.
I think we should also introduce a config option to enable/disable
SupportStatisticReport,
because it's a heavy operation for some connectors in some cases.

> is the filter pushdown already happening at
> this time?
That's a good point. Currently, the filter push down is after partition pruning
to prevent the filter push down rule from consuming the partition predicates.
The statistics will be set to unknown if filter is pushed down now.
To combine them all, we can create an optimization program after filter push
down program to collect the statistics. This could avoid collecting
statistics multiple times.


Best,
Godfrey

Jingsong Li <jingsongl...@gmail.com> 于2022年5月13日周五 22:44写道:
>
> Thank Godfrey for driving.
>
> Looks very good~ This will undoubtedly greatly enhance the various batch
> mode connectors.
>
> I left some comments:
>
> ## FileBasedStatisticsReportableDecodingFormat
>
> One concern I have is that we read the footer for each file, and this may
> be a bit costly in some cases. Is it possible for us to have some
> hierarchical way, e.g.
> - No statistics are collected for files by default.
> - SIZE: Generate statistics based on file Size, get the size of the file
> only with access to the master of the FileSystem.
> - DETAILED: Get the complete statistics by format, possibly by accessing
> the footer of the file.
>
> ## When use the statistics reported by connector
>
> > When partitions are pruned by PushPartitionIntoTableSourceScanRule, the
> statistics should also be updated.
>
> I understand that we definitely need to use reporter after the partition
> prune, but another question: is the filter pushdown already happening at
> this time?
> Can we make sure that in the following three cases, both the filter
> pushdown and the partition prune happen before the stats reporting.
> - only partition prune happens
> - only filter pushdown happens
> - both filter pushdown and partition prune happen
>
> Best,
> Jingsong
>
> On Fri, May 13, 2022 at 6:57 PM godfrey he <godfre...@gmail.com> wrote:
>
> > Hi all,
> >
> > I would like to open a discussion on FLIP-231:  Introduce
> > SupportStatisticReport
> > to support reporting statistics from source connectors.
> >
> > Statistics are one of the most important inputs to the optimizer.
> > Accurate and complete statistics allows the optimizer to be more powerful.
> > Currently, the statistics of Flink SQL come from Catalog only,
> > while many Connectors have the ability to provide statistics, e.g.
> > FileSystem.
> > In production, we find many tables in Catalog do not have any statistics.
> > As a result, the optimizer can't generate better execution plans,
> > especially for Batch jobs.
> >
> > There are two approaches to enhance statistics for the planner,
> > one is to introduce the "ANALYZE TABLE" syntax which will write
> > the analyzed result to the catalog, another is to introduce a new
> > connector interface
> > which allows the connector itself to report statistics directly to the
> > planner.
> > The second one is a supplement to the catalog statistics.
> >
> > Here, we will discuss the second approach. Compared to the first one,
> > the second one is to get statistics in real time, no need to run an
> > analysis job for each table. This could help improve the user
> > experience.
> > (We will also introduce the "ANALYZE TABLE" syntax in other FLIP.)
> >
> > You can find more details in FLIP-231 document[1]. Looking forward to
> > your feedback.
> >
> > [1]
> > https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&;
> > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231
> >
> >
> > Best,
> > Godfrey
> >

Reply via email to