Hi Godfrey,

Thanks for driving this! There are some areas where I couldn't find enough
information in the FLIP, so I was wondering if I could get more
explanation from you on the following questions:

1. When will the converted SELECT statement of the ANALYZE TABLE be
submitted? Right after the CREATE TABLE?
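
For context, this is roughly how I picture the conversion — just a
hypothetical sketch on my side, the table name, column names, and the exact
aggregate functions are my assumptions and not taken from the FLIP:

```sql
-- Hypothetical: a statement like
--   ANALYZE TABLE orders COMPUTE STATISTICS FOR ALL COLUMNS;
-- might be rewritten by the planner into a batch query such as
SELECT
  COUNT(1)                        AS row_count,
  COUNT(order_id)                 AS order_id_non_null_count,
  MIN(order_id)                   AS order_id_min,
  MAX(order_id)                   AS order_id_max,
  APPROX_COUNT_DISTINCT(order_id) AS order_id_ndv
FROM orders;
-- with the result then written back into the catalog statistics.
```

If that is the right mental model, my question is at which point this query
gets submitted.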

2. Will it be submitted periodically to keep the statistics
up-to-date, since the data might be mutable?

3. "If no partition is specified, the statistics will be gathered for all
partitions" - I think this is fine for multi-level partitions, e.g. PARTITION
(ds='2022-06-01') means two partitions: PARTITION (ds='2022-06-01', hr=1)
and PARTITION (ds='2022-06-01', hr=2), because it will save a lot of code
and therefore help developers work more efficiently. If we use this rule for
top-level partitions, it might not be strong enough to avoid human
error, e.g. a developer might trigger a huge scan on a table with many
partitions if they forget to write the partition in the ANALYZE TABLE
script. In this case, I would suggest requiring FOR ALL PARTITIONS
explicitly, just like FOR ALL COLUMNS.
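
To illustrate, here is a hypothetical sketch — the FOR ALL PARTITIONS
keyword and the table/partition names are my suggestion, not existing or
proposed syntax:

```sql
-- Per the FLIP today, omitting the partition silently analyzes everything,
-- which can mean a huge scan on a table with many partitions:
ANALYZE TABLE sales COMPUTE STATISTICS FOR ALL COLUMNS;

-- Analyzing one top-level partition (and all its sub-partitions) is fine
-- to keep implicit:
ANALYZE TABLE sales PARTITION (ds='2022-06-01')
  COMPUTE STATISTICS FOR ALL COLUMNS;

-- Suggested: make the all-partitions case an explicit opt-in,
-- analogous to FOR ALL COLUMNS:
ANALYZE TABLE sales FOR ALL PARTITIONS
  COMPUTE STATISTICS FOR ALL COLUMNS;
```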

Best regards,
Jing


On Wed, Jun 15, 2022 at 10:16 AM godfrey he <godfre...@gmail.com> wrote:

> Hi Jark,
>
> Thanks for the inputs.
>
> >Do we need to provide a DESC EXTENDED <table_name> statement like
> >Spark[1] to show statistics for table/partition/columns?
> We do support the `DESC EXTENDED` syntax, but currently only the table
> schema will be displayed. I think we just need a JIRA ticket to support
> showing statistics as well.
>
> > is it possible to ignore execution mode and force using batch mode for
> the statement?
> As I replied above, the semantics of `ANALYZE TABLE` do not
> distinguish between batch and streaming:
> it works for both, but the result for unbounded
> sources is meaningless.
> Currently, I throw an exception for streaming mode,
> and we can support streaming mode with bounded sources in the future.
>
> Best,
> Godfrey
>
> Jark Wu <imj...@gmail.com> wrote on Tue, Jun 14, 2022 at 17:56:
> >
> > Hi Godfrey, thanks for starting this discussion, this is a great feature
> > for batch users.
> >
> > The FLIP looks good to me in general.
> >
> > I only have 2 comments:
> >
> > 1) How do users know whether the given table or partition contains the
> > required statistics?
> > Do we need to provide a DESC EXTENDED <table_name> statement like
> > Spark[1] to show statistics for table/partition/columns?
> >
> > 2) If ANALYZE TABLE can only run in batch mode, is it possible to ignore
> > the execution mode and force batch mode for this statement? From my
> > perspective, ANALYZE TABLE is an auxiliary statement similar to SHOW
> > TABLES but heavier, which doesn't care about the environment execution
> > mode.
> >
> > Best,
> > Jark
> >
> > [1]:
> >
> https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-analyze-table.html
> >
> > On Tue, 14 Jun 2022 at 13:52, Jing Ge <j...@ververica.com> wrote:
> >
> > > Hi 华宗
> > >
> > > In order to unsubscribe, please send an email to
> > > dev-unsubscr...@flink.apache.org
> > >
> > > Thanks
> > >
> > > Best regards,
> > > Jing
> > >
> > >
> > > On Tue, Jun 14, 2022 at 2:05 AM 华宗 <zhanghuaz...@126.com> wrote:
> > >
> > > > Unsubscribe
> > > >
> > > > At 2022-06-13 22:44:24, "cao zou" <zoucao...@gmail.com> wrote:
> > > > >Hi godfrey, thanks for your detailed explanation.
> > > > >After your explanation and a glance over FLIP-231, I think it is
> > > > >really needed. +1 for this and looking forward to it.
> > > > >
> > > > >best
> > > > >zoucao
> > > > >
> > > > >godfrey he <godfre...@gmail.com> wrote on Mon, Jun 13, 2022 at 14:43:
> > > > >
> > > > >> Hi Ingo,
> > > > >>
> > > > >> The semantics do not distinguish between batch and streaming:
> > > > >> it works for both, but the result for
> > > > >> unbounded sources is meaningless.
> > > > >> Currently, I throw an exception for streaming mode,
> > > > >> and we can support streaming mode with bounded sources
> > > > >> in the future.
> > > > >>
> > > > >> Best,
> > > > >> Godfrey
> > > > >>
> > > > >> Ingo Bürk <airbla...@apache.org> wrote on Mon, Jun 13, 2022 at 14:17:
> > > > >> >
> > > > >> > Hi Godfrey,
> > > > >> >
> > > > >> > thank you for the explanation. A SELECT is definitely more
> > > > >> > generic and will work for all connectors automatically. As such
> > > > >> > I think it's a good baseline solution regardless.
> > > > >> >
> > > > >> > We can also think about allowing connector-specific
> > > > >> > optimizations in the future, but I do like your idea of letting
> > > > >> > the optimizer rules perform a lot of the work here already by
> > > > >> > leveraging existing optimizations. Similarly, things like
> > > > >> > non-null counts of non-nullable columns would (or at least
> > > > >> > could) be handled by the optimizer rules already.
> > > > >> >
> > > > >> > So as far as that point goes, +1 to the generic approach.
> > > > >> >
> > > > >> > One more point, though: in general we should avoid supporting
> > > > >> > features only in specific modes, as it breaks the unification
> > > > >> > promise. Given that ANALYZE is a manual and completely optional
> > > > >> > operation, I'm OK with doing that here in principle. However, I
> > > > >> > wonder what will happen in the streaming / unbounded case. Do
> > > > >> > you plan to throw an error? Or do we complete the command as
> > > > >> > successful but without doing anything?
> > > > >> >
> > > > >> >
> > > > >> > Best
> > > > >> > Ingo
> > > > >> >
> > > > >> > On 13.06.22 05:50, godfrey he wrote:
> > > > >> > > Hi Ingo,
> > > > >> > >
> > > > >> > > Thanks for the inputs.
> > > > >> > >
> > > > >> > > I think converting `ANALYZE TABLE` to a `SELECT` statement is
> > > > >> > > the more generic approach. Because query plan optimization is
> > > > >> > > more generic, we can provide more optimization rules that
> > > > >> > > optimize not only the `SELECT` statements converted from
> > > > >> > > `ANALYZE TABLE` but also the `SELECT` statements written by
> > > > >> > > users.
> > > > >> > >
> > > > >> > >> JDBC connector can get a row count estimate without
> > > > >> > >> performing a SELECT COUNT(1)
> > > > >> > > To optimize such cases, we can implement a rule to push the
> > > > >> > > aggregate into the table source.
> > > > >> > > Currently, there is a similar mechanism,
> > > > >> > > SupportsAggregatePushDown, which for now only supports pushing
> > > > >> > > the local aggregate into the source.
> > > > >> > >
> > > > >> > >
> > > > >> > > Best,
> > > > >> > > Godfrey
> > > > >> > >
> > > > >> > > Ingo Bürk <airbla...@apache.org> wrote on Fri, Jun 10, 2022 at 17:15:
> > > > >> > >>
> > > > >> > >> Hi Godfrey,
> > > > >> > >>
> > > > >> > >> compared to the solution proposed in the FLIP (using a SELECT
> > > > >> > >> statement), I wonder if you have considered adding APIs to
> > > > >> > >> catalogs / connectors to perform this task as an alternative?
> > > > >> > >> I could imagine that for many connectors, statistics could be
> > > > >> > >> implemented in a less expensive way by leveraging the
> > > > >> > >> underlying system (e.g. a JDBC connector can get a row count
> > > > >> > >> estimate without performing a SELECT COUNT(1)).
> > > > >> > >>
> > > > >> > >>
> > > > >> > >> Best
> > > > >> > >> Ingo
> > > > >> > >>
> > > > >> > >>
> > > > >> > >> On 10.06.22 09:53, godfrey he wrote:
> > > > >> > >>> Hi all,
> > > > >> > >>>
> > > > >> > >>> I would like to open a discussion on FLIP-240: Introduce
> > > > >> > >>> "ANALYZE TABLE" Syntax.
> > > > >> > >>>
> > > > >> > >>> As FLIP-231 mentioned, statistics are one of the most
> > > > >> > >>> important inputs to the optimizer. Accurate and complete
> > > > >> > >>> statistics allow the optimizer to be more powerful. "ANALYZE
> > > > >> > >>> TABLE" syntax is a very common and effective approach to
> > > > >> > >>> gathering statistics, and it has already been introduced by
> > > > >> > >>> many compute engines and databases.
> > > > >> > >>>
> > > > >> > >>> The main purpose of this discussion is to introduce the
> > > > >> > >>> "ANALYZE TABLE" syntax for Flink SQL.
> > > > >> > >>>
> > > > >> > >>> You can find more details in the FLIP-240 document[1]. Looking
> > > > >> > >>> forward to your feedback.
> > > > >> > >>>
> > > > >> > >>> [1]
> > > > >>
> > > >
> > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217386481
> > > > >> > >>> [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-240
> > > > >> > >>>
> > > > >> > >>>
> > > > >> > >>> Best,
> > > > >> > >>> Godfrey
> > > > >>
> > > >
> > >
>
