Re: Re: [DISCUSS] FLIP-240: Introduce "ANALYZE TABLE" Syntax

godfrey he Thu, 23 Jun 2022 07:31:26 -0700

Hi, everyone.

Thanks for all the inputs.
If there is no feedback any more, I will start the vote tomorrow.


Best,
Godfrey

Jing Ge <j...@ververica.com> 于2022年6月22日周三 15:50写道：
>
> sounds good to me. Thanks!
>
> Best regards,
> Jing
>
> On Fri, Jun 17, 2022 at 5:37 AM godfrey he <godfre...@gmail.com> wrote:
>
> > Hi, Jing.
> >
> > Thanks for the feedback.
> >
> > >When will the converted SELECT statement of the ANALYZE TABLE be
> > > submitted? right after the CREATE TABLE?
> > The SELECT  job will be submitted only when `ANALYZE TABLE` is executed,
> > and there is nothing to do with CREATE TABLE. Because the `ANALYZE TABLE`
> > is triggered manually as needed.
> >
> > >Will it be submitted periodically to keep the statistical data
> > >up-to-date, since the data might be mutable?
> > the `ANALYZE TABLE` is triggered manually as needed.
> > I will update the doc.
> >
> > >It might not be strong enough to avoid human error
> > > I would suggest using FOR ALL PARTITIONS explicitly
> > > just like FOR ALL COLUMNS.
> > Agree, specifying `PARTITION` explicitly is more friendly
> > and safe. I prefer to use `PARTITION(ds, hr)` without
> > specific partition value, hive has the similar syntax.
> > WDYT ?
> >
> > Best,
> > Godfrey
> >
> > Jing Ge <j...@ververica.com> 于2022年6月16日周四 03:53写道：
> > >
> > > Hi Godfrey,
> > >
> > > Thanks for driving this! There are some areas where I couldn't find
> > enough
> > > information in the FLIP, just wondering if I could get more
> > > explanation from you w.r.t. the following questions:
> > >
> > > 1. When will the converted SELECT statement of the ANALYZE TABLE be
> > > submitted? right after the CREATE TABLE?
> > >
> > > 2. Will it be submitted periodically to keep the statistical data
> > > up-to-date, since the data might be mutable?
> > >
> > > 3. " If no partition is specified, the statistics will be gathered for
> > all
> > > partitions"  - I think this is fine for multi-level partitions, e.g.
> > PARTITION
> > > (ds='2022-06-01') means two partitions: PARTITION (ds='2022-06-01', hr=1)
> > > and PARTITION (ds='2022-06-01', hr=2), because it will save a lot of code
> > > and therefore help developer work more efficiently. If we use this rule
> > for
> > > top level partitions, It might not be strong enough to avoid human
> > > error, e.g. developer might trigger huge selection on the table with many
> > > partitions, when he forgot to write the partition in the ANALYZE TABLE
> > > script. In this case, I would suggest using FOR ALL PARTITIONS explicitly
> > > just like FOR ALL COLUMNS.
> > >
> > > Best regards,
> > > Jing
> > >
> > >
> > > On Wed, Jun 15, 2022 at 10:16 AM godfrey he <godfre...@gmail.com> wrote:
> > >
> > > > Hi Jark,
> > > >
> > > > Thanks for the inputs.
> > > >
> > > > >Do we need to provide DESC EXTENDED <table_name> statement like
> > Spark[1]
> > > > to
> > > > >show statistic for table/partition/columns?
> > > > We do have supported `DESC EXTENDED` syntax, but currently only table
> > > > schema
> > > > will be display, I think we just need a JIRA to support it.
> > > >
> > > > > is it possible to ignore execution mode and force using batch mode
> > for
> > > > the statement?
> > > > As I replied above, The semantics of `ANALYZE TABLE` does not
> > > > distinguish batch and streaming,
> > > > It works for both batch and streaming, but the result of unbounded
> > > > sources is meaningless.
> > > > Currently, I throw exception for streaming mode,
> > > > and we can support streaming mode with bounded source in the future.
> > > >
> > > > Best,
> > > > Godfrey
> > > >
> > > > Jark Wu <imj...@gmail.com> 于2022年6月14日周二 17:56写道：
> > > > >
> > > > > Hi Godfrey, thanks for starting this discussion, this is a great
> > feature
> > > > > for batch users.
> > > > >
> > > > > The FLIP looks good to me in general.
> > > > >
> > > > > I only have 2 comments:
> > > > >
> > > > > 1) How do users know whether the given table or partition contains
> > > > required
> > > > > statistics?
> > > > > Do we need to provide DESC EXTENDED <table_name> statement like
> > Spark[1]
> > > > to
> > > > > show statistic for table/partition/columns?
> > > > >
> > > > > 2) If ANALYZE TABLE can only run in batch mode, is it possible to
> > ignore
> > > > > execution mode
> > > > > and force using batch mode for the statement? From my perspective,
> > > > ANALYZE
> > > > > TABLE
> > > > > is an auxiliary statement similar to SHOW TABLES but heavier, which
> > > > doesn't
> > > > > care about
> > > > > environment execution mode.
> > > > >
> > > > > Best,
> > > > > Jark
> > > > >
> > > > > [1]:
> > > > >
> > > >
> > https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-analyze-table.html
> > > > >
> > > > > On Tue, 14 Jun 2022 at 13:52, Jing Ge <j...@ververica.com> wrote:
> > > > >
> > > > > > Hi 华宗
> > > > > >
> > > > > > 退订请发送任意消息至dev-unsubscr...@flink.apache.org
> > > > > > In order to unsubscribe, please send an email to
> > > > > > dev-unsubscr...@flink.apache.org
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > Best regards,
> > > > > > Jing
> > > > > >
> > > > > >
> > > > > > On Tue, Jun 14, 2022 at 2:05 AM 华宗 <zhanghuaz...@126.com> wrote:
> > > > > >
> > > > > > > 退订
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > At 2022-06-13 22:44:24, "cao zou" <zoucao...@gmail.com> wrote:
> > > > > > > >Hi godfrey, thanks for your detail explanation.
> > > > > > > >After explaining and glancing over the FLIP-231, I think it is
> > > > > > > >really need, +1 for this and looking forward to it.
> > > > > > > >
> > > > > > > >best
> > > > > > > >zoucao
> > > > > > > >
> > > > > > > >godfrey he <godfre...@gmail.com> 于2022年6月13日周一 14:43写道：
> > > > > > > >
> > > > > > > >> Hi Ingo,
> > > > > > > >>
> > > > > > > >> The semantics does not distinguish batch and streaming,
> > > > > > > >> It works for both batch and streaming, but the result of
> > > > > > > >> unbounded sources is meaningless.
> > > > > > > >> Currently, I throw exception for streaming mode,
> > > > > > > >> and we can support streaming mode with bounded source
> > > > > > > >> in the future.
> > > > > > > >>
> > > > > > > >> Best,
> > > > > > > >> Godfrey
> > > > > > > >>
> > > > > > > >> Ingo Bürk <airbla...@apache.org> 于2022年6月13日周一 14:17写道：
> > > > > > > >> >
> > > > > > > >> > Hi Godfrey,
> > > > > > > >> >
> > > > > > > >> > thank you for the explanation. A SELECT is definitely more
> > > > generic
> > > > > > and
> > > > > > > >> > will work for all connectors automatically. As such I think
> > > > it's a
> > > > > > > good
> > > > > > > >> > baseline solution regardless.
> > > > > > > >> >
> > > > > > > >> > We can also think about allowing connector-specific
> > > > optimizations in
> > > > > > > the
> > > > > > > >> > future, but I do like your idea of letting the optimizer
> > rules
> > > > > > > perform a
> > > > > > > >> > lot of the work here already by leveraging existing
> > > > optimizations.
> > > > > > > >> > Similarly things like non-null counts of non-nullable
> > columns
> > > > would
> > > > > > > (or
> > > > > > > >> > at least could) be handled by the optimizer rules already.
> > > > > > > >> >
> > > > > > > >> > So as far as that point goes, +1 to the generic approach.
> > > > > > > >> >
> > > > > > > >> > One more point, though: In general we should avoid
> > supporting
> > > > > > features
> > > > > > > >> > only in specific modes as it breaks the unification promise.
> > > > Given
> > > > > > > that
> > > > > > > >> > ANALYZE is a manual and completely optional operation I'm OK
> > > > with
> > > > > > > doing
> > > > > > > >> > that here in principle. However, I wonder what will happen
> > in
> > > > the
> > > > > > > >> > streaming / unbounded case. Do you plan to throw an error?
> > Or
> > > > do we
> > > > > > > >> > complete the command as successful but without doing
> > anything?
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > Best
> > > > > > > >> > Ingo
> > > > > > > >> >
> > > > > > > >> > On 13.06.22 05:50, godfrey he wrote:
> > > > > > > >> > > Hi Ingo,
> > > > > > > >> > >
> > > > > > > >> > > Thanks for the inputs.
> > > > > > > >> > >
> > > > > > > >> > > I think converting `ANALYZE TABLE` to `SELECT` statement
> > is
> > > > > > > >> > > more generic approach. Because query plan optimization is
> > more
> > > > > > > generic,
> > > > > > > >> > >   we can provide more optimization rules to optimize not
> > only
> > > > > > > `SELECT`
> > > > > > > >> statement
> > > > > > > >> > > converted from `ANALYZE TABLE` but also the `SELECT`
> > statement
> > > > > > > written
> > > > > > > >> by users.
> > > > > > > >> > >
> > > > > > > >> > >> JDBC connector can get a row count estimate without
> > > > performing a
> > > > > > > >> > >> SELECT COUNT(1)
> > > > > > > >> > > To optimize such cases, we can implement a rule to push
> > > > aggregate
> > > > > > > into
> > > > > > > >> > > table source.
> > > > > > > >> > > Currently, there is a similar rule:
> > SupportsAggregatePushDown,
> > > > > > which
> > > > > > > >> > > supports only pushing
> > > > > > > >> > > local aggregate into source now.
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > Best,
> > > > > > > >> > > Godfrey
> > > > > > > >> > >
> > > > > > > >> > > Ingo Bürk <airbla...@apache.org> 于2022年6月10日周五 17:15写道：
> > > > > > > >> > >>
> > > > > > > >> > >> Hi Godfrey,
> > > > > > > >> > >>
> > > > > > > >> > >> compared to the solution proposed in the FLIP (using a
> > SELECT
> > > > > > > >> > >> statement), I wonder if you have considered adding APIs
> > to
> > > > > > > catalogs /
> > > > > > > >> > >> connectors to perform this task as an alternative?
> > > > > > > >> > >> I could imagine that for many connectors, statistics
> > could be
> > > > > > > >> > >> implemented in a less expensive way by leveraging the
> > > > underlying
> > > > > > > >> system
> > > > > > > >> > >> (e.g. a JDBC connector can get a row count estimate
> > without
> > > > > > > >> performing a
> > > > > > > >> > >> SELECT COUNT(1)).
> > > > > > > >> > >>
> > > > > > > >> > >>
> > > > > > > >> > >> Best
> > > > > > > >> > >> Ingo
> > > > > > > >> > >>
> > > > > > > >> > >>
> > > > > > > >> > >> On 10.06.22 09:53, godfrey he wrote:
> > > > > > > >> > >>> Hi all,
> > > > > > > >> > >>>
> > > > > > > >> > >>> I would like to open a discussion on FLIP-240:
> > Introduce
> > > > > > "ANALYZE
> > > > > > > >> > >>> TABLE" Syntax.
> > > > > > > >> > >>>
> > > > > > > >> > >>> As FLIP-231 mentioned, statistics are one of the most
> > > > important
> > > > > > > >> inputs
> > > > > > > >> > >>> to the optimizer. Accurate and complete statistics
> > allows
> > > > the
> > > > > > > >> > >>> optimizer to be more powerful. "ANALYZE TABLE" syntax
> > is a
> > > > very
> > > > > > > >> common
> > > > > > > >> > >>> but effective approach to gather statistics, which is
> > > > already
> > > > > > > >> > >>> introduced by many compute engines and databases.
> > > > > > > >> > >>>
> > > > > > > >> > >>> The main purpose of  discussion is to introduce "ANALYZE
> > > > TABLE"
> > > > > > > >> syntax
> > > > > > > >> > >>> for Flink sql.
> > > > > > > >> > >>>
> > > > > > > >> > >>> You can find more details in FLIP-240 document[1].
> > Looking
> > > > > > > forward to
> > > > > > > >> > >>> your feedback.
> > > > > > > >> > >>>
> > > > > > > >> > >>> [1]
> > > > > > > >>
> > > > > > >
> > > > > >
> > > >
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217386481
> > > > > > > >> > >>> [2] POC:
> > https://github.com/godfreyhe/flink/tree/FLIP-240
> > > > > > > >> > >>>
> > > > > > > >> > >>>
> > > > > > > >> > >>> Best,
> > > > > > > >> > >>> Godfrey
> > > > > > > >>
> > > > > > >
> > > > > >
> > > >
> >

Re: Re: [DISCUSS] FLIP-240: Introduce "ANALYZE TABLE" Syntax

Reply via email to