Thanks Leonard for your reminder, The FLIP title has been changed as FLIP-377: Support fine-grained configuration to control filter push down for Table/SQL Sources.
Best, Jiabao On 2024/01/03 06:51:10 Leonard Xu wrote: > Thanks Jiabao for driving this. > > +1 to start a vote, a minor comment, should we change the FLIP title > according this thread context as well? > > Best, > Leonard > > > > > 2024年1月3日 下午2:43,Jiabao Sun <[email protected]> 写道: > > > > Hi, > > > > Thank you again for the discussion on this FLIP. > > If there are no further comments, I plan to start a voting thread tomorrow. > > > > Best, > > Jiabao > > > > On 2023/12/20 14:09:49 Jiabao Sun wrote: > >> Hi, > >> > >> Thank you to everyone for the discussion on this FLIP, > >> especially Becket for providing guidance that made it more reasonable. > >> > >> The FLIP document[1] has been updated with the recent discussed content. > >> Please take a look to double-check it when you have time. > >> > >> If we can reach a consensus on this, I will open the voting thread in > >> recent days. > >> > >> Best, > >> Jiabao > >> > >> [1] > >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=276105768 > >> > >> > >>> 2023年12月20日 11:38,Jiabao Sun <[email protected]> 写道: > >>> > >>> Thanks Becket, > >>> > >>> The behavior description has been added to the Public Interfaces section. > >>> > >>> Best, > >>> Jiabao > >>> > >>> > >>>> 2023年12月20日 08:17,Becket Qin <[email protected] <http://gmail.com/>> 写道: > >>>> > >>>> Hi Jiabao, > >>>> > >>>> Thanks for updating the FLIP. > >>>> Can you add the behavior of the policies that are only applicable to some > >>>> but not all of the databases? This is a part of the intended behavior of > >>>> the proposed configuration. So, we should include that in the FLIP. > >>>> Otherwise, the FLIP looks good to me. > >>>> > >>>> Cheers, > >>>> > >>>> Jiangjie (Becket) Qin > >>>> > >>>> On Tue, Dec 19, 2023 at 11:00 PM Jiabao Sun <[email protected]> > >>>> wrote: > >>>> > >>>>> Hi Becket, > >>>>> > >>>>> I share the same view as you regarding the prefix for this configuration > >>>>> option. > >>>>> > >>>>> For the JDBC connector, I prefer setting 'filter.handling.policy' = > >>>>> 'FOO' > >>>>> and throwing an exception when the database do not support that specific > >>>>> policy. > >>>>> > >>>>> Not using a prefix can reduce the learning curve for users and avoid > >>>>> introducing a new set of configuration options for every supported JDBC > >>>>> database. > >>>>> I think the policies we provide can be compatible with most databases > >>>>> that > >>>>> follow the JDBC protocol. > >>>>> However, there may be cases where certain databases cannot support some > >>>>> policies. > >>>>> Nevertheless, we can ensure fast failure and allow users to choose a > >>>>> suitable policy in such situations. > >>>>> > >>>>> I have removed the contents about the configuration prefix. > >>>>> Please help review it again. > >>>>> > >>>>> Thanks, > >>>>> Jiabao > >>>>> > >>>>> > >>>>>> 2023年12月19日 19:46,Becket Qin <[email protected] <http://gmail.com/>> 写道: > >>>>>> > >>>>>> Hi Jiabao, > >>>>>> > >>>>>> Thanks for updating the FLIP. > >>>>>> > >>>>>> One more question regarding the JDBC connector, since it is a connector > >>>>>> shared by multiple databases, what if there is a filter handling policy > >>>>>> that is only applicable to one of the databases, but not the others? In > >>>>>> that case, how would the users specify that policy? > >>>>>> Unlike the example of orc format with 2nd+ level config, JDBC connector > >>>>>> only looks at the URL to decide which driver to use. > >>>>>> > >>>>>> For example, MySql supports policy FOO while other databases do not. If > >>>>>> users want to use FOO for MySql, what should they do? Will they set > >>>>>> '*mysql.filter.hanlding.policy' > >>>>>> = 'FOO', *which will only be picked up when the MySql driver is used? > >>>>>> Or they should just set* 'filter.handling.policy' = 'FOO', *and throw > >>>>>> exceptions when other JDBC drivers are used? Personally, I prefer the > >>>>>> latter. If we pick that, do we still need to mention the following? > >>>>>> > >>>>>> *The prefix is needed when the option is for a 2nd+ level. * > >>>>>>> 'connector' = 'filesystem', > >>>>>>> 'format' = 'orc', > >>>>>>> 'orc.filter.handling.policy' = 'NUBERIC_ONY' > >>>>>>> > >>>>>>> *In this case, the values of this configuration may be different > >>>>> depending > >>>>>>> on the format option. For example, orc format may have INDEXED_ONLY > >>>>> while > >>>>>>> parquet format may have something else. * > >>>>>>> > >>>>>> > >>>>>> I found this is somewhat misleading, because the example here is not a > >>>>> part > >>>>>> of the proposal of this FLIP. It is just an example explaining when a > >>>>>> prefix is needed, which seems orthogonal to the proposal in this FLIP. > >>>>>> > >>>>>> Thanks, > >>>>>> > >>>>>> Jiangjie (Becket) Qin > >>>>>> > >>>>>> > >>>>>> On Tue, Dec 19, 2023 at 10:09 AM Jiabao Sun <[email protected] > >>>>>> <[email protected]> > >>>>> .invalid> > >>>>>> wrote: > >>>>>> > >>>>>>> Thanks Becket for the suggestions, > >>>>>>> > >>>>>>> Updated. > >>>>>>> Please help review it again when you have time. > >>>>>>> > >>>>>>> Best, > >>>>>>> Jiabao > >>>>>>> > >>>>>>> > >>>>>>>> 2023年12月19日 09:06,Becket Qin <[email protected] <http://gmail.com/>> > >>>>>>>> 写道: > >>>>>>>> > >>>>>>>> Hi JIabao, > >>>>>>>> > >>>>>>>> Thanks for updating the FLIP. It looks better. Some suggestions / > >>>>>>> questions: > >>>>>>>> > >>>>>>>> 1. In the motivation section: > >>>>>>>> > >>>>>>>>> *Currently, Flink Table/SQL does not expose fine-grained control for > >>>>>>> users > >>>>>>>>> to control filter pushdown. **However, filter pushdown has some side > >>>>>>>>> effects, such as additional computational pressure on external > >>>>>>>>> systems. Moreover, Improper queries can lead to issues such as full > >>>>>>> table > >>>>>>>>> scans, which in turn can impact the stability of external systems.* > >>>>>>>> > >>>>>>>> This statement sounds like the side effects are there for all the > >>>>>>> systems, > >>>>>>>> which is inaccurate. Maybe we can say: > >>>>>>>> *Currently, Flink Table/SQL does not expose fine-grained control for > >>>>>>> users > >>>>>>>> to control filter pushdown. **However, filter pushdown may have side > >>>>>>>> effects in some cases, **such as additional computational pressure on > >>>>>>>> external systems. The JDBC source is a typical example of that. If a > >>>>>>> filter > >>>>>>>> is pushed down to the database, an expensive full table scan may > >>>>>>>> happen > >>>>>>> if > >>>>>>>> the filter involves unindexed columns.* > >>>>>>>> > >>>>>>>> 2. Regarding the prefix, usually a prefix is not required for the top > >>>>>>> level > >>>>>>>> connector options. This is because the *connector* option is already > >>>>>>> there. > >>>>>>>> So > >>>>>>>> 'connector' = 'jdbc', > >>>>>>>> 'filter.handling.policy' = 'ALWAYS' > >>>>>>>> is sufficient. > >>>>>>>> > >>>>>>>> The prefix is needed when the option is for a 2nd+ level. For > >>>>>>>> example, > >>>>>>>> 'connector' = 'jdbc', > >>>>>>>> 'format' = 'orc', > >>>>>>>> 'orc.some.option' = 'some_value' > >>>>>>>> In this case, the prefix of "orc" is needed to make it clear this > >>>>> option > >>>>>>> is > >>>>>>>> for the format. > >>>>>>>> > >>>>>>>> I am guessing that the reason that currently the connector prefix is > >>>>>>> there > >>>>>>>> is because the values of this configuration may be different > >>>>>>>> depending > >>>>> on > >>>>>>>> the connectors. For example, jdbc may have INDEXED_ONLY while MongoDB > >>>>> may > >>>>>>>> have something else. Personally speaking, I am fine if we do not > >>>>>>>> have a > >>>>>>>> prefix in this case because users have already specified the > >>>>>>>> connector > >>>>>>> type > >>>>>>>> and it is intuitive enough that the option value is for that > >>>>>>>> connector, > >>>>>>> not > >>>>>>>> others. > >>>>>>>> > >>>>>>>> 3. can we clarify on the following statement: > >>>>>>>> > >>>>>>>>> *Introduce the native configuration [prefix].filter.handling.policy > >>>>>>>>> in > >>>>>>> the > >>>>>>>>> connector.* > >>>>>>>> > >>>>>>>> What do you mean by "native configuration"? From what I understand, > >>>>>>>> the > >>>>>>>> FLIP does the following: > >>>>>>>> - introduces a new configuration to the JDBC and MongoDB connector. > >>>>>>>> - Suggests a convention option name if other connectors are going to > >>>>> add > >>>>>>> an > >>>>>>>> option for the same purpose. > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> > >>>>>>>> Jiangjie (Becket) Qin > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> On Mon, Dec 18, 2023 at 5:45 PM Jiabao Sun <[email protected] > >>>>>>>> <[email protected]> > >>>>>>> .invalid> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> Hi Becket, > >>>>>>>>> > >>>>>>>>> The FLIP document[1] has been updated. > >>>>>>>>> Could you help take a look again? > >>>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> Jiabao > >>>>>>>>> > >>>>>>>>> [1] > >>>>>>>>> > >>>>>>> > >>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=276105768 > >>>>> > >>>>> <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=276105768> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>>> 2023年12月18日 16:53,Becket Qin <[email protected] <http://gmail.com/>> > >>>>>>>>>> 写道: > >>>>>>>>>> > >>>>>>>>>> Yes, that sounds reasonable to me. We can start with ALWAYS and > >>>>> NEVER, > >>>>>>>>> and > >>>>>>>>>> add more policies as needed. > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> > >>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>>> > >>>>>>>>>> On Mon, Dec 18, 2023 at 4:48 PM Jiabao Sun > >>>>>>>>>> <[email protected] <[email protected]> > >>>>>>>>> .invalid> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> Thanks Bucket, > >>>>>>>>>>> > >>>>>>>>>>> The jdbc.filter.handling.policy is good to me as it provides > >>>>>>> sufficient > >>>>>>>>>>> extensibility for future filter pushdown optimizations. > >>>>>>>>>>> However, currently, we don't have an implementation for the AUTO > >>>>> mode, > >>>>>>>>> and > >>>>>>>>>>> it seems that the AUTO mode can easily be confused with the ALWAYS > >>>>>>> mode > >>>>>>>>>>> because users don't have the opportunity to MANUALLY decide which > >>>>>>>>> filters > >>>>>>>>>>> to push down. > >>>>>>>>>>> > >>>>>>>>>>> I suggest that we only introduce the ALWAYS and NEVER modes for > >>>>>>>>>>> now, > >>>>>>> and > >>>>>>>>>>> we can consider introducing more flexible policies in the future, > >>>>>>>>>>> such as INDEX_ONLY, NUMBERIC_ONLY and so on. > >>>>>>>>>>> > >>>>>>>>>>> WDYT? > >>>>>>>>>>> > >>>>>>>>>>> Best, > >>>>>>>>>>> Jiabao > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>> 2023年12月18日 16:27,Becket Qin <[email protected] > >>>>>>>>>>>> <http://gmail.com/>> 写道: > >>>>>>>>>>>> > >>>>>>>>>>>> Hi Jiabao, > >>>>>>>>>>>> > >>>>>>>>>>>> Please see the reply inline. > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> The MySQL connector is currently in the flink-connector-jdbc > >>>>>>>>> repository > >>>>>>>>>>>>> and is not a standalone connector. > >>>>>>>>>>>>> Is it too unique to use "mysql" as the configuration option > >>>>> prefix? > >>>>>>>>>>>> > >>>>>>>>>>>> If the intended behavior makes sense to all the supported JDBC > >>>>>>> drivers, > >>>>>>>>>>> we > >>>>>>>>>>>> can make this a JDBC connector configuration. > >>>>>>>>>>>> > >>>>>>>>>>>> Also, I would like to ask about the difference in behavior > >>>>>>>>>>>> between > >>>>>>> AUTO > >>>>>>>>>>> and > >>>>>>>>>>>>> ALWAYS. > >>>>>>>>>>>>> It seems that we cannot guarantee the pushing down of all > >>>>>>>>>>>>> filters > >>>>> to > >>>>>>>>> the > >>>>>>>>>>>>> external system under the ALWAYS > >>>>>>>>>>>>> mode because not all filters in Flink SQL are supported by the > >>>>>>>>> external > >>>>>>>>>>>>> system. > >>>>>>>>>>>>> Should we throw an error when encountering a filter that cannot > >>>>>>>>>>>>> be > >>>>>>>>>>> pushed > >>>>>>>>>>>>> down in the ALWAYS mode? > >>>>>>>>>>>> > >>>>>>>>>>>> The idea of AUTO is to do efficiency-aware pushdowns. The source > >>>>> will > >>>>>>>>>>> query > >>>>>>>>>>>> the external system (MySQL, Oracle, SQL Server, etc) first to > >>>>>>> retrieve > >>>>>>>>>>> the > >>>>>>>>>>>> information of the table. With that information, the source will > >>>>>>> decide > >>>>>>>>>>>> whether to further push a filter to the external system based on > >>>>> the > >>>>>>>>>>>> efficiency. E.g. only push the indexed fields. In contrast, > >>>>>>>>>>>> ALWAYS > >>>>>>> will > >>>>>>>>>>>> just always push the supported filters to the external system, > >>>>>>>>> regardless > >>>>>>>>>>>> of the efficiency. In case there are filters that are not > >>>>> supported, > >>>>>>>>>>>> according to the current contract of SupportsFilterPushdown, > >>>>>>>>>>>> these > >>>>>>>>>>>> unsupported filters should just be returned by the > >>>>>>>>>>>> *SupportsFilterPushdown.applyFilters()* method as remaining > >>>>> filters. > >>>>>>>>>>>> Therefore, there is no need to throw exceptions here. This is > >>>>> likely > >>>>>>>>> the > >>>>>>>>>>>> desired behavior for most users, IMO. If there are cases that > >>>>>>>>>>>> users > >>>>>>>>>>> really > >>>>>>>>>>>> want to get alerted when a filter cannot be pushed to the > >>>>>>>>>>>> external > >>>>>>>>>>> system, > >>>>>>>>>>>> we can add another value like "ENFORCED_ALWAYS", which behaves > >>>>>>>>>>>> like > >>>>>>>>>>> ALWAYS, > >>>>>>>>>>>> but throws exceptions when a filter cannot be applied to the > >>>>> external > >>>>>>>>>>>> system. But personally I don't see much value in doing this. > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks, > >>>>>>>>>>>> > >>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> On Mon, Dec 18, 2023 at 3:54 PM Jiabao Sun < > >>>>> [email protected] <[email protected]> > >>>>>>>>>>> .invalid> > >>>>>>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Hi Becket, > >>>>>>>>>>>>> > >>>>>>>>>>>>> The MySQL connector is currently in the flink-connector-jdbc > >>>>>>>>> repository > >>>>>>>>>>>>> and is not a standalone connector. > >>>>>>>>>>>>> Is it too unique to use "mysql" as the configuration option > >>>>> prefix? > >>>>>>>>>>>>> > >>>>>>>>>>>>> Also, I would like to ask about the difference in behavior > >>>>>>>>>>>>> between > >>>>>>>>> AUTO > >>>>>>>>>>>>> and ALWAYS. > >>>>>>>>>>>>> It seems that we cannot guarantee the pushing down of all > >>>>>>>>>>>>> filters > >>>>> to > >>>>>>>>> the > >>>>>>>>>>>>> external system under the ALWAYS > >>>>>>>>>>>>> mode because not all filters in Flink SQL are supported by the > >>>>>>>>> external > >>>>>>>>>>>>> system. > >>>>>>>>>>>>> Should we throw an error when encountering a filter that cannot > >>>>>>>>>>>>> be > >>>>>>>>>>> pushed > >>>>>>>>>>>>> down in the ALWAYS mode? > >>>>>>>>>>>>> > >>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>> Jiabao > >>>>>>>>>>>>> > >>>>>>>>>>>>>> 2023年12月18日 15:34,Becket Qin <[email protected] > >>>>>>>>>>>>>> <http://gmail.com/>> 写道: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi JIabao, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks for updating the FLIP. Maybe I did not explain it > >>>>>>>>>>>>>> clearly > >>>>>>>>>>> enough. > >>>>>>>>>>>>> My > >>>>>>>>>>>>>> point is that given there are various good flavors of behaviors > >>>>>>>>>>> handling > >>>>>>>>>>>>>> filters pushed down, we should not have a common config of > >>>>>>>>>>>>>> "ignore.filter.pushdown", because the behavior is not *common*. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> It looks like the original motivation of this FLIP is just for > >>>>>>> MySql. > >>>>>>>>>>>>> Let's > >>>>>>>>>>>>>> focus on what is the best solution for MySql connector here > >>>>> first. > >>>>>>>>>>> After > >>>>>>>>>>>>>> that, if people think the best behavior for MySql happens to > >>>>>>>>>>>>>> be a > >>>>>>>>>>> common > >>>>>>>>>>>>>> one, we can then discuss whether that is worth being added to > >>>>>>>>>>>>>> the > >>>>>>>>> base > >>>>>>>>>>>>>> implementation of source. For MySQL, if we are going to > >>>>> introduce a > >>>>>>>>>>>>> config > >>>>>>>>>>>>>> to MySql, why not have something like > >>>>>>> "mysql.filter.handling.policy" > >>>>>>>>>>> with > >>>>>>>>>>>>>> value of AUTO / NEVER / ALWAYS? Isn't that better than > >>>>>>>>>>>>>> "ignore.filter.pushdown"? > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > [message truncated...]
