Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Bowen Li Tue, 03 Sep 2019 17:00:20 -0700

Hi all,

Thanks for the feedback. Just a kindly reminder that the [Proposal] section
in the google doc was updated, please take a look first and let me know if
you have more questions.


On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[email protected]> wrote:

> Hi Timo,
>
> Re> 1) We should not have the restriction "hive built-in functions can
> only
> > be used when current catalog is hive catalog". Switching a catalog
> > should only have implications on the cat.db.object resolution but not
> > functions. It would be quite convinient for users to use Hive built-ins
> > even if they use a Confluent schema registry or just the in-memory
> catalog.
>
> There might be a misunderstanding here.
>
> First of all, Hive built-in functions are not part of Flink built-in
> functions, they are catalog functions, thus if the current catalog is not a
> HiveCatalog but, say, a schema registry catalog, ambiguous functions
> reference just shouldn't be resolved to a different catalog.
>
> Second, Hive built-in functions can potentially be referenced across
> catalog, but it doesn't have db namespace and we currently just don't have
> a SQL syntax for it. It can be enabled when such a SQL syntax is defined,
> e.g. "catalog::function", but it's out of scope of this FLIP.
>
> 2) I would propose to have separate concepts for catalog and built-in
> functions. In particular it would be nice to modularize built-in
> functions. Some built-in functions are very crucial (like AS, CAST,
> MINUS), others are more optional but stable (MD5, CONCAT_WS), and maybe
> we add more experimental functions in the future or function for some
> special application area (Geo functions, ML functions). A data platform
> team might not want to make every built-in function available. Or a
> function module like ML functions is in a different Maven module.
>
> I think this is orthogonal to this FLIP, especially we don't have the
> "external built-in functions" anymore and currently the built-in function
> category remains untouched.
>
> But just to share some thoughts on the proposal, I'm not sure about it:
> - I don't know if any other databases handle built-in functions like that.
> Maybe you can give some examples? IMHO, built-in functions are system info
> and should be deterministic, not depending on loaded libraries. Geo
> functions should be either built-in already or just libraries functions,
> and library functions can be adapted to catalog APIs or of some other
> syntax to use
> - I don't know if all use cases stand, and many can be achieved by other
> approaches too. E.g. experimental functions can be taken good care of by
> documentations, annotations, etc
> - the proposal basically introduces some concept like a pluggable built-in
> function catalog, despite the already existing catalog APIs
> - it brings in even more complicated scenarios to the design. E.g. how do
> you handle built-in functions in different modules but different names?
>
> In short, I'm not sure if it really stands and it looks like an overkill
> to me. I'd rather not go to that route. Related discussion can be on its
> own thread.
>
> 3) Following the suggestion above, we can have a separate discovery
> mechanism for built-in functions. Instead of just going through a static
> list like in BuiltInFunctionDefinitions, a platform team should be able
> to select function modules like
> catalogManager.setFunctionModules(CoreFunctions, GeoFunctions,
> HiveFunctions) or via service discovery;
>
> Same as above. I'll leave it to its own thread.
>
> re > 3) Dawid and I discussed the resulution order again. I agree with
> Kurt
> > that we should unify built-in function (external or internal) under a
> > common layer. However, the resolution order should be:
> >   1. built-in functions
> >   2. temporary functions
> >   3. regular catalog resolution logic
> > Otherwise a temporary function could cause clashes with Flink's built-in
> > functions. If you take a look at other vendors, like SQL Server they
> > also do not allow to overwrite built-in functions.
>
> ”I agree with Kurt that we should unify built-in function (external or
> internal) under a common layer.“ <- I don't think this is what Kurt means.
> Kurt and I are in favor of unifying built-in functions of external systems
> and catalog functions. Did you type a mistake?
>
> Besides, I'm not sure about the resolution order you proposed. Temporary
> functions have a lifespan over a session and are only visible to the
> session owner, they are unique to each user, and users create them on
> purpose to be the highest priority in order to overwrite system info
> (built-in functions in this case).
>
> In your case, why would users name a temporary function the same as a
> built-in function then? Since using that name in ambiguous function
> reference will always be resolved to built-in functions, creating a
> same-named temp function would be meaningless in the end.
>
>
> On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[email protected]> wrote:
>
>> Hi Jingsong,
>>
>> Re> 1.Hive built-in functions is an intermediate solution. So we should
>> > not introduce interfaces to influence the framework. To make
>> > Flink itself more powerful, we should implement the functions
>> > we need to add.
>>
>> Yes, please see the doc.
>>
>> Re> 2.Non-flink built-in functions are easy for users to change their
>> > behavior. If we support some flink built-in functions in the
>> > future but act differently from non-flink built-in, this will lead to
>> > changes in user behavior.
>>
>> There's no such concept as "external built-in functions" any more.
>> Built-in functions of external systems will be treated as special catalog
>> functions.
>>
>> Re> Another question is, does this fallback include all
>> > hive built-in functions? As far as I know, some hive functions
>> > have some hacky. If possible, can we start with a white list?
>> > Once we implement some functions to flink built-in, we can
>> > also update the whitelist.
>>
>> Yes, that's something we thought of too. I don't think it's super
>> critical to the scope of this FLIP, thus I'd like to leave it to future
>> efforts as a nice-to-have feature.
>>
>>
>> On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[email protected]> wrote:
>>
>>> Hi Kurt,
>>>
>>> Re: > What I want to propose is we can merge #3 and #4, make them both
>>> under
>>> >"catalog" concept, by extending catalog function to make it have
>>> ability to
>>> >have built-in catalog functions. Some benefits I can see from this
>>> approach:
>>> >1. We don't have to introduce new concept like external built-in
>>> functions.
>>> >Actually I don't see a full story about how to treat a built-in
>>> functions, and it
>>> >seems a little bit disrupt with catalog. As a result, you have to make
>>> some restriction
>>> >like "hive built-in functions can only be used when current catalog is
>>> hive catalog".
>>>
>>> Yes, I've unified #3 and #4 but it seems I didn't update some part of
>>> the doc. I've modified those sections, and they are up to date now.
>>>
>>> In short, now built-in function of external systems are defined as a
>>> special kind of catalog function in Flink, and handled by Flink as
>>> following:
>>> - An external built-in function must be associated with a catalog for
>>> the purpose of decoupling flink-table and external systems.
>>> - It always resides in front of catalog functions in ambiguous function
>>> reference order, just like in its own external system
>>> - It is a special catalog function that doesn’t have a schema/database
>>> namespace
>>> - It goes thru the same instantiation logic as other user defined
>>> catalog functions in the external system
>>>
>>> Please take another look at the doc, and let me know if you have more
>>> questions.
>>>
>>>
>>> On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[email protected]> wrote:
>>>
>>>> Hi Kurt,
>>>>
>>>> it should not affect the functions and operations we currently have in
>>>> SQL. It just categorizes the available built-in functions. It is kind
>>>> of
>>>> an orthogonal concept to the catalog API but built-in functions deserve
>>>> this special kind of treatment. CatalogFunction still fits perfectly in
>>>> there because the regular catalog object resolution logic is not
>>>> affected. So tables and functions are resolved in the same way but with
>>>> built-in functions that have priority as in the original design.
>>>>
>>>> Regards,
>>>> Timo
>>>>
>>>>
>>>> On 03.09.19 15:26, Kurt Young wrote:
>>>> > Does this only affect the functions and operations we currently have
>>>> in SQL
>>>> > and
>>>> > have no effect on tables, right? Looks like this is an orthogonal
>>>> concept
>>>> > with Catalog?
>>>> > If the answer are both yes, then the catalog function will be a weird
>>>> > concept?
>>>> >
>>>> > Best,
>>>> > Kurt
>>>> >
>>>> >
>>>> > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <[email protected]>
>>>> wrote:
>>>> >
>>>> >> The way you proposed are basically the same as what Calcite does, I
>>>> think
>>>> >> we are in the same line.
>>>> >>
>>>> >> Best,
>>>> >> Danny Chan
>>>> >> 在 2019年9月3日 +0800 PM7:57，Timo Walther <[email protected]>，写道：
>>>> >>> This sounds exactly as the module approach I mentioned, no?
>>>> >>>
>>>> >>> Regards,
>>>> >>> Timo
>>>> >>>
>>>> >>> On 03.09.19 13:42, Danny Chan wrote:
>>>> >>>> Thanks Bowen for bring up this topic, I think it’s a useful
>>>> >> refactoring to make our function usage more user friendly.
>>>> >>>> For the topic of how to organize the builtin operators and
>>>> operators
>>>> >> of Hive, here is a solution from Apache Calcite, the Calcite way is
>>>> to make
>>>> >> every dialect operators a “Library”, user can specify which
>>>> libraries they
>>>> >> want to use for a sql query. The builtin operators always comes as
>>>> the
>>>> >> first class objects and the others are used from the order they
>>>> appears.
>>>> >> Maybe you can take a reference.
>>>> >>>> [1]
>>>> >>
>>>> https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28
>>>> >>>> Best,
>>>> >>>> Danny Chan
>>>> >>>> 在 2019年8月28日 +0800 AM2:50，Bowen Li <[email protected]>，写道：
>>>> >>>>> Hi folks,
>>>> >>>>>
>>>> >>>>> I'd like to kick off a discussion on reworking Flink's
>>>> >> FunctionCatalog.
>>>> >>>>> It's critically helpful to improve function usability in SQL.
>>>> >>>>>
>>>> >>>>>
>>>> >>
>>>> https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
>>>> >>>>> In short, it:
>>>> >>>>> - adds support for precise function reference with fully/partially
>>>> >>>>> qualified name
>>>> >>>>> - redefines function resolution order for ambiguous function
>>>> >> reference
>>>> >>>>> - adds support for Hive's rich built-in functions (support for
>>>> Hive
>>>> >> user
>>>> >>>>> defined functions was already added in 1.9.0)
>>>> >>>>> - clarifies the concept of temporary functions
>>>> >>>>>
>>>> >>>>> Would love to hear your thoughts.
>>>> >>>>>
>>>> >>>>> Bowen
>>>> >>>
>>>>
>>>>

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Reply via email to