Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

JingsongLee Tue, 03 Sep 2019 20:01:23 -0700

Thank you for your wonderful points.

I like timo's proposal to enrich built-in functions to flexible function
 modules (For example, the financial model is useful to bank system).
 But I agree with bowen, I don't think hive functions deserves be a
 function module. I think all function modules should be flink built-in
 functions. In this way, we can control their standardization, rather
 than some controversial functions like hive.


About Kurt's concern, yes, every addition of flink's function
 changes user behavior. But in the near future, we'll cover all
 of hive's functions (in the white list). So, if the final form does
 not have hive functions. sooner or later, this behavioral change will
 come. So do we need to let users choose?

Back to the goal of hive built-in, I always thought it was just an
 intermediate solution. Do we need to provide hive built-in
 functions mode to users in the future?

Best,
Jingsong Lee


------------------------------------------------------------------
From:Kurt Young <[email protected]>
Send Time:2019年9月4日(星期三) 10:11
To:dev <[email protected]>
Subject:Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is the
same
as Bowen's. But after thinking about it, I'm currently lean to Timo's
suggestion.

The reason is backward compatibility. If we follow Bowen's approach, let's
say we
first find function in Flink's built-in functions, and then hive's
built-in. For example, `foo`
is not supported by Flink, but hive has such built-in function. So user
will have hive's
behavior for function `foo`. And in next release, Flink realize this is a
very popular function
and add it into Flink's built-in functions, but with different behavior as
hive's. So in next
release, the behavior changes.

With Timo's approach, IIUC user have to tell the framework explicitly what
kind of
built-in functions he would like to use. He can just tell framework to
abandon Flink's built-in
functions, and use hive's instead. User can only choose between them, but
not use
them at the same time. I think this approach is more predictable.

Best,
Kurt


On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[email protected]> wrote:

> Hi all,
>
> Thanks for the feedback. Just a kindly reminder that the [Proposal] section
> in the google doc was updated, please take a look first and let me know if
> you have more questions.
>
> On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[email protected]> wrote:
>
> > Hi Timo,
> >
> > Re> 1) We should not have the restriction "hive built-in functions can
> > only
> > > be used when current catalog is hive catalog". Switching a catalog
> > > should only have implications on the cat.db.object resolution but not
> > > functions. It would be quite convinient for users to use Hive built-ins
> > > even if they use a Confluent schema registry or just the in-memory
> > catalog.
> >
> > There might be a misunderstanding here.
> >
> > First of all, Hive built-in functions are not part of Flink built-in
> > functions, they are catalog functions, thus if the current catalog is
> not a
> > HiveCatalog but, say, a schema registry catalog, ambiguous functions
> > reference just shouldn't be resolved to a different catalog.
> >
> > Second, Hive built-in functions can potentially be referenced across
> > catalog, but it doesn't have db namespace and we currently just don't
> have
> > a SQL syntax for it. It can be enabled when such a SQL syntax is defined,
> > e.g. "catalog::function", but it's out of scope of this FLIP.
> >
> > 2) I would propose to have separate concepts for catalog and built-in
> > functions. In particular it would be nice to modularize built-in
> > functions. Some built-in functions are very crucial (like AS, CAST,
> > MINUS), others are more optional but stable (MD5, CONCAT_WS), and maybe
> > we add more experimental functions in the future or function for some
> > special application area (Geo functions, ML functions). A data platform
> > team might not want to make every built-in function available. Or a
> > function module like ML functions is in a different Maven module.
> >
> > I think this is orthogonal to this FLIP, especially we don't have the
> > "external built-in functions" anymore and currently the built-in function
> > category remains untouched.
> >
> > But just to share some thoughts on the proposal, I'm not sure about it:
> > - I don't know if any other databases handle built-in functions like
> that.
> > Maybe you can give some examples? IMHO, built-in functions are system
> info
> > and should be deterministic, not depending on loaded libraries. Geo
> > functions should be either built-in already or just libraries functions,
> > and library functions can be adapted to catalog APIs or of some other
> > syntax to use
> > - I don't know if all use cases stand, and many can be achieved by other
> > approaches too. E.g. experimental functions can be taken good care of by
> > documentations, annotations, etc
> > - the proposal basically introduces some concept like a pluggable
> built-in
> > function catalog, despite the already existing catalog APIs
> > - it brings in even more complicated scenarios to the design. E.g. how do
> > you handle built-in functions in different modules but different names?
> >
> > In short, I'm not sure if it really stands and it looks like an overkill
> > to me. I'd rather not go to that route. Related discussion can be on its
> > own thread.
> >
> > 3) Following the suggestion above, we can have a separate discovery
> > mechanism for built-in functions. Instead of just going through a static
> > list like in BuiltInFunctionDefinitions, a platform team should be able
> > to select function modules like
> > catalogManager.setFunctionModules(CoreFunctions, GeoFunctions,
> > HiveFunctions) or via service discovery;
> >
> > Same as above. I'll leave it to its own thread.
> >
> > re > 3) Dawid and I discussed the resulution order again. I agree with
> > Kurt
> > > that we should unify built-in function (external or internal) under a
> > > common layer. However, the resolution order should be:
> > >   1. built-in functions
> > >   2. temporary functions
> > >   3. regular catalog resolution logic
> > > Otherwise a temporary function could cause clashes with Flink's
> built-in
> > > functions. If you take a look at other vendors, like SQL Server they
> > > also do not allow to overwrite built-in functions.
> >
> > ”I agree with Kurt that we should unify built-in function (external or
> > internal) under a common layer.“ <- I don't think this is what Kurt
> means.
> > Kurt and I are in favor of unifying built-in functions of external
> systems
> > and catalog functions. Did you type a mistake?
> >
> > Besides, I'm not sure about the resolution order you proposed. Temporary
> > functions have a lifespan over a session and are only visible to the
> > session owner, they are unique to each user, and users create them on
> > purpose to be the highest priority in order to overwrite system info
> > (built-in functions in this case).
> >
> > In your case, why would users name a temporary function the same as a
> > built-in function then? Since using that name in ambiguous function
> > reference will always be resolved to built-in functions, creating a
> > same-named temp function would be meaningless in the end.
> >
> >
> > On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[email protected]> wrote:
> >
> >> Hi Jingsong,
> >>
> >> Re> 1.Hive built-in functions is an intermediate solution. So we should
> >> > not introduce interfaces to influence the framework. To make
> >> > Flink itself more powerful, we should implement the functions
> >> > we need to add.
> >>
> >> Yes, please see the doc.
> >>
> >> Re> 2.Non-flink built-in functions are easy for users to change their
> >> > behavior. If we support some flink built-in functions in the
> >> > future but act differently from non-flink built-in, this will lead to
> >> > changes in user behavior.
> >>
> >> There's no such concept as "external built-in functions" any more.
> >> Built-in functions of external systems will be treated as special
> catalog
> >> functions.
> >>
> >> Re> Another question is, does this fallback include all
> >> > hive built-in functions? As far as I know, some hive functions
> >> > have some hacky. If possible, can we start with a white list?
> >> > Once we implement some functions to flink built-in, we can
> >> > also update the whitelist.
> >>
> >> Yes, that's something we thought of too. I don't think it's super
> >> critical to the scope of this FLIP, thus I'd like to leave it to future
> >> efforts as a nice-to-have feature.
> >>
> >>
> >> On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[email protected]> wrote:
> >>
> >>> Hi Kurt,
> >>>
> >>> Re: > What I want to propose is we can merge #3 and #4, make them both
> >>> under
> >>> >"catalog" concept, by extending catalog function to make it have
> >>> ability to
> >>> >have built-in catalog functions. Some benefits I can see from this
> >>> approach:
> >>> >1. We don't have to introduce new concept like external built-in
> >>> functions.
> >>> >Actually I don't see a full story about how to treat a built-in
> >>> functions, and it
> >>> >seems a little bit disrupt with catalog. As a result, you have to make
> >>> some restriction
> >>> >like "hive built-in functions can only be used when current catalog is
> >>> hive catalog".
> >>>
> >>> Yes, I've unified #3 and #4 but it seems I didn't update some part of
> >>> the doc. I've modified those sections, and they are up to date now.
> >>>
> >>> In short, now built-in function of external systems are defined as a
> >>> special kind of catalog function in Flink, and handled by Flink as
> >>> following:
> >>> - An external built-in function must be associated with a catalog for
> >>> the purpose of decoupling flink-table and external systems.
> >>> - It always resides in front of catalog functions in ambiguous function
> >>> reference order, just like in its own external system
> >>> - It is a special catalog function that doesn’t have a schema/database
> >>> namespace
> >>> - It goes thru the same instantiation logic as other user defined
> >>> catalog functions in the external system
> >>>
> >>> Please take another look at the doc, and let me know if you have more
> >>> questions.
> >>>
> >>>
> >>> On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[email protected]>
> wrote:
> >>>
> >>>> Hi Kurt,
> >>>>
> >>>> it should not affect the functions and operations we currently have in
> >>>> SQL. It just categorizes the available built-in functions. It is kind
> >>>> of
> >>>> an orthogonal concept to the catalog API but built-in functions
> deserve
> >>>> this special kind of treatment. CatalogFunction still fits perfectly
> in
> >>>> there because the regular catalog object resolution logic is not
> >>>> affected. So tables and functions are resolved in the same way but
> with
> >>>> built-in functions that have priority as in the original design.
> >>>>
> >>>> Regards,
> >>>> Timo
> >>>>
> >>>>
> >>>> On 03.09.19 15:26, Kurt Young wrote:
> >>>> > Does this only affect the functions and operations we currently have
> >>>> in SQL
> >>>> > and
> >>>> > have no effect on tables, right? Looks like this is an orthogonal
> >>>> concept
> >>>> > with Catalog?
> >>>> > If the answer are both yes, then the catalog function will be a
> weird
> >>>> > concept?
> >>>> >
> >>>> > Best,
> >>>> > Kurt
> >>>> >
> >>>> >
> >>>> > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <[email protected]>
> >>>> wrote:
> >>>> >
> >>>> >> The way you proposed are basically the same as what Calcite does, I
> >>>> think
> >>>> >> we are in the same line.
> >>>> >>
> >>>> >> Best,
> >>>> >> Danny Chan
> >>>> >> 在 2019年9月3日 +0800 PM7:57，Timo Walther <[email protected]>，写道：
> >>>> >>> This sounds exactly as the module approach I mentioned, no?
> >>>> >>>
> >>>> >>> Regards,
> >>>> >>> Timo
> >>>> >>>
> >>>> >>> On 03.09.19 13:42, Danny Chan wrote:
> >>>> >>>> Thanks Bowen for bring up this topic, I think it’s a useful
> >>>> >> refactoring to make our function usage more user friendly.
> >>>> >>>> For the topic of how to organize the builtin operators and
> >>>> operators
> >>>> >> of Hive, here is a solution from Apache Calcite, the Calcite way is
> >>>> to make
> >>>> >> every dialect operators a “Library”, user can specify which
> >>>> libraries they
> >>>> >> want to use for a sql query. The builtin operators always comes as
> >>>> the
> >>>> >> first class objects and the others are used from the order they
> >>>> appears.
> >>>> >> Maybe you can take a reference.
> >>>> >>>> [1]
> >>>> >>
> >>>>
> https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28
> >>>> >>>> Best,
> >>>> >>>> Danny Chan
> >>>> >>>> 在 2019年8月28日 +0800 AM2:50，Bowen Li <[email protected]>，写道：
> >>>> >>>>> Hi folks,
> >>>> >>>>>
> >>>> >>>>> I'd like to kick off a discussion on reworking Flink's
> >>>> >> FunctionCatalog.
> >>>> >>>>> It's critically helpful to improve function usability in SQL.
> >>>> >>>>>
> >>>> >>>>>
> >>>> >>
> >>>>
> https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
> >>>> >>>>> In short, it:
> >>>> >>>>> - adds support for precise function reference with
> fully/partially
> >>>> >>>>> qualified name
> >>>> >>>>> - redefines function resolution order for ambiguous function
> >>>> >> reference
> >>>> >>>>> - adds support for Hive's rich built-in functions (support for
> >>>> Hive
> >>>> >> user
> >>>> >>>>> defined functions was already added in 1.9.0)
> >>>> >>>>> - clarifies the concept of temporary functions
> >>>> >>>>>
> >>>> >>>>> Would love to hear your thoughts.
> >>>> >>>>>
> >>>> >>>>> Bowen
> >>>> >>>
> >>>>
> >>>>
>

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Reply via email to